Matt:
Thanks for the update.
The Cacheable interface is defined in:
src/main/java/org/apache/hadoop/hbase/io/hfile/Cacheable.java

You can find the implementation at:
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
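
From memory, the interface is roughly shaped like the following (paraphrased;
check the file above for the exact signatures):

  // Paraphrased from memory -- see Cacheable.java above for the authoritative version.
  public interface Cacheable extends HeapSize {
    // length in bytes needed to serialize this block
    int getSerializedLength();

    // write the block into the destination buffer (used by the block cache)
    void serialize(ByteBuffer destination);

    // deserializer that can rebuild the block from a ByteBuffer
    CacheableDeserializer<Cacheable> getDeserializer();
  }

HFileBlock implements it, so blocks can go into and come out of the block
cache in serialized form.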

I will browse your code later.

On Tue, Sep 13, 2011 at 12:44 AM, Matt Corgan <[email protected]> wrote:

> Hi devs,
>
> I put a "developer preview" of a prefix compression algorithm on GitHub.
> It still needs some details worked out, a full set of iterators, about 200
> optimizations, and a bunch of other stuff... but it successfully passes
> some preliminary tests, so I thought I'd get it in front of more eyeballs
> sooner rather than later.
>
> https://github.com/hotpads/hbase-prefix-trie
>
> It depends on HBase's Bytes.java and KeyValue.java, which in turn depend on
> Hadoop.  Those jars are in there, along with some others for full HFile
> testing in the near future.
>
> A good place to start looking at the code
> is org.apache.hadoop.hbase.keyvalue.trie.builder.KeyValuePtBuilder.  It's
> the main driver of the compaction side of things, taking KeyValues (in
> sorted order) and generating a byte[] to be saved as a disk block.  Then for
> reading it back in, there is trie.compact.read.PtIterator, which takes the
> byte[] and spits out KeyValues.  The main test class is
> trie.test.row.RowBuilderTests, which round-trips a bunch of KeyValues to
> make sure the outputs are the same as the inputs.
> trie.compact.row.node.PtRowNode is the format of a single trie node in the
> underlying byte[].
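>
> In rough pseudo-Java, the round trip looks something like the following
> (method names here are illustrative, not the actual API in the repo):
>
>   // illustrative only -- see KeyValuePtBuilder / PtIterator for the real methods
>   KeyValuePtBuilder builder = new KeyValuePtBuilder();
>   for (KeyValue kv : sortedKeyValues) {    // input must already be sorted
>     builder.add(kv);
>   }
>   byte[] block = builder.buildBlockBytes();  // compacted trie bytes for one disk block
>
>   PtIterator iterator = new PtIterator(block);
>   while (iterator.hasNext()) {
>     KeyValue kv = iterator.next();           // should match the corresponding input
>   }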
>
> The current version generates a lot of little objects (garbage) while
> writing and reading.  I plan to optimize it to the point where most
> variables are primitives on the stack (at least when reading), but I think
> these intermediate classes are important for debugging.  I'll probably try
> to keep them around going forward and develop a more compact version in
> parallel.
>
> It uses trie-style compression for the row keys and column qualifiers, where
> pointers between nodes are compacted ints.  It keeps a list of compacted,
> de-duped deltas for the timestamps, and if they're all the same, it stores
> only one (a long) per block.  If all KeyValue.TYPE operations are the same,
> then it only stores one (a byte) per block.
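>
> As a simplified illustration of the timestamp handling (not the actual
> classes in the repo):
>
>   // simplified illustration of the timestamp de-duping idea
>   long minTimestamp = Long.MAX_VALUE;
>   java.util.SortedSet<Long> distinctTimestamps = new java.util.TreeSet<Long>();
>   for (KeyValue kv : keyValues) {
>     minTimestamp = Math.min(minTimestamp, kv.getTimestamp());
>     distinctTimestamps.add(kv.getTimestamp());
>   }
>   if (distinctTimestamps.size() == 1) {
>     // the whole block stores a single long
>   } else {
>     // store each delta (timestamp - minTimestamp) once, de-duped,
>     // and each KeyValue keeps only a small index into that table
>   }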
>
> It's designed around efficient CPU cache usage and the elimination of 8-byte
> pointers, so it should be fast.  Get calls can traverse the trie nodes to dig
> up a row key while barely loading anything from memory into cache, as opposed
> to current HBase, which may load the better part of a block into cache.
> Scanners/Filters/Comparators can all be designed to be trie-aware so they
> can iterate through 20 column qualifiers in the same row without constantly
> re-scanning the rowKey bytes... etc...
>
>
> Here are a few questions we'll need to answer at some point:
>
> * Where should this code go?
>    - I'd vote for keeping it isolated and limiting references back to the
> main HBase project, sort of like the gzip/LZO algorithms.
>    - If it's strictly isolated, it'll be easier to keep it well tested for
> correctness/performance and to let more people tinker with it to make it
> faster.  It'll also force us to come up with the right interfaces to allow
> other compression implementations.
>
> * The current HFileWriter sends KeyValues to the OutputStream as soon as
> they're processed, but would it be OK to queue up a whole block in memory and
> write it all at once?
>    - I'd vote for yes.  It makes it easier to arrange the data to be more
> read-friendly, and we're only talking about one block of data, which is
> presumably a fraction of the block cache's size.  (A rough sketch of this
> buffering is below, after these questions.)
>
> * Should the block bytes be accessed as a byte[] or a ByteBuffer?  I know
> there's been work on an off-heap cache, but I've read that the blocks are
> pulled on-heap before they're parsed (??)
>    - See org.apache.hadoop.hbase.keyvalue.trie.profile.MemoryAccessProfiler
> for a comparison of byte[] vs ByteBuffer speed tests.  ByteBuffer looks
> ~2-10x slower, but some people need the off-heap cache.
>    - I'll propose maintaining separate reading algorithms for each, given
> that the underlying bytes are in the exact same format.  It should be
> possible to copy/paste the code and replace bytes[i] with ByteBuffer.get(i),
> and to create parallel versions of static util methods (sketched below,
> after these questions).
>
> * Each block has some metadata wrapped in a class called PtBlockMeta.  Does
> HBase currently have a way to store those values as Java primitives on the
> heap rather than parsing them out of the byte[]/ByteBuffer on every access?
> If they have to be parsed out on every block access, that could take more
> time than the Get query it's trying to perform.  (See the last sketch below
> for what I mean.)
>    - I know there's a new Cacheable interface or something like that.  Maybe
> that already supports it or could be enhanced.
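>
> Rough sketch of the block-buffering idea from the HFileWriter question above
> (names and sizes here are just illustrative, not an actual implementation):
>
>   // Illustrative only: queue one block's worth of KeyValues, then write once.
>   class BufferedBlockWriter {
>     private final java.io.OutputStream out;
>     private final int blockSizeLimit = 64 * 1024;   // e.g. the configured block size
>     private final java.util.List<KeyValue> pending = new java.util.ArrayList<KeyValue>();
>     private int pendingBytes = 0;
>
>     BufferedBlockWriter(java.io.OutputStream out) { this.out = out; }
>
>     void append(KeyValue kv) throws java.io.IOException {
>       pending.add(kv);
>       pendingBytes += kv.getLength();
>       if (pendingBytes < blockSizeLimit) return;
>       // here a builder (e.g. the prefix trie) can re-arrange the whole block
>       // into a read-friendly layout before the single write
>       java.io.ByteArrayOutputStream block = new java.io.ByteArrayOutputStream(pendingBytes);
>       for (KeyValue queued : pending) {
>         block.write(queued.getBuffer(), queued.getOffset(), queued.getLength());
>       }
>       out.write(block.toByteArray());   // one write per block, not one per KeyValue
>       pending.clear();
>       pendingBytes = 0;
>     }
>   }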
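>
> Sketch of what I mean by parallel byte[] / ByteBuffer versions of the static
> util methods (the class name and varint format are made up for illustration):
>
>   import java.nio.ByteBuffer;
>
>   // same decoding logic twice, once per backing type
>   public class BlockByteTool {
>     // read an unsigned little-endian varint from a byte[]
>     public static int getVInt(byte[] bytes, int offset) {
>       int value = 0;
>       for (int i = 0;; ++i) {
>         byte b = bytes[offset + i];
>         value |= (b & 0x7f) << (7 * i);
>         if ((b & 0x80) == 0) return value;
>       }
>     }
>
>     // copy/pasted with bytes[i] replaced by buffer.get(i)
>     public static int getVInt(ByteBuffer buffer, int offset) {
>       int value = 0;
>       for (int i = 0;; ++i) {
>         byte b = buffer.get(offset + i);
>         value |= (b & 0x7f) << (7 * i);
>         if ((b & 0x80) == 0) return value;
>       }
>     }
>   }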
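>
> And for the PtBlockMeta question, what I mean by keeping the parsed values
> on-heap is roughly this (the field names and offsets are made up):
>
>   import java.nio.ByteBuffer;
>
>   // parse the block's meta fields once, then hand out primitives on every access
>   public class CachedBlockMeta {   // stand-in for PtBlockMeta
>     private final int numKeyValues;
>     private final long minTimestamp;
>     private final int rowTrieOffset;
>
>     public CachedBlockMeta(ByteBuffer block) {
>       // parsed once, when the block is loaded or cached
>       this.numKeyValues = block.getInt(0);
>       this.minTimestamp = block.getLong(4);
>       this.rowTrieOffset = block.getInt(12);
>     }
>
>     // every Get/Scan then reads plain primitives instead of re-parsing the buffer
>     public int getNumKeyValues() { return numKeyValues; }
>     public long getMinTimestamp() { return minTimestamp; }
>     public int getRowTrieOffset() { return rowTrieOffset; }
>   }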
>
>
> What JIRAs do you think I should create?
>
> Looking forward to hearing people's feedback,
> Matt
>
