Matt: Thanks for the update. The Cacheable interface is defined in src/main/java/org/apache/hadoop/hbase/io/hfile/Cacheable.java, and you can find the implementing class in src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java. I will browse your code later.
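For reference, the interface is small. The sketch below paraphrases its shape only; the names here are illustrative, so consult Cacheable.java itself for the exact signatures:

import java.nio.ByteBuffer;

// Paraphrase of the Cacheable idea: a cached block knows its own
// serialized length, can write itself into a buffer, and can report its
// heap footprint for cache accounting. Illustrative only; see
// Cacheable.java for the real interface.
public interface CacheableSketch {
    int getSerializedLength();               // bytes needed to serialize the block
    void serialize(ByteBuffer destination);  // write the block into the buffer
    long heapSize();                         // on-heap size, for cache accounting
}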
On Tue, Sep 13, 2011 at 12:44 AM, Matt Corgan <[email protected]> wrote:

> Hi devs,
>
> I put a "developer preview" of a prefix compression algorithm on github.
> It still needs some details worked out, a full set of iterators, about
> 200 optimizations, and a bunch of other stuff... but it successfully
> passes some preliminary tests, so I thought I'd get it in front of more
> eyeballs sooner than later.
>
> https://github.com/hotpads/hbase-prefix-trie
>
> It depends on HBase's Bytes.java and KeyValue.java, which in turn depend
> on hadoop. Those jars are in there, along with some others for full
> HFile testing in the near future.
>
> A good place to start looking at the code is
> org.apache.hadoop.hbase.keyvalue.trie.builder.KeyValuePtBuilder. It's
> the main driver of the compaction side of things, taking KeyValues (in
> sorted order) and generating a byte[] to be saved as a disk block. Then,
> for reading it back in, there is trie.compact.read.PtIterator, which
> takes the byte[] and spits out KeyValues. The main test class is
> trie.test.row.RowBuilderTests, which round-trips a bunch of KeyValues to
> make sure the outputs are the same as the inputs.
> trie.compact.row.node.PtRowNode is the format of a single trie node in
> the underlying byte[].
>
> The current version generates a lot of little objects (garbage) while
> writing and reading. I plan to optimize it to the point where most
> variables are primitives on the stack (at least when reading), but I
> think these intermediate classes are important for debugging. I'll
> probably try to keep them around going forward and develop a more
> compact version in parallel.
>
> It uses trie-style compression for the row keys and column qualifiers,
> where pointers between nodes are compacted ints. It keeps a list of
> compacted, de-duped deltas for timestamps, and if they're all the same,
> it stores only one (long) per block. If all KeyValue.TYPE operations are
> the same, then it only stores one (byte) per block.
>
> It's designed around efficient cpu cache usage and elimination of 8-byte
> pointers, so it should be fast. Get calls can traverse the trie nodes to
> dig up a row key while barely loading anything from memory into cache,
> as opposed to current hbase, which may load the better part of a block
> into cache. Scanners/Filters/Comparators can all be designed to be
> trie-aware so they can iterate through 20 columnQualifiers in the same
> row without constantly re-scanning the rowKey bytes... etc...
>
>
> Here are a few questions we'll need to answer at some point:
>
> * where should this code go?
>   - i'd vote for keeping it isolated and limiting references back to the
>     main hbase project, sort of like the gzip/lzo algorithms.
>   - if it's strictly isolated, it'll be easier to keep it well tested
>     for correctness/performance and let more people tinker with it to
>     make it faster. it'll also force us to come up with the right
>     interfaces to allow other compression implementations.
>
> * the current HFileWriter sends KeyValues to the OutputStream as soon as
>   they're processed, but would it be ok to queue up a whole block in
>   memory and write it all at once?
>   - i'd vote for yes. it makes it easier to arrange the data to be more
>     read-friendly.
>     also, we're only talking about one block of data, which is
>     presumably a fraction of the block cache's size.
>
> * should the block bytes be accessed as a byte[] or ByteBuffer? i know
>   there's been work on off-heap cache, but i've read the blocks are
>   pulled on-heap before they're parsed (??)
>   - see org.apache.hadoop.hbase.keyvalue.trie.profile.MemoryAccessProfiler
>     for a comparison of byte[] vs ByteBuffer speed tests. ByteBuffer
>     looks ~2-10x slower, but some people need the off-heap cache.
>   - i'll propose maintaining separate reading algorithms for each, given
>     that the underlying bytes are in the exact same format. it should be
>     possible to copy/paste the code and replace bytes[i] with
>     ByteBuffer.get(i), and create parallel versions of static util
>     methods.
>
> * each block has some metadata wrapped in a class called PtBlockMeta.
>   does HBase currently have a way to store its values as java primitives
>   on the heap rather than parsing them out of the byte[]/ByteBuffer on
>   every access? if they have to be parsed out on every block access,
>   that could take more time than the Get query it's trying to perform.
>   - I know there's a new Cacheable interface or something like that.
>     maybe that already supports it or could be enhanced.
>
>
> What jiras do you think i should make?
>
> Look forward to hearing people's feedback,
> Matt
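A few sketches to make the ideas in Matt's mail concrete. First, the prefix-compression win on row keys: with sorted keys, a shared prefix can be stored once per block instead of once per KeyValue. This is a back-of-the-envelope illustration, not the PtRowNode layout:

// Toy illustration of prefix sharing: three sorted row keys that share
// the prefix "user123/" need those prefix bytes only once when stored
// trie-style. Key values here are invented for the example.
public class PrefixSharingSketch {
    public static void main(String[] args) {
        byte[][] rowKeys = {
            "user123/blog".getBytes(),
            "user123/photo".getBytes(),
            "user123/profile".getBytes(),
        };

        // for sorted keys, the common prefix of the first and last key
        // is the common prefix of all of them
        int lcp = commonPrefix(rowKeys[0], rowKeys[rowKeys.length - 1]);

        int flat = 0;
        for (byte[] key : rowKeys) {
            flat += key.length;             // every key stored in full
        }
        int shared = lcp;                   // prefix stored once...
        for (byte[] key : rowKeys) {
            shared += key.length - lcp;     // ...plus each key's suffix
        }
        System.out.println("flat=" + flat + " bytes, trie-style=" + shared + " bytes");
    }

    static int commonPrefix(byte[] a, byte[] b) {
        int i = 0;
        while (i < a.length && i < b.length && a[i] == b[i]) i++;
        return i;
    }
}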
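Second, the timestamp handling: de-dupe the block's timestamps and store deltas from the minimum, collapsing to a single long when the whole block shares one timestamp. Names and layout here are illustrative, not the hbase-prefix-trie format:

import java.util.Arrays;
import java.util.TreeSet;

// Sketch of the timestamp trick: de-duped deltas from the block minimum,
// or one long for the whole block when all timestamps are equal.
public class TimestampDeltaSketch {
    public static void main(String[] args) {
        long[] timestamps = {1315900800000L, 1315900800123L, 1315900800000L};

        // de-duped, sorted set of distinct timestamps in the block
        TreeSet<Long> unique = new TreeSet<Long>();
        for (long ts : timestamps) {
            unique.add(ts);
        }

        if (unique.size() == 1) {
            // whole block shares one timestamp: store a single long
            System.out.println("store single long: " + unique.first());
        } else {
            // otherwise store the minimum plus compact per-entry deltas
            long min = unique.first();
            long[] deltas = new long[unique.size()];
            int i = 0;
            for (long ts : unique) {
                deltas[i++] = ts - min;   // small values compact well
            }
            System.out.println("min=" + min + " deltas=" + Arrays.toString(deltas));
        }
    }
}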
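On the byte[]-vs-ByteBuffer question, the "parallel reading algorithms" idea amounts to maintaining two copies of the same decode logic over an identical block format, one indexing a byte[] and one calling ByteBuffer.get(i). A minimal sketch:

import java.nio.ByteBuffer;

// Two copies of the same big-endian int decode: the byte[] version and a
// copy/paste with bytes[i] replaced by buffer.get(i), as proposed above.
public class ParallelReadSketch {
    // read a big-endian 4-byte int from a byte[]
    public static int readInt(byte[] bytes, int offset) {
        return ((bytes[offset]     & 0xff) << 24)
             | ((bytes[offset + 1] & 0xff) << 16)
             | ((bytes[offset + 2] & 0xff) <<  8)
             |  (bytes[offset + 3] & 0xff);
    }

    // same logic against a ByteBuffer, using absolute get(i)
    public static int readInt(ByteBuffer buffer, int offset) {
        return ((buffer.get(offset)     & 0xff) << 24)
             | ((buffer.get(offset + 1) & 0xff) << 16)
             | ((buffer.get(offset + 2) & 0xff) <<  8)
             |  (buffer.get(offset + 3) & 0xff);
    }

    public static void main(String[] args) {
        byte[] block = {0, 0, 1, 42};
        ByteBuffer wrapped = ByteBuffer.wrap(block);
        // both paths decode the same bytes to the same value (298)
        System.out.println(readInt(block, 0) == readInt(wrapped, 0));
    }
}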
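And on the PtBlockMeta question, the parse-once alternative is to decode the header fields into primitive fields when the block is first loaded, so repeated Gets read plain ints instead of re-parsing the byte[]/ByteBuffer. The field layout below is invented purely for illustration; see PtBlockMeta for the real one:

import java.nio.ByteBuffer;

// Sketch of parse-once block metadata: header fields become primitives
// on the heap at load time, so per-access parsing cost disappears.
public class BlockMetaSketch {
    private final int numKeyValues;   // parsed once, then read as a primitive
    private final int rowTrieOffset;  // invented field, for illustration

    public BlockMetaSketch(ByteBuffer block) {
        this.numKeyValues = block.getInt(0);
        this.rowTrieOffset = block.getInt(4);
    }

    public int getNumKeyValues() { return numKeyValues; }    // no parsing per access
    public int getRowTrieOffset() { return rowTrieOffset; }

    public static void main(String[] args) {
        ByteBuffer block = ByteBuffer.allocate(8);
        block.putInt(0, 2);    // pretend: 2 KeyValues in the block
        block.putInt(4, 64);   // pretend: row trie starts at offset 64
        BlockMetaSketch meta = new BlockMetaSketch(block);
        System.out.println(meta.getNumKeyValues() + " " + meta.getRowTrieOffset());
    }
}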
