On Fri, Sep 16, 2011 at 7:29 PM, Matt Corgan <mcor...@hotpads.com> wrote:
> Ryan - thanks for the feedback. The situation I'm thinking of where it's useful to parse DirectBB without copying to heap is when you are serving small random values out of the block cache. At HotPads, we'd like to store hundreds of GB of real estate listing data in memory so it can be quickly served up at random. We want to access many small values that are already in memory, so basically skipping step 1 of 3 because the values are already in memory. That being said, the DirectBB are not essential for us since we haven't run into gb problems; I just figured it would be nice to support them since they seem to be important to other people.
>
> My motivation for doing this is to make hbase a viable candidate for a large, auto-partitioned, sorted, *in-memory* database. Not the usual analytics use case, but I think hbase would be great for this.
What exactly about the current system makes it not a viable candidate?
>
> On Fri, Sep 16, 2011 at 7:08 PM, Ryan Rawson <ryano...@gmail.com> wrote:
>
>> On Fri, Sep 16, 2011 at 6:47 PM, Matt Corgan <mcor...@hotpads.com> wrote:
>> > I'm a little confused over the direction of the DBBs in general, hence the lack of clarity in my code.
>> >
>> > I see value in doing fine-grained parsing of the DBB if you're going to have a large block of data and only want to retrieve a small KV from the middle of it. With this trie design, you can navigate your way through the DBB while copying hardly anything to the heap. It would be a shame to blow away your entire L1 cache by loading a whole 256KB block onto the heap if you only want to read 200 bytes out of the middle... it can be done ultra-efficiently.
>>
>> This paragraph is not factually correct. The DirectByteBuffer vs main heap has nothing to do with the CPU cache. Consider the following scenario:
>>
>> - read block from DFS
>> - scan block in ram
>> - prepare result set for client
>>
>> Pretty simple, we have a choice in step 1:
>> - write to java heap
>> - write to DirectByteBuffer off-heap controlled memory
>>
>> In either case, you are copying to memory, and therefore cycling thru the cpu cache (of course). The difference is whether the Java GC has to deal with the aftermath or not.
>>
>> So the question "DBB or not" is not one about CPU caches, but one about garbage collection. Of course, nothing is free, and dealing with DBB requires extensive in-situ bounds checking (look at the source code for that class!), and also requires manual memory management on behalf of the programmer. So you are faced with an expensive API (getByte is not as good as an array get), and a lot more homework to do. I have decided it's not worth it personally and am not chasing that line as a potential performance improvement, and I would encourage you not to either.
>>
>> Ultimately the DFS speed issues need to be solved by the DFS - HDFS needs more work, but alternatives are already there and are a lot faster.
>>
>> > The problem is if you're going to iterate through an entire block made of 5000 small KV's doing thousands of DBB.get(index) calls. Those are like 10x slower than byte[index] calls. In that case, if it's a DBB, you want to copy the full block on-heap and access it through the byte[] interface. If it's a HeapBB, then you already have access to the underlying byte[].
>>
>> Yes, this is the issue - you have to take an extra copy one way or another. Doing effective prefix compression with DBB is not really feasible imo, and that's another reason why I have given up on DBBs.
>>
>> > So there's possibly value in implementing both methods. The main problem I see is a lack of interfaces in the current code base. I'll throw one suggestion out there as food for thought. Create a new interface:
>> >
>> > interface HCell {
>> >   byte[] getRow();
>> >   byte[] getFamily();
>> >   byte[] getQualifier();
>> >   long getTimestamp();
>> >   byte getType();
>> >   byte[] getValue();
>> >
>> >   //plus an endless list of convenience methods:
>> >   int getKeyLength();
>> >   KeyValue getKeyValue();
>> >   boolean isDelete();
>> >   //etc, etc (or put these in sub-interface)
>> > }
>> >
>> > We could start by making KeyValue implement that interface and then slowly change pieces of the code base to use HCell.
>> > That will allow us to start elegantly working in different implementations. PtKeyValue <https://github.com/hotpads/hbase-prefix-trie/blob/master/src/org/apache/hadoop/hbase/keyvalue/trie/compact/read/PtKeyValue.java> would be one of them. During the transition, you can always call PtKeyValue.getCopiedKeyValue() which will instantiate a new byte[] in the traditional KeyValue format.
>>
>> I am not really super keen here, and while the interface of course makes plenty of sense, the issue is that you will need to turn an array of KeyValues (aka a Result instance) into a bunch of bytes on the wire. So there HAS to be a method that returns a ByteBuffer that the IO layer can then use to write out (via scatter/gather type network APIs) to the wire.
>>
>> A better choice I think would be to remove this method:
>> public byte [] getBuffer()
>>
>> then deal with the places that use this - there are a bunch, but nothing looks super impossible (ie: no interface changes to filters, etc).
>>
>> Once you have that, making multiple implementations of key value is easier. I'd rather that key value becomes the abstract base class, and subclasses implement concrete details.
>>
>> > We'd also want an interface for HFileBlock, and a few others...
>> >
>> > Some of this stuff is overwhelming to think about in parallel with the existing hbase code, but it's actually not very complicated from a standalone perspective. If you can isolate it into a module behind an interface, then it's just a bunch of converting things to bytes and back. There are (hopefully) no exceptions, gc pauses, cascading failures, etc... all the things that are hard to handle to begin with and especially time consuming to debug, emulate, and write tests for. There's not even multi-threading! It's pretty easy to write tests for it and then never look at it again.
>>
>> Yes, unit tests for basic logic are good, but ultimately hbase is an integrated whole, and the concurrency problems have been really tough to crack. Things are better than they have ever been, but still a lot of testing to do.
>>
>> > Matt
>> >
>> > On Fri, Sep 16, 2011 at 6:08 PM, Ryan Rawson <ryano...@gmail.com> wrote:
>> >
>> >> Hey this stuff looks really interesting!
>> >>
>> >> On the ByteBuffer, the 'array' byte[] access to the underlying data is totally incompatible with the 'off heap' features that are implemented by DirectByteBuffer. While people talk about DBB in terms of nio performance, if you have to roundtrip the data thru java code, I'm not sure it buys you much - you still need to move data in and out of the main Java heap. Typically this is geared more towards apps which read and write from/to sockets/files with minimal processing.
>> >>
>> >> While in the past I have been pretty bullish on off-heap caching for HBase, I have since changed my mind due to the poor API (ByteBuffer is a sucky way to access data structures in ram), and other reasons (ping me off list if you want). The KeyValue code pretty much presumes that data is in byte[] anyways, and I had thought that even with off-heap caching, we'd still have to copy KeyValues into main-heap during scanning anyways.
>> >>
>> >> Given the minimal size of the hfile blocks, I really don't see an issue with buffering a block output - especially if the savings is fairly substantial.
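To make the interface idea above concrete, here is a minimal sketch of the kind of refactor being discussed: the HCell-style interface with the simplest possible implementation behind it, so that a flat KeyValue or a trie-backed PtKeyValue could both sit behind the same methods and callers never need getBuffer(). Everything below other than the HCell method names quoted above is hypothetical, not actual HBase code.

    // Minimal sketch only: the proposed HCell interface plus a trivial implementation.
    // Class names other than HCell are made up for illustration.
    public interface HCell {
        byte[] getRow();
        byte[] getFamily();
        byte[] getQualifier();
        long getTimestamp();
        byte getType();
        byte[] getValue();
    }

    // Simplest possible backing implementation: each component held as its own array.
    // A KeyValue (flat byte[] layout) or PtKeyValue (trie-backed) would implement the
    // same interface, so scanning code can be written against HCell instead.
    class SimpleCell implements HCell {
        private final byte[] row, family, qualifier, value;
        private final long timestamp;
        private final byte type;

        SimpleCell(byte[] row, byte[] family, byte[] qualifier,
                   long timestamp, byte type, byte[] value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.type = type;
            this.value = value;
        }

        @Override public byte[] getRow()       { return row; }
        @Override public byte[] getFamily()    { return family; }
        @Override public byte[] getQualifier() { return qualifier; }
        @Override public long getTimestamp()   { return timestamp; }
        @Override public byte getType()        { return type; }
        @Override public byte[] getValue()     { return value; }
    }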
>> >> Thanks,
>> >> -ryan
>> >>
>> >> On Fri, Sep 16, 2011 at 5:48 PM, Matt Corgan <mcor...@hotpads.com> wrote:
>> >> > Jacek,
>> >> >
>> >> > Thanks for helping out with this. I implemented most of the DeltaEncoder and DeltaEncoderSeeker. I haven't taken the time to generate a good set of test data for any of this, but it does pass on some very small input data that aims to cover the edge cases I can think of. Perhaps you have full HFiles you can run through it.
>> >> >
>> >> > https://github.com/hotpads/hbase-prefix-trie/tree/master/src/org/apache/hadoop/hbase/keyvalue/trie/deltaencoder
>> >> >
>> >> > I also put some notes on the PtDeltaEncoder regarding how the prefix trie should be optimally used. I can't think of a situation where you'd want to blow it up into the full uncompressed KeyValue ByteBuffer, so implementing the DeltaEncoder interface is a mismatch, but I realize it's only a starting point.
>> >> >
>> >> > You also would never really have a full ByteBuffer of KeyValues to pass to it for compression. Typically, you'd be passing individual KeyValues from the memstore flush or from a collection of HFiles being merged through a PriorityQueue.
>> >> >
>> >> > The end goal is to operate on the encoded trie without decompressing it. Long term, and in certain circumstances, it may even be possible to pass the compressed trie over the wire to the client who can then decode it.
>> >> >
>> >> > Let me know if I implemented that the way you had in mind. I haven't done the seekTo method yet, but will try to do that next week.
>> >> >
>> >> > Matt
>> >> >
>> >> > On Wed, Sep 14, 2011 at 3:43 PM, Jacek Migdal <ja...@fb.com> wrote:
>> >> >
>> >> >> Matt,
>> >> >>
>> >> >> Thanks a lot for the code. Great job!
>> >> >>
>> >> >> As I mentioned in JIRA, I work full time on the delta encoding [1]. Right now the code and integration is almost done. Most of the parts are under review. Since it is a big change we plan to test it very carefully. After that, it will be ported to trunk and open sourced.
>> >> >>
>> >> >> From a quick glimpse, I have taken a different approach. I implemented a few different algorithms which are simpler. They also aim mostly to save space while having fast decompress/compress code. However, the access is still sequential. The goal of my project is to save some RAM by having a compressed BlockCache in memory.
>> >> >>
>> >> >> On the other hand, it seems that you are most concerned about seeking performance.
>> >> >>
>> >> >> I will read your code more carefully. A quick glimpse: we both implemented some routines (like vint), but I expect that there is no overlap.
>> >> >>
>> >> >> I also saw that you spent some time investigating ByteBuffer vs. byte[]. I experienced a significant negative performance impact when I switched to ByteBuffer. However, I postponed this optimization.
>> >> >>
>> >> >> Right now I think the easiest way to go would be for you to implement the DeltaEncoder interface after my change:
>> >> >> http://pastebin.com/Y8UxUByG
>> >> >> (note, there might be some minor changes)
>> >> >>
>> >> >> That way, you will reuse my integration with existing code for "free".
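For readers unfamiliar with the "vint" routines mentioned above: a vint is just a variable-length integer encoding, so that small values (node pointers, deltas) take one or two bytes instead of a full four. The sketch below shows one common scheme (7 data bits per byte with a high continuation bit); it is illustrative only and not necessarily the format used by either implementation being discussed.

    import java.io.ByteArrayOutputStream;

    // Generic variable-length int ("vint") sketch, for illustration only.
    public final class VarInt {

        /** Writes value using 1-5 bytes; smaller non-negative values take fewer bytes. */
        public static void write(ByteArrayOutputStream out, int value) {
            while ((value & ~0x7F) != 0) {
                out.write((value & 0x7F) | 0x80);  // high bit set: more bytes follow
                value >>>= 7;
            }
            out.write(value);                      // final byte, high bit clear
        }

        /** Reads a value previously written by write(), starting at offset. */
        public static int read(byte[] in, int offset) {
            int result = 0;
            int shift = 0;
            byte b;
            do {
                b = in[offset++];
                result |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            return result;
        }
    }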
>> >> >> Jacek
>> >> >>
>> >> >> [1] - I prefer to call it that way. Prefix is one of the algorithms, but there are also different approaches.
>> >> >>
>> >> >> On 9/13/11 1:36 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>> >> >>
>> >> >> >Matt:
>> >> >> >Thanks for the update.
>> >> >> >Cacheable interface is defined in:
>> >> >> >src/main/java/org/apache/hadoop/hbase/io/hfile/Cacheable.java
>> >> >> >
>> >> >> >You can find the implementation at:
>> >> >> >src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
>> >> >> >
>> >> >> >I will browse your code later.
>> >> >> >
>> >> >> >On Tue, Sep 13, 2011 at 12:44 AM, Matt Corgan <mcor...@hotpads.com> wrote:
>> >> >> >
>> >> >> >> Hi devs,
>> >> >> >>
>> >> >> >> I put a "developer preview" of a prefix compression algorithm on github. It still needs some details worked out, a full set of iterators, about 200 optimizations, and a bunch of other stuff... but, it successfully passes some preliminary tests so I thought I'd get it in front of more eyeballs sooner rather than later.
>> >> >> >>
>> >> >> >> https://github.com/hotpads/hbase-prefix-trie
>> >> >> >>
>> >> >> >> It depends on HBase's Bytes.java and KeyValue.java, which depend on hadoop. Those jars are in there, along with some others for full HFile testing in the near future.
>> >> >> >>
>> >> >> >> A good place to start looking at the code is org.apache.hadoop.hbase.keyvalue.trie.builder.KeyValuePtBuilder. It's the main driver of the compaction side of things, taking KeyValues (in sorted order) and generating a byte[] to be saved as a disk block. Then for reading it back in, there is trie.compact.read.PtIterator which takes the byte[] and spits out KeyValues. The main test class is trie.test.row.RowBuilderTests which round-trips a bunch of KeyValues to make sure the outputs are the same as the inputs. trie.compact.row.node.PtRowNode is the format of a single trie node in the underlying byte[].
>> >> >> >>
>> >> >> >> The current version generates a lot of little objects (garbage) while writing and reading. I plan to optimize it to the point where most variables are primitives on the stack (at least when reading), but I think these intermediate classes are important for debugging. I'll probably try to keep them around going forward and develop a more compact version in parallel.
>> >> >> >>
>> >> >> >> It uses trie style compression for the row keys and column qualifiers, where pointers between nodes are compacted ints. It keeps a list of compacted, de-duped deltas for timestamps, and if they're all the same, it stores only one (long) per block. If all KeyValue.TYPE operations are the same, then it only stores one (byte) per block.
>> >> >> >>
>> >> >> >> It's designed around efficient cpu cache usage and elimination of 8-byte pointers, so should be fast.
>> >> >> >> Get calls can traverse the trie nodes to dig up a row key while barely loading anything from memory to cache, as opposed to current hbase which may load the better part of a block into cache. Scanners/Filters/Comparators can all be designed to be trie-aware so they can iterate through 20 columnQualifiers in the same row without constantly re-scanning the rowKey bytes... etc...
>> >> >> >>
>> >> >> >> Here are a few questions we'll need to answer at some point:
>> >> >> >>
>> >> >> >> * where should this code go?
>> >> >> >>   - I'd vote for keeping it isolated and limiting references back to the main hbase project, sort of like the gzip/lzo algorithms.
>> >> >> >>   - If it's strictly isolated, it'll be easier to keep it well tested for correctness/performance and let more people tinker with it to make it faster. It'll also force us to come up with the right interfaces to allow other compression implementations.
>> >> >> >>
>> >> >> >> * current HFileWriter sends KeyValues to the OutputStream as soon as they're processed, but would it be ok to queue up a whole block in memory and write it all at once?
>> >> >> >>   - I'd vote for yes. It makes it easier to arrange the data to be more read-friendly. Also, we're only talking about one block of data, which is presumably a fraction of the block cache's size.
>> >> >> >>
>> >> >> >> * should the block bytes be accessed as a byte[] or ByteBuffer? I know there's been work on off-heap cache, but I've read the blocks are pulled on-heap before they're parsed (??)
>> >> >> >>   - See org.apache.hadoop.hbase.keyvalue.trie.profile.MemoryAccessProfiler for a comparison of byte[] vs ByteBuffer speed tests. ByteBuffer looks ~2-10x slower, but some people need the off-heap cache.
>> >> >> >>   - I'll propose maintaining separate reading algorithms for each, given that the underlying bytes are in the exact same format. It should be possible to copy/paste the code and replace bytes[i] with ByteBuffer.get(i), and create parallel versions of static util methods.
>> >> >> >>
>> >> >> >> * each block has some metadata wrapped in a class called PtBlockMeta. Does HBase currently have a way to store its values as java primitives on the heap rather than parsing them out of the byte[]/ByteBuffer on every access? If they have to be parsed out on every block access, that could take more time than the Get query it's trying to perform.
>> >> >> >>   - I know there's a new Cacheable interface or something like that. Maybe that already supports it or could be enhanced.
>> >> >> >>
>> >> >> >> What jiras do you think I should make?
>> >> >> >>
>> >> >> >> Look forward to hearing people's feedback,
>> >> >> >> Matt
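On the byte[] vs ByteBuffer question above, the "copy/paste and swap the accessor" idea would look roughly like the sketch below: the same fixed-width read implemented once against an on-heap array and once against a (possibly direct) ByteBuffer using absolute gets. Class and method names here are made up for illustration; they are not from the prefix-trie code.

    import java.nio.ByteBuffer;

    // Illustrative parallel read paths; not actual code from the repository above.
    public final class BlockReadUtil {

        /** Reads a 4-byte big-endian int at offset from an on-heap block. */
        public static int readInt(byte[] block, int offset) {
            return ((block[offset]     & 0xFF) << 24)
                 | ((block[offset + 1] & 0xFF) << 16)
                 | ((block[offset + 2] & 0xFF) << 8)
                 |  (block[offset + 3] & 0xFF);
        }

        /** Same read against a ByteBuffer (heap or direct), using absolute get(int). */
        public static int readInt(ByteBuffer block, int offset) {
            return ((block.get(offset)     & 0xFF) << 24)
                 | ((block.get(offset + 1) & 0xFF) << 16)
                 | ((block.get(offset + 2) & 0xFF) << 8)
                 |  (block.get(offset + 3) & 0xFF);
        }
    }

ByteBuffer also provides getInt(int) directly; the per-byte form above just mirrors the array version so the two code paths stay line-for-line comparable, which is the point of keeping the underlying byte format identical.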