I created separate reviewboard requests for the hbase-common module and the hbase-prefix-tree module. The first one has the Cell interface, CellOutputStream, CellScanner, etc. mentioned above.
hbase-common: https://reviews.apache.org/r/6897/
hbase-prefix-tree: https://reviews.apache.org/r/6898/

Will leave tests out for now. They're on github.

On Mon, Sep 3, 2012 at 3:17 PM, lars hofhansl <[email protected]> wrote:
> That reminds me of another thought that occurred to me while looking at
> ScanQueryMatcher. I was marveling at the relative complexity of it
> (together with StoreScanner) - admittedly some of this is my fault (see
> HBASE-4536 and HBASE-4071).
>
> It would be so much easier to read if we had proper iterator trees (at
> least in some places in the code), similar to how relational databases
> express execution plans (using scanner and iterator interchangeably).
>
> Then:
> - StoreScanner would just read from HFiles and pass KVs up; in fact it
>   might no longer be needed
> - Filters can be expressed as an iterator over those KVs
> - Handling deleted KVs would be another iterator
> - So would be the version handling/counting
> - Scanner would not need to be passed a List<KeyValue> to accumulate KVs
>   in, but would simply return KVs as they are encountered
>
> RegionScanner and StoreScanner would retain the KeyValueHeap to mergesort
> their sub scanners. The overall complexity would remain the same, but the
> parts would be isolated better.
>
> Something like this:
> RegionScanner -> HeapIt ->-> VersionCounterIt -> FilterIt -> TimeRangeIt ->
> DeletesIt -> StoreScanner -> HeapIt ->-> StoreFileIt -> ...
> (Should rename some of the things)
>
> All iterators would issue (re)seeks when possible.
>
> The iterator interface would be something like <init>, KV next(), close(),
> seek(KV), reseek(KV). Many iterators would be stateful.
>
> Would probably be a major refactoring, and the devil would be in the
> details. We would need to be careful to keep the current performance
> (currently ScanQueryMatcher is efficient, because it does a bunch of steps
> at the same time).
>
> Just "blue skying".
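Lars's proposed iterator interface (<init>, KV next(), close(), seek(KV), reseek(KV)) can be sketched roughly as follows. All class names here (CellIterator, ListScanner, FilterIt) are hypothetical stand-ins, not actual HBase classes, and seek/reseek are omitted for brevity:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Predicate;

// Toy iterator tree: each stage wraps a child iterator and transforms or
// filters the KV stream, so stages compose the way Lars describes.
public class IteratorTreeSketch {
    record KV(String row, String qualifier, long ts, String value) {}

    interface CellIterator {
        KV next();      // next KV in sort order, or null when exhausted
        void close();
        // the full interface would add seek(KV) and reseek(KV)
    }

    // Leaf stage: plays the role of a StoreScanner emitting KVs upward
    static class ListScanner implements CellIterator {
        private final Iterator<KV> it;
        ListScanner(KV... kvs) { this.it = Arrays.asList(kvs).iterator(); }
        public KV next() { return it.hasNext() ? it.next() : null; }
        public void close() {}
    }

    // Intermediate stage: a Filter expressed as an iterator over KVs
    static class FilterIt implements CellIterator {
        private final CellIterator child;
        private final Predicate<KV> keep;
        FilterIt(CellIterator child, Predicate<KV> keep) {
            this.child = child;
            this.keep = keep;
        }
        public KV next() {
            for (KV kv = child.next(); kv != null; kv = child.next()) {
                if (keep.test(kv)) return kv;
            }
            return null;
        }
        public void close() { child.close(); }
    }

    static int count(CellIterator it) {
        int n = 0;
        while (it.next() != null) n++;
        return n;
    }
}
```

A VersionCounterIt or DeletesIt stage would be another CellIterator wrapper in the same shape, which is what keeps the parts isolated.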
>
> -- Lars
>
> ________________________________
> From: Matt Corgan <[email protected]>
> To: [email protected]; lars hofhansl <[email protected]>
> Sent: Monday, September 3, 2012 11:24 AM
> Subject: Re: RPC KeyValue encoding
>
> > For CellAppender, is compile() equivalent to flushing?
>
> Yes. I'll rename CellAppender to CellOutputStream. The concept is very
> similar to a GzipOutputStream, where you write bytes to it and
> periodically call flush(), which spits out a compressed byte[] behind the
> scenes. The server would write Cells to a CellOutputStream, flush them to
> a byte[], and send the byte[] to the client. There could be a default
> encoding, and the client could send a flag to override the default.
>
> Greg, you mention omitting fields that are repeated from one KeyValue to
> the next. I think this is basically what the existing DataBlockEncoders
> are doing for KeyValues stored on disk (see PrefixKeyDeltaEncoder for
> example). I'm thinking we can use the same encoders for encoding on the
> wire. Different implementations will have different performance
> characteristics, where some may be better for disk and others for RPC,
> but the overall intent is the same.
>
> Matt
>
> On Sun, Sep 2, 2012 at 2:56 PM, lars hofhansl <[email protected]> wrote:
> > Your "coarse grain" option is what I had in mind in my email. I love
> > the option of not needing to get it all right in 0.96.
> >
> > You, Matt, and I could talk and work out the details and get it done.
> >
> > -- Lars
> >
> > ----- Original Message -----
> > From: Gregory Chanan <[email protected]>
> > To: [email protected]
> > Cc:
> > Sent: Sunday, September 2, 2012 12:52 PM
> > Subject: Re: RPC KeyValue encoding
> >
> > Lars,
> >
> > If we make the KeyValue wire format flexible enough, I think we'll be
> > able to tackle the KV-as-an-interface work later.
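The GzipOutputStream analogy Matt draws above can be made concrete with the JDK's own GZIPOutputStream. The real CellOutputStream would accept Cells and apply a pluggable encoding rather than gzip, so this is only an illustration of the write-then-flush-to-byte[] pattern:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

// Bytes go in incrementally; finishing the stream yields a compressed
// byte[] that the server could ship to the client.
public class FlushAnalogy {
    public static byte[] compress(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);            // analogous to writing Cells
        } catch (IOException e) {     // cannot happen on an in-memory sink
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();     // the byte[] sent over the wire
    }
}
```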
> > Just throwing out some ideas here:
> >
> > We could have a byte at the front of each KV serialization format that
> > gives various options in each bit, e.g.:
> > Omits Row / Omits Family / Omits Qualifier / Omits Timestamp / Omits
> > Value / plus some extra bytes for compression options and extensions.
> > Then we just need to define where the KV gets its field if it is
> > omitted, e.g. from the previous KV in the RPC that had that field
> > filled in. We sort of have this with the optional fields already,
> > although I don't recall exactly how protobuf handles those (we'd
> > probably have to do some small restructuring); what's new is defining
> > what it means when a field is omitted.
> >
> > There's some overhead with the above for small KVs, so you could also
> > go coarser grain, e.g. the Get request/response could have a similar
> > options byte like:
> > All Share Same Row / All Share Same Family / ... / and one of the bits
> > could turn on the finer grain options above (per KeyValue).
> >
> > The advantage of this is that all we'd have to get right in 0.96.0 is
> > the deserialization. The serialization could just send without any of
> > the options turned on. And we could experiment later with each
> > specific RPC call to find the best options to use, as well as what
> > storage to actually use client/server side, which you discuss.
> >
> > Greg
> >
> > On Sun, Sep 2, 2012 at 9:04 AM, Ted Yu <[email protected]> wrote:
> > > Thanks for the update, Matt.
> > >
> > > w.r.t. the Cell class, since it is so fundamental, should it reside
> > > in the org.apache.hadoop.hbase namespace as the KeyValue class does?
> > > For CellAppender, is compile() equivalent to flushing?
> > >
> > > Looking forward to your publishing on the reviewboard.
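Greg's per-KV options byte might look something like the sketch below. The bit assignments and helper names are invented for illustration and are not an actual HBase wire format:

```java
// Flags byte at the front of each serialized KV; an omitted field falls
// back to the value carried by the previous KV in the same RPC.
public class KvOptionsByte {
    static final int OMIT_ROW       = 1 << 0;
    static final int OMIT_FAMILY    = 1 << 1;
    static final int OMIT_QUALIFIER = 1 << 2;
    static final int OMIT_TIMESTAMP = 1 << 3;
    static final int OMIT_VALUE     = 1 << 4;
    // remaining bits free for compression options and extensions

    static boolean isSet(byte flags, int bit) {
        return (flags & bit) != 0;
    }

    // Decoder-side fallback: use the previous KV's row when omitted
    static String resolveRow(byte flags, String encodedRow, String prevRow) {
        return isSet(flags, OMIT_ROW) ? prevRow : encodedRow;
    }
}
```

The coarser-grain variant would put a similar byte once at the front of a Get response (All Share Same Row / All Share Same Family / ...) instead of on every KV.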
> > > On Sat, Sep 1, 2012 at 11:29 PM, Matt Corgan <[email protected]> wrote:
> > > > RPC encoding would be really nice since there is sometimes
> > > > significant wire traffic that could be reduced many-fold. I have a
> > > > particular table that I scan and stream to a gzipped output file on
> > > > S3, and I've noticed that while the app server's network input is
> > > > 100Mbps, the gzipped output can be 2Mbps!
> > > >
> > > > Finishing the PrefixTree has been slow because I've saved a couple
> > > > of tricky issues to the end and am light on time. I'll try to put
> > > > it on reviewboard Monday despite a known bug. It is built with some
> > > > of the ideas you mention in mind, Lars. Take a look at the Cell
> > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/Cell.java>
> > > > and CellAppender
> > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/appender/CellAppender.java>
> > > > classes and their comments. The idea with the CellAppender is to
> > > > stream cells into it and periodically compile()/flush() into a
> > > > byte[] which can be saved to an HFile or (eventually) sent over the
> > > > wire. For example, in HRegion.get(..), the CellAppender would
> > > > replace the "ArrayList<KeyValue> results" collection.
> > > >
> > > > After introducing the Cell interface, the trick to extending the
> > > > encoded cells up the HBase stack will be to reduce the reliance on
> > > > stand-alone KeyValues. We'll want things like the Filters and
> > > > KeyValueHeap to be able to operate on reused Cells without
> > > > materializing them into full KeyValues.
> > > > That means that something like StoreFileScanner.peek() will not
> > > > work, because the scanner cannot maintain the state of the current
> > > > and next Cells at the same time. See CellCollator
> > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/collator/CellCollator.java>
> > > > for a possible replacement for KeyValueHeap. The good news is that
> > > > this can be done in stages without major disruptions to the code
> > > > base.
> > > >
> > > > Looking at PtDataBlockEncoderSeeker
> > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PtDataBlockEncoderSeeker.java>,
> > > > this would mean transitioning from the getKeyValue() method, which
> > > > creates and fills a new KeyValue every time it's called, to the
> > > > getCurrentCell() method, which returns a reference to a Cell buffer
> > > > that is reused as the scanner proceeds. Modifying a reusable Cell
> > > > buffer rather than rapidly shooting off new KeyValues should
> > > > drastically reduce byte[] copying and garbage churn.
> > > >
> > > > I wish I understood the protocol buffers more so I could comment
> > > > specifically on that. The result sent to the client could possibly
> > > > be a plain old encoded data block (byte[]/ByteBuffer) with a header
> > > > similar to the one encoded blocks have on disk (2-byte
> > > > DataBlockEncoding id). The client then uses the same CellScanner
> > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/scanner/CellScanner.java>
> > > > that the server uses when reading blocks from the block cache.
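The getKeyValue()-vs-getCurrentCell() distinction above can be illustrated with a toy seeker that overwrites one mutable buffer in place. All names here are simplified stand-ins, not the real PtDataBlockEncoderSeeker API:

```java
// Contrast with a getKeyValue()-style method that allocates a fresh
// KeyValue per call: this seeker fills one shared buffer in place.
public class ReusableCellSketch {
    static class MutableCell {
        String row;
        String qualifier;
        String value;
    }

    static class Seeker {
        private final String[][] cells;   // pretend decoded block contents
        private final MutableCell current = new MutableCell();
        private int pos = -1;

        Seeker(String[][] cells) { this.cells = cells; }

        boolean advance() {
            if (pos + 1 >= cells.length) return false;
            pos++;
            // overwrite the shared buffer - no new allocation per cell
            current.row = cells[pos][0];
            current.qualifier = cells[pos][1];
            current.value = cells[pos][2];
            return true;
        }

        // Returns the same object every time; callers must copy if they
        // need the cell to outlive the next advance(). This is also why
        // peek()-style lookahead breaks: there is only one buffer.
        MutableCell getCurrentCell() { return current; }
    }
}
```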
> > > > A nice side-effect of sending the client an encoded byte[] is that
> > > > the java client can run the same decoder that the server uses,
> > > > which should be tremendously faster and more memory efficient than
> > > > the current method of building a pointer-heavy result map. I had
> > > > envisioned this kind of thing being baked into ClientV2, but I
> > > > guess it could be wrangled into the current one if someone wanted.
> > > >
> > > > food for thought... cheers,
> > > > Matt
> > > >
> > > > ps - I'm travelling tomorrow so may be silent on email
> > > >
> > > > On Sat, Sep 1, 2012 at 9:03 PM, lars hofhansl <[email protected]> wrote:
> > > > > In 0.96 we are changing the wire protocol to use protobufs.
> > > > >
> > > > > While we're at it, I am wondering whether we can optimize a few
> > > > > things:
> > > > >
> > > > > 1. A Put or Delete can send many KeyValues, all of which have the
> > > > > same row key, and many will likely have the same column family.
> > > > > 2. Likewise, a Scan result or Get is for a single row. Each KV
> > > > > will again have the same row key, and many will have the same
> > > > > column family.
> > > > > 3. The client and server do not need to share the same KV
> > > > > implementation as long as they are (de)serialized the same. KVs
> > > > > on the server will be backed by a shared larger byte[] (the block
> > > > > read from disk); the KVs in the memstore will probably have the
> > > > > same implementation (to use slabs, but maybe even here it would
> > > > > be beneficial to store the row key and CF separately and share
> > > > > between KVs where possible). Client KVs on the other hand could
> > > > > share a row key and/or column family.
> > > > >
> > > > > This would require a KeyValue interface and two different
> > > > > implementations; one backed by a byte[], another that stores the
> > > > > pieces separately.
> > > > > Once that is done one could even envision KVs backed by a
> > > > > ByteBuffer.
> > > > >
> > > > > Both (de)serialize the same, so when the server serializes the
> > > > > KVs it would send the row key first, then the CF, then column,
> > > > > TS, finally followed by the value. The client could deserialize
> > > > > this and directly reuse the shared parts in its KV
> > > > > implementation. That has the potential to significantly cut down
> > > > > client/server network IO and save memory on the client,
> > > > > especially with wide columns.
> > > > >
> > > > > Turning KV into an interface is a major undertaking. Would it be
> > > > > worth the effort? Or maybe the RPC should just be compressed?
> > > > >
> > > > > We'd have to do that before 0.96.0 (I think), because even
> > > > > protobuf would not provide enough flexibility to make such a
> > > > > change later - which incidentally leads to another discussion
> > > > > about whether client and server should do an initial handshake to
> > > > > detect each other's version, but that is a different story.
> > > > >
> > > > > -- Lars
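Lars's serialization order (row key first, then CF, then the per-KV column/TS/value, with the repeated parts sent once) could be sketched like this. The delimiter format is invented purely for illustration and is not the actual HBase protobuf encoding:

```java
import java.util.List;

// Encode a run of KVs from one row: emit the row and family once, then
// only the per-KV column/ts/value, so the client can share the prefix.
public class SharedPrefixEncoding {
    record KV(String row, String family, String column, long ts, String value) {}

    static String encodeRow(List<KV> kvs) {
        StringBuilder out = new StringBuilder();
        // shared prefix, written a single time for the whole run
        out.append(kvs.get(0).row()).append('|')
           .append(kvs.get(0).family()).append('|');
        for (KV kv : kvs) {
            out.append(kv.column()).append(':')
               .append(kv.ts()).append(':')
               .append(kv.value()).append(';');
        }
        return out.toString();
    }
}
```

A client-side KV implementation could then hold one reference to the decoded row/family and point each KV at it, which is where the memory savings for wide rows would come from.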
