Lars, is it possible to have hooks in this layer while we handle version counting, so that we can know which KVs were omitted due to versioning?
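Something like the sketch below is what I have in mind -- purely illustrative; VersionCountingIterator and SkipHook are made-up names, and the iterator shape just mirrors the next/seek/reseek/close interface you describe below:

    import org.apache.hadoop.hbase.KeyValue;

    /**
     * Rough sketch only -- VersionCountingIterator and SkipHook are made-up
     * names, not existing HBase code. The hook is the part I am asking about:
     * a way for the version-counting layer to report every KV it drops
     * because maxVersions was already reached.
     */
    public interface VersionCountingIterator {
      KeyValue next();             // next surviving KV, or null when exhausted
      void seek(KeyValue kv);
      void reseek(KeyValue kv);
      void close();

      /** Register a callback invoked for each KV omitted due to versioning. */
      void setSkipHook(SkipHook hook);

      interface SkipHook {
        void onVersionSkipped(KeyValue skipped, int versionsAlreadySeen);
      }
    }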
Regards,
Ram

-----Original Message-----
From: Matt Corgan [mailto:[email protected]]
Sent: Tuesday, September 04, 2012 5:25 AM
To: [email protected]; lars hofhansl
Subject: Re: RPC KeyValue encoding

I created separate reviewboard requests for the hbase-common module and the hbase-prefix-tree module. The first one has the Cell interface, CellOutputStream, CellScanner, etc. mentioned above.

hbase-common: https://reviews.apache.org/r/6897/
hbase-prefix-tree: https://reviews.apache.org/r/6898/

Will leave tests out for now. They're on github.

On Mon, Sep 3, 2012 at 3:17 PM, lars hofhansl <[email protected]> wrote:

That reminds me of another thought that occurred to me while looking at ScanQueryMatcher.
I was marveling at the relative complexity of it (together with StoreScanner) - admittedly some of this is my fault (see HBASE-4536 and HBASE-4071).

It would be so much easier to read if we had proper iterator trees (at least in some places in the code), similar to how relational databases express execution plans (using scanner and iterator interchangeably).

Then:
- StoreScanner would just read from HFiles and pass KVs up; in fact it might no longer be needed.
- Filters can be expressed as an iterator over those KVs.
- Handling deleted KVs would be another iterator.
- So would be the version handling/counting.
- Scanners would not need to be passed a List<KeyValue> to accumulate KVs in, but would simply return KVs as they are encountered.

RegionScanner and StoreScanner would retain the KeyValueHeap to mergesort their sub scanners.
The overall complexity would remain the same, but the parts would be isolated better.

Something like this:
RegionScanner -> HeapIt ->-> VersionCounterIt -> FilterIt -> TimeRangeIt -> DeletesIt -> StoreScanner -> HeapIt ->-> StoreFileIt -> ...
(Should rename some of the things.)

All iterators would issue (re)seeks when possible.

The iterator interface would be something like <init>, KV next(), close(), seek(KV), reseek(KV). Many iterators would be stateful.

Would probably be a major refactoring, and the devil would be in the details. We would need to be careful to keep the current performance (currently ScanQueryMatcher is efficient because it does a bunch of steps at the same time).

Just "blue skying".

-- Lars

________________________________
From: Matt Corgan <[email protected]>
To: [email protected]; lars hofhansl <[email protected]>
Sent: Monday, September 3, 2012 11:24 AM
Subject: Re: RPC KeyValue encoding

> For CellAppender, is compile() equivalent to flushing?

Yes. I'll rename CellAppender to CellOutputStream. The concept is very similar to a GzipOutputStream where you write bytes to it and periodically call flush(), which spits out a compressed byte[] behind the scenes. The server would write Cells to a CellOutputStream, flush them to a byte[], and send the byte[] to the client. There could be a default encoding, and the client could send a flag to override the default.
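Roughly, I picture something like the following -- illustrative only; the names and signatures here are not the actual classes on the prefix-tree branch, and I'm using KeyValue instead of Cell just to keep the sketch self-contained:

    import org.apache.hadoop.hbase.KeyValue;

    /**
     * Illustrative sketch of the CellOutputStream idea -- not the class from
     * the prefix-tree branch. Like a GZIPOutputStream: append cells in sort
     * order, then flush() hands back an encoded byte[] for that chunk.
     */
    public interface CellOutputStreamSketch {
      /** Append the next cell; it is buffered and encoded internally. */
      void write(KeyValue cell);

      /** Finish the current chunk and return its encoded bytes (the old compile()). */
      byte[] flush();
    }

The server-side write path for a Get or Scan would then be a loop of write() calls followed by one flush() per response chunk, and the default encoding could be swapped out when the client sends an overriding flag.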
Greg, you mention omitting fields that are repeated from one KeyValue to the next. I think this is basically what the existing DataBlockEncoders are doing for KeyValues stored on disk (see PrefixKeyDeltaEncoder for example). I'm thinking we can use the same encoders for encoding on the wire. Different implementations will have different performance characteristics, where some may be better for disk and others for RPC, but the overall intent is the same.

Matt

On Sun, Sep 2, 2012 at 2:56 PM, lars hofhansl <[email protected]> wrote:

Your "coarse grain" option is what I had in mind in my email. I love the option of not needing to get it all right in 0.96.

You, Matt, and I could talk and work out the details and get it done.

-- Lars

----- Original Message -----
From: Gregory Chanan <[email protected]>
To: [email protected]
Cc:
Sent: Sunday, September 2, 2012 12:52 PM
Subject: Re: RPC KeyValue encoding

Lars,

If we make the KeyValue wire format flexible enough, I think we'll be able to tackle the KV-as-an-interface work later. Just throwing out some ideas here:

We could have a byte at the front of each serialized KV that gives various options, one per bit, e.g. Omits Row / Omits Family / Omits Qualifier / Omits Timestamp / Omits Value, plus some extra bytes for compression options and extensions. Then we just need to define where the KV gets its field if it is omitted, e.g. from the previous KV in the RPC that had that field filled in. We sort of have this with the optional fields already, although I don't recall exactly how protobuf handles those (we'd probably have to do some small restructuring); what's new is defining what it means when a field is omitted.

There's some overhead with the above for small KVs, so you could also go coarser grain, e.g. the Get request/response could have a similar options byte like: All Share Same Row / All Share Same Family / ... / and one of the bits could turn on the finer grain options above (per KeyValue).

The advantage of this is that all we'd have to get right in 0.96.0 is the deserialization. The serialization could just send without any of the options turned on. And we could experiment later with each specific RPC call to find the best options to use, as well as what storage to actually use client/server side, which you discuss.
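For concreteness, the per-KV flag byte could look something like this -- the bit positions, the KvWireFlags name, and the helper are all made up, purely to illustrate:

    /**
     * Illustrative only -- not a wire-format proposal. One flag byte per
     * serialized KV; a set bit means the field is omitted and should be taken
     * from the previous KV in the same RPC that had it filled in.
     */
    public final class KvWireFlags {
      public static final byte OMITS_ROW       = 1 << 0;
      public static final byte OMITS_FAMILY    = 1 << 1;
      public static final byte OMITS_QUALIFIER = 1 << 2;
      public static final byte OMITS_TIMESTAMP = 1 << 3;
      public static final byte OMITS_VALUE     = 1 << 4;
      // remaining bits reserved for compression options and extensions

      /** Example accessor: does this KV reuse the previous KV's row? */
      public static boolean omitsRow(byte flags) {
        return (flags & OMITS_ROW) != 0;
      }

      private KvWireFlags() {}
    }

A deserializer that understands these bits is all that would have to be right in 0.96.0; the serializer could keep sending everything with flags == 0 until we have benchmarked which options pay off per RPC call.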
Greg

On Sun, Sep 2, 2012 at 9:04 AM, Ted Yu <[email protected]> wrote:

Thanks for the update, Matt.

w.r.t. the Cell class, since it is so fundamental, should it reside in the org.apache.hadoop.hbase namespace as the KeyValue class does?
For CellAppender, is compile() equivalent to flushing?

Looking forward to your publishing it on the reviewboard.

On Sat, Sep 1, 2012 at 11:29 PM, Matt Corgan <[email protected]> wrote:

RPC encoding would be really nice since there is sometimes significant wire traffic that could be reduced many-fold. I have a particular table that I scan and stream to a gzipped output file on S3, and I've noticed that while the app server's network input is 100 Mbps, the gzipped output can be 2 Mbps!

Finishing the PrefixTree has been slow because I've saved a couple of tricky issues to the end and am light on time. I'll try to put it on reviewboard Monday despite a known bug. It is built with some of the ideas you mention in mind, Lars. Take a look at the Cell
<https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/Cell.java>
and CellAppender
<https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/appender/CellAppender.java>
classes and their comments. The idea with the CellAppender is to stream cells into it and periodically compile()/flush() into a byte[] which can be saved to an HFile or (eventually) sent over the wire. For example, in HRegion.get(..), the CellAppender would replace the "ArrayList<KeyValue> results" collection.

After introducing the Cell interface, the trick to extending the encoded cells up the HBase stack will be to reduce the reliance on stand-alone KeyValues. We'll want things like the Filters and KeyValueHeap to be able to operate on reused Cells without materializing them into full KeyValues. That means that something like StoreFileScanner.peek() will not work, because the scanner cannot maintain the state of the current and next Cells at the same time. See CellCollator
<https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/collator/CellCollator.java>
for a possible replacement for KeyValueHeap. The good news is that this can be done in stages without major disruptions to the code base.

Looking at PtDataBlockEncoderSeeker
<https://github.com/hotpads/hbase/blob/prefix-tree/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PtDataBlockEncoderSeeker.java>,
this would mean transitioning from the getKeyValue() method, which creates and fills a new KeyValue every time it's called, to the getCurrentCell() method, which returns a reference to a Cell buffer that is reused as the scanner proceeds. Modifying a reusable Cell buffer rather than rapidly shooting off new KeyValues should drastically reduce byte[] copying and garbage churn.

I wish I understood the protocol buffers more so I could comment specifically on that. The result sent to the client could possibly be a plain old encoded data block (byte[]/ByteBuffer) with a header similar to the one encoded blocks have on disk (a 2-byte DataBlockEncoding id). The client then uses the same CellScanner
<https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/scanner/CellScanner.java>
that the server uses when reading blocks from the block cache. A nice side-effect of sending the client an encoded byte[] is that the Java client can run the same decoder that the server uses, which should be tremendously faster and more memory efficient than the current method of building a pointer-heavy result map. I had envisioned this kind of thing being baked into ClientV2, but I guess it could be wrangled into the current one if someone wanted.
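As a rough sketch -- EncodedResultBlock is a hypothetical helper, not existing code -- the response payload could mirror that on-disk convention of a 2-byte encoding id followed by the encoded cell bytes:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;

    /**
     * Hypothetical helper, not existing HBase code. Wraps encoded cells in the
     * same shape an encoded block has on disk: a 2-byte encoding id followed
     * by the encoder's output. The client reads the id, picks the matching
     * decoder, and scans the remainder with the same CellScanner the server
     * uses on block-cache reads.
     */
    public final class EncodedResultBlock {

      public static byte[] wrap(short dataBlockEncodingId, byte[] encodedCells) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(2 + encodedCells.length);
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeShort(dataBlockEncodingId);   // same 2-byte header as on disk
        out.write(encodedCells);               // output of the chosen encoder
        out.flush();
        return bytes.toByteArray();
      }

      /** Client side: peel off the encoding id, leaving the buffer at the encoded cells. */
      public static short readEncodingId(ByteBuffer payload) {
        return payload.getShort();
      }

      private EncodedResultBlock() {}
    }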
Food for thought... cheers,
Matt

ps - I'm travelling tomorrow so may be silent on email.

On Sat, Sep 1, 2012 at 9:03 PM, lars hofhansl <[email protected]> wrote:

In 0.96 we are changing the wire protocol to use protobufs.

While we're at it, I am wondering whether we can optimize a few things:

1. A Put or Delete can send many KeyValues, all of which have the same row key and many of which will likely have the same column family.
2. Likewise, a Scan result or Get is for a single row. Each KV will again have the same row key and many will have the same column family.
3. The client and server do not need to share the same KV implementation as long as they (de)serialize the same. KVs on the server will be backed by a shared larger byte[] (the block read from disk); the KVs in the memstore will probably have the same implementation (to use slab, but maybe even here it would be beneficial to store the row key and CF separately and share them between KVs where possible). Client KVs, on the other hand, could share a row key and/or column family.

This would require a KeyValue interface and two different implementations: one backed by a byte[], another that stores the pieces separately. Once that is done, one could even envision KVs backed by a byte buffer.

Both (de)serialize the same, so when the server serializes the KVs it would send the row key first, then the CF, then column, TS, finally followed by the value. The client could deserialize this and directly reuse the shared part in its KV implementation. That has the potential to significantly cut down client/server network IO and save memory on the client, especially with wide columns.

Turning KV into an interface is a major undertaking. Would it be worth the effort? Or maybe the RPC should just be compressed?

We'd have to do that before 0.96.0 (I think), because even protobuf would not provide enough flexibility to make such a change later - which incidentally leads to another discussion about whether client and server should do an initial handshake to detect each other's version, but that is a different story.

-- Lars
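To make point 3 above concrete, a client-side KV implementation that shares its row and family arrays might look roughly like this -- SharedPrefixClientKv is a hypothetical sketch, not a proposed API:

    /**
     * Hypothetical sketch, not a proposed API. Many instances reference the
     * same row and family byte[]s instead of each KV copying its own; only the
     * qualifier, timestamp, and value are per-cell. (De)serialization would
     * still produce the standard KeyValue layout.
     */
    public class SharedPrefixClientKv {
      private final byte[] row;        // shared across all KVs of the same row
      private final byte[] family;     // shared across all KVs of the same family
      private final byte[] qualifier;
      private final long timestamp;
      private final byte[] value;

      public SharedPrefixClientKv(byte[] row, byte[] family, byte[] qualifier,
                                  long timestamp, byte[] value) {
        this.row = row;                // references shared, not copied
        this.family = family;
        this.qualifier = qualifier;
        this.timestamp = timestamp;
        this.value = value;
      }

      public byte[] getRow()       { return row; }
      public byte[] getFamily()    { return family; }
      public byte[] getQualifier() { return qualifier; }
      public long getTimestamp()   { return timestamp; }
      public byte[] getValue()     { return value; }
    }

A deserializer that reads the row first, then the family, then the per-cell fields could hand the same row and family references to every KV it builds for that row.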
