I created separate reviewboard requests for the hbase-common module and the hbase-prefix-tree module. The first one has the Cell interface, CellOutputStream, CellScanner, etc. mentioned above.
hbase-common: https://reviews.apache.org/r/6897/
hbase-prefix-tree: https://reviews.apache.org/r/6898/

Will leave tests out for now. They're on github.

On Mon, Sep 3, 2012 at 3:17 PM, lars hofhansl <[email protected]> wrote:
> That reminds me of another thought that occurred to me while looking at
> ScanQueryMatcher. I was marveling at the relative complexity of it
> (together with StoreScanner) - admittedly some of this is my fault (see
> HBASE-4536 and HBASE-4071).
>
> It would be so much easier to read if we had proper iterator trees (at
> least in some places in the code), similar to how relational databases
> express execution plans (using scanner and iterator interchangeably).
>
> Then:
> - StoreScanner would just read from HFiles and pass KVs up; in fact it
>   might no longer be needed
> - Filters can be expressed as an iterator over those KVs
> - Handling deleted KVs would be another iterator
> - So would be the version handling/counting
> - Scanner would not need to be passed a List<KeyValue> to accumulate KVs
>   in, but would simply return KVs as they are encountered
>
> RegionScanner and StoreScanner would retain the KeyValueHeap to mergesort
> their sub scanners. The overall complexity would remain the same, but the
> parts would be isolated better.
>
> Something like this:
> RegionScanner -> HeapIt ->-> VersionCounterIt -> FilterIt -> TimeRangeIt ->
> DeletesIt -> StoreScanner -> HeapIt ->-> StoreFileIt -> ...
> (Should rename some of the things)
>
> All iterators would issue (re)seeks when possible.
>
> The iterator interface would be something like <init>, KV next(), close(),
> seek(KV), reseek(KV). Many iterators would be stateful.
>
> Would probably be a major refactoring, and the devil would be in the
> details. We would need to be careful to keep the current performance
> (currently ScanQueryMatcher is efficient, because it does a bunch of steps
> at the same time).
>
> Just "blue skying".
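Lars's proposed iterator interface (<init>, KV next(), close(), seek(KV), reseek(KV)) can be sketched roughly as follows. All class names here (CellIterator, ListScanner, FilterIt) are hypothetical stand-ins, not actual HBase classes, and seek/reseek are omitted for brevity:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Predicate;

// Toy iterator tree: each stage wraps a child iterator and transforms or
// filters the KV stream, so stages compose the way Lars describes.
public class IteratorTreeSketch {
    record KV(String row, String qualifier, long ts, String value) {}

    interface CellIterator {
        KV next();      // next KV in sort order, or null when exhausted
        void close();
        // the full interface would add seek(KV) and reseek(KV)
    }

    // Leaf stage: plays the role of a StoreScanner emitting KVs upward
    static class ListScanner implements CellIterator {
        private final Iterator<KV> it;
        ListScanner(KV... kvs) { this.it = Arrays.asList(kvs).iterator(); }
        public KV next() { return it.hasNext() ? it.next() : null; }
        public void close() {}
    }

    // Intermediate stage: a Filter expressed as an iterator over KVs
    static class FilterIt implements CellIterator {
        private final CellIterator child;
        private final Predicate<KV> keep;
        FilterIt(CellIterator child, Predicate<KV> keep) {
            this.child = child;
            this.keep = keep;
        }
        public KV next() {
            for (KV kv = child.next(); kv != null; kv = child.next()) {
                if (keep.test(kv)) return kv;
            }
            return null;
        }
        public void close() { child.close(); }
    }

    static int count(CellIterator it) {
        int n = 0;
        while (it.next() != null) n++;
        return n;
    }
}
```

A VersionCounterIt or DeletesIt stage would be another CellIterator wrapper in the same shape, which is what keeps the parts isolated.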
>
> -- Lars
>
> ________________________________
> From: Matt Corgan <[email protected]>
> To: [email protected]; lars hofhansl <[email protected]>
> Sent: Monday, September 3, 2012 11:24 AM
> Subject: Re: RPC KeyValue encoding
>
> > For CellAppender, is compile() equivalent to flushing?
>
> Yes. I'll rename CellAppender to CellOutputStream. The concept is very
> similar to a GzipOutputStream, where you write bytes to it and
> periodically call flush(), which spits out a compressed byte[] behind the
> scenes. The server would write Cells to a CellOutputStream, flush them to
> a byte[], and send the byte[] to the client. There could be a default
> encoding, and the client could send a flag to override the default.
>
> Greg, you mention omitting fields that are repeated from one KeyValue to
> the next. I think this is basically what the existing DataBlockEncoders
> are doing for KeyValues stored on disk (see PrefixKeyDeltaEncoder for
> example). I'm thinking we can use the same encoders for encoding on the
> wire. Different implementations will have different performance
> characteristics, where some may be better for disk and others for RPC,
> but the overall intent is the same.
>
> Matt
>
> On Sun, Sep 2, 2012 at 2:56 PM, lars hofhansl <[email protected]> wrote:
> > Your "coarse grain" option is what I had in mind in my email. I love
> > the option of not needing to get it all right in 0.96.
> >
> > You, Matt, and I could talk and work out the details and get it done.
> >
> > -- Lars
> >
> > ----- Original Message -----
> > From: Gregory Chanan <[email protected]>
> > To: [email protected]
> > Cc:
> > Sent: Sunday, September 2, 2012 12:52 PM
> > Subject: Re: RPC KeyValue encoding
> >
> > Lars,
> >
> > If we make the KeyValue wire format flexible enough, I think we'll be
> > able to tackle the KV-as-an-interface work later.
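The GzipOutputStream analogy Matt draws above can be made concrete with the JDK's own GZIPOutputStream. The real CellOutputStream would accept Cells and apply a pluggable encoding rather than gzip, so this is only an illustration of the write-then-flush-to-byte[] pattern:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

// Bytes go in incrementally; finishing the stream yields a compressed
// byte[] that the server could ship to the client.
public class FlushAnalogy {
    public static byte[] compress(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);            // analogous to writing Cells
        } catch (IOException e) {     // cannot happen on an in-memory sink
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();     // the byte[] sent over the wire
    }
}
```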
> > Just throwing out some ideas here:
> >
> > We could have a byte at the front of each KV serialization format that
> > gives various options in each bit, e.g.:
> > Omits Row / Omits Family / Omits Qualifier / Omits Timestamp / Omits
> > Value / plus some extra bytes for compression options and extensions.
> > Then we just need to define where the KV gets its field if it is
> > omitted, e.g. from the previous KV in the RPC that had that field
> > filled in. We sort of have this with the optional fields already,
> > although I don't recall exactly how protobuf handles those (we'd
> > probably have to do some small restructuring); what's new is defining
> > what it means when a field is omitted.
> >
> > There's some overhead with the above for small KVs, so you could also
> > go coarser grain, e.g. the Get request/response could have a similar
> > options byte like:
> > All Share Same Row / All Share Same Family / ... / and one of the bits
> > could turn on the finer grain options above (per KeyValue).
> >
> > The advantage of this is that all we'd have to get right in 0.96.0 is
> > the deserialization. The serialization could just send without any of
> > the options turned on. And we could experiment later with each
> > specific RPC call to find the best options to use, as well as what
> > storage to actually use client/server side, which you discuss.
> >
> > Greg
> >
> > On Sun, Sep 2, 2012 at 9:04 AM, Ted Yu <[email protected]> wrote:
> > > Thanks for the update, Matt.
> > >
> > > w.r.t. the Cell class, since it is so fundamental, should it reside
> > > in the org.apache.hadoop.hbase namespace as the KeyValue class does?
> > > For CellAppender, is compile() equivalent to flushing?
> > >
> > > Looking forward to your publishing on the reviewboard.
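Greg's per-KV options byte might look something like the sketch below. The bit assignments and helper names are invented for illustration and are not an actual HBase wire format:

```java
// Flags byte at the front of each serialized KV; an omitted field falls
// back to the value carried by the previous KV in the same RPC.
public class KvOptionsByte {
    static final int OMIT_ROW       = 1 << 0;
    static final int OMIT_FAMILY    = 1 << 1;
    static final int OMIT_QUALIFIER = 1 << 2;
    static final int OMIT_TIMESTAMP = 1 << 3;
    static final int OMIT_VALUE     = 1 << 4;
    // remaining bits free for compression options and extensions

    static boolean isSet(byte flags, int bit) {
        return (flags & bit) != 0;
    }

    // Decoder-side fallback: use the previous KV's row when omitted
    static String resolveRow(byte flags, String encodedRow, String prevRow) {
        return isSet(flags, OMIT_ROW) ? prevRow : encodedRow;
    }
}
```

The coarser-grain variant would put a similar byte once at the front of a Get response (All Share Same Row / All Share Same Family / ...) instead of on every KV.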
> > > On Sat, Sep 1, 2012 at 11:29 PM, Matt Corgan <[email protected]> wrote:
> > > > RPC encoding would be really nice since there is sometimes
> > > > significant wire traffic that could be reduced many-fold. I have a
> > > > particular table that I scan and stream to a gzipped output file on
> > > > S3, and I've noticed that while the app server's network input is
> > > > 100Mbps, the gzipped output can be 2Mbps!
> > > >
> > > > Finishing the PrefixTree has been slow because I've saved a couple
> > > > of tricky issues to the end and am light on time. I'll try to put
> > > > it on reviewboard Monday despite a known bug. It is built with some
> > > > of the ideas you mention in mind, Lars. Take a look at the Cell
> > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/Cell.java>
> > > > and CellAppender
> > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/appender/CellAppender.java>
> > > > classes and their comments. The idea with the CellAppender is to
> > > > stream cells into it and periodically compile()/flush() into a
> > > > byte[] which can be saved to an HFile or (eventually) sent over the
> > > > wire. For example, in HRegion.get(..), the CellAppender would
> > > > replace the "ArrayList<KeyValue> results" collection.
> > > >
> > > > After introducing the Cell interface, the trick to extending the
> > > > encoded cells up the HBase stack will be to reduce the reliance on
> > > > stand-alone KeyValues. We'll want things like the Filters and
> > > > KeyValueHeap to be able to operate on reused Cells without
> > > > materializing them into full KeyValues.
> > > > That means that something like StoreFileScanner.peek() will not
> > > > work, because the scanner cannot maintain the state of the current
> > > > and next Cells at the same time. See CellCollator
> > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/collator/CellCollator.java>
> > > > for a possible replacement for KeyValueHeap. The good news is that
> > > > this can be done in stages without major disruptions to the code
> > > > base.
> > > >
> > > > Looking at PtDataBlockEncoderSeeker
> > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PtDataBlockEncoderSeeker.java>,
> > > > this would mean transitioning from the getKeyValue() method, which
> > > > creates and fills a new KeyValue every time it's called, to the
> > > > getCurrentCell() method, which returns a reference to a Cell buffer
> > > > that is reused as the scanner proceeds. Modifying a reusable Cell
> > > > buffer rather than rapidly shooting off new KeyValues should
> > > > drastically reduce byte[] copying and garbage churn.
> > > >
> > > > I wish I understood the protocol buffers more so I could comment
> > > > specifically on that. The result sent to the client could possibly
> > > > be a plain old encoded data block (byte[]/ByteBuffer) with a header
> > > > similar to the one encoded blocks have on disk (2-byte
> > > > DataBlockEncoding id). The client then uses the same CellScanner
> > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/scanner/CellScanner.java>
> > > > that the server uses when reading blocks from the block cache.
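The getKeyValue()-vs-getCurrentCell() distinction above can be illustrated with a toy seeker that overwrites one mutable buffer in place. All names here are simplified stand-ins, not the real PtDataBlockEncoderSeeker API:

```java
// Contrast with a getKeyValue()-style method that allocates a fresh
// KeyValue per call: this seeker fills one shared buffer in place.
public class ReusableCellSketch {
    static class MutableCell {
        String row;
        String qualifier;
        String value;
    }

    static class Seeker {
        private final String[][] cells;   // pretend decoded block contents
        private final MutableCell current = new MutableCell();
        private int pos = -1;

        Seeker(String[][] cells) { this.cells = cells; }

        boolean advance() {
            if (pos + 1 >= cells.length) return false;
            pos++;
            // overwrite the shared buffer - no new allocation per cell
            current.row = cells[pos][0];
            current.qualifier = cells[pos][1];
            current.value = cells[pos][2];
            return true;
        }

        // Returns the same object every time; callers must copy if they
        // need the cell to outlive the next advance(). This is also why
        // peek()-style lookahead breaks: there is only one buffer.
        MutableCell getCurrentCell() { return current; }
    }
}
```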
> > > > A nice side-effect of sending the client an encoded byte[] is that
> > > > the java client can run the same decoder that the server uses,
> > > > which should be tremendously faster and more memory efficient than
> > > > the current method of building a pointer-heavy result map. I had
> > > > envisioned this kind of thing being baked into ClientV2, but I
> > > > guess it could be wrangled into the current one if someone wanted.
> > > >
> > > > food for thought... cheers,
> > > > Matt
> > > >
> > > > ps - I'm travelling tomorrow so may be silent on email
> > > >
> > > > On Sat, Sep 1, 2012 at 9:03 PM, lars hofhansl <[email protected]> wrote:
> > > > > In 0.96 we are changing the wire protocol to use protobufs.
> > > > >
> > > > > While we're at it, I am wondering whether we can optimize a few
> > > > > things:
> > > > >
> > > > > 1. A Put or Delete can send many KeyValues, all of which have the
> > > > > same row key, and many will likely have the same column family.
> > > > > 2. Likewise, a Scan result or Get is for a single row. Each KV
> > > > > will again have the same row key, and many will have the same
> > > > > column family.
> > > > > 3. The client and server do not need to share the same KV
> > > > > implementation as long as they are (de)serialized the same. KVs
> > > > > on the server will be backed by a shared larger byte[] (the block
> > > > > read from disk); the KVs in the memstore will probably have the
> > > > > same implementation (to use slabs, but maybe even here it would
> > > > > be beneficial to store the row key and CF separately and share
> > > > > between KVs where possible). Client KVs on the other hand could
> > > > > share a row key and/or column family.
> > > > >
> > > > > This would require a KeyValue interface and two different
> > > > > implementations; one backed by a byte[], another that stores the
> > > > > pieces separately.
> > > > > Once that is done one could even envision KVs backed by a
> > > > > ByteBuffer.
> > > > >
> > > > > Both (de)serialize the same, so when the server serializes the
> > > > > KVs it would send the row key first, then the CF, then column,
> > > > > TS, finally followed by the value. The client could deserialize
> > > > > this and directly reuse the shared parts in its KV
> > > > > implementation. That has the potential to significantly cut down
> > > > > client/server network IO and save memory on the client,
> > > > > especially with wide columns.
> > > > >
> > > > > Turning KV into an interface is a major undertaking. Would it be
> > > > > worth the effort? Or maybe the RPC should just be compressed?
> > > > >
> > > > > We'd have to do that before 0.96.0 (I think), because even
> > > > > protobuf would not provide enough flexibility to make such a
> > > > > change later - which incidentally leads to another discussion
> > > > > about whether client and server should do an initial handshake to
> > > > > detect each other's version, but that is a different story.
> > > > >
> > > > > -- Lars
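Lars's serialization order (row key first, then CF, then the per-KV column/TS/value, with the repeated parts sent once) could be sketched like this. The delimiter format is invented purely for illustration and is not the actual HBase protobuf encoding:

```java
import java.util.List;

// Encode a run of KVs from one row: emit the row and family once, then
// only the per-KV column/ts/value, so the client can share the prefix.
public class SharedPrefixEncoding {
    record KV(String row, String family, String column, long ts, String value) {}

    static String encodeRow(List<KV> kvs) {
        StringBuilder out = new StringBuilder();
        // shared prefix, written a single time for the whole run
        out.append(kvs.get(0).row()).append('|')
           .append(kvs.get(0).family()).append('|');
        for (KV kv : kvs) {
            out.append(kv.column()).append(':')
               .append(kv.ts()).append(':')
               .append(kv.value()).append(';');
        }
        return out.toString();
    }
}
```

A client-side KV implementation could then hold one reference to the decoded row/family and point each KV at it, which is where the memory savings for wide rows would come from.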
