Thanks for the update, Matt. w.r.t. Cell class, since it is so fundamental, should it reside in org. apache.hadoop.hbase namespace as KeyValue class does ? For CellAppender, is compile() equivalent to flushing ?
Looking forward to your publishing on the reviewboard. On Sat, Sep 1, 2012 at 11:29 PM, Matt Corgan <[email protected]> wrote: > RPC encoding would be really nice since there is sometimes significant wire > traffic that could be reduced many-fold. I have a particular table that i > scan and stream to a gzipped output file on S3, and i've noticed that while > the app server's network input is 100Mbps, the gzipped output can be 2Mbps! > > Finishing the PrefixTree has been slow because I've saved a couple tricky > issues to the end and am light on time. i'll try to put it on reviewboard > monday despite a known bug. It is built with some of the ideas you mention > in mind, Lars. Take a look at the > Cell< > https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/Cell.java > > > and CellAppender< > https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/appender/CellAppender.java > > > classes > and their comments. The idea with the CellAppender is to stream cells into > it and periodically compile()/flush() into a byte[] which can be saved to > an HFile or (eventually) sent over the wire. For example, in > HRegion.get(..), the CellAppender would replace the "ArrayList<KeyValue> > results" collection. > > After introducing the Cell interface, the trick to extending the encoded > cells up the HBase stack will be to reduce the reliance on stand-alone > KeyValues. We'll want things like the Filters and KeyValueHeap to be able > to operate on reused Cells without materializing them into full KeyValues. > That means that something like StoreFileScanner.peek() will not work > because the scanner cannot maintain the state of the currrent and next > Cells at the same time. See > CellCollator< > https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/collator/CellCollator.java > > > for > a possible replacement for KeyValueHeap. The good news is that this can be > done in stages without major disruptions to the code base. > > Looking at PtDataBlockEncoderSeeker< > https://github.com/hotpads/hbase/blob/prefix-tree/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PtDataBlockEncoderSeeker.java > >, > this would mean transitioning from the getKeyValue() method that creates > and fills a new KeyValue every time it's called to the getCurrentCell() > method which returns a reference to a Cell buffer that is reused as the > scanner proceeds. Modifying a reusable Cell buffer rather than rapidly > shooting off new KeyValues should drastically reduce byte[] copying and > garbage churn. > > I wish I understood the protocol buffers more so I could comment > specifically on that. The result sent to the client can possibly be a > plain old encoded data block (byte[]/ByteBuffer) with a similar header to > the one encoded blocks have on disk (2 byte DataBlockEncoding id). The > client then uses the same > CellScanner< > https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/scanner/CellScanner.java > >that > the server uses when reading blocks from the block cache. A nice > side-effect of sending the client an encoded byte[] is that the java client > can run the same decoder that the server uses which should be tremendously > faster and more memory efficient than the current method of building a > pointer-heavy result map. I had envisioned this kind of thing being baked > into ClientV2, but i guess it could be wrangled into the current one if > someone wanted. > > food for thought... cheers, > Matt > > ps - i'm travelling tomorrow so may be silent on email > > On Sat, Sep 1, 2012 at 9:03 PM, lars hofhansl <[email protected]> wrote: > > > In 0.96 we changing the wire protocol to use protobufs. > > > > While we're at it, I am wondering whether we can optimize a few things: > > > > > > 1. A Put or Delete can send many KeyValues, all of which have the same > row > > key and many will likely have the same column family. > > 2. Likewise a Scan result or Get is for a single row. Each KV will again > > will have the same row key and many will have the same column family. > > 3. The client and server do not need to share the same KV implementation > > as long as they are (de)serialized the same. KVs on the server will be > > backed by a shared larger byte[] (the block reads from disk), the KVs in > > the memstore will probably have the same implementation (to use slab, but > > maybe even here it would be benificial to store the row key and CF > > separately and share between KV where possible). Client KVs on the other > > hand could share a row key and or column family. > > > > This would require a KeyValue interface and two different > implementations; > > one backed by a byte[] another that stores the pieces separately. Once > that > > is done one could even envision KVs backed by a byte buffer. > > > > Both (de)serialize the same, so when the server serializes the KVs it > > would send the row key first, then the CF, then column, TS, finally > > followed by the value. The client could deserialize this and directly > reuse > > the shared part in its KV implementation. > > That has the potentially to siginificantly cut down client/server network > > IO and save memory on the client, especially with wide columns. > > > > Turning KV into an interface is a major undertaking. Would it be worth > the > > effort? Or maybe the RPC should just be compressed? > > > > > > We'd have to do that before 0.96.0 (I think), because even protobuf would > > not provide enough flexibility to make such a change later - which > > incidentally leads to another discussion about whether client and server > > should do an initial handshake to detect each others version, but that > is a > > different story. > > > > > > -- Lars > > > > >
