On Fri, Sep 16, 2011 at 7:29 PM, Matt Corgan <mcor...@hotpads.com> wrote:
> Ryan - thanks for the feedback. The situation I'm thinking of where it's useful to parse DirectBB without copying to heap is when you are serving small random values out of the block cache. At HotPads, we'd like to store hundreds of GB of real estate listing data in memory so it can be quickly served up at random. We want to access many small values that are already in memory, so basically skipping step 1 of 3 because the values are already in memory. That being said, the DirectBB are not essential for us since we haven't run into gb problems; I just figured it would be nice to support them since they seem to be important to other people.
>
> My motivation for doing this is to make hbase a viable candidate for a large, auto-partitioned, sorted, *in-memory* database. Not the usual analytics use case, but I think hbase would be great for this.
What exactly about the current system makes it not a viable candidate?
>
> On Fri, Sep 16, 2011 at 7:08 PM, Ryan Rawson <ryano...@gmail.com> wrote:
>
>> On Fri, Sep 16, 2011 at 6:47 PM, Matt Corgan <mcor...@hotpads.com> wrote:
>> > I'm a little confused over the direction of the DBBs in general, hence the lack of clarity in my code.
>> >
>> > I see value in doing fine-grained parsing of the DBB if you're going to have a large block of data and only want to retrieve a small KV from the middle of it. With this trie design, you can navigate your way through the DBB while copying hardly anything to the heap. It would be a shame to blow away your entire L1 cache by loading a whole 256KB block onto the heap if you only want to read 200 bytes out of the middle... it can be done ultra-efficiently.
>>
>> This paragraph is not factually correct. The DirectByteBuffer vs main heap has nothing to do with the CPU cache. Consider the following scenario:
>>
>> - read block from DFS
>> - scan block in ram
>> - prepare result set for client
>>
>> Pretty simple, we have a choice in step 1:
>> - write to java heap
>> - write to DirectByteBuffer off-heap controlled memory
>>
>> In either case, you are copying to memory, and therefore cycling thru the cpu cache (of course). The difference is whether the Java GC has to deal with the aftermath or not.
>>
>> So the question "DBB or not" is not one about CPU caches, but one about garbage collection. Of course, nothing is free, and dealing with DBB requires extensive in-situ bounds checking (look at the source code for that class!), and also requires manual memory management on behalf of the programmer. So you are faced with an expensive API (getByte is not as good as an array get), and a lot more homework to do. I have decided it's not worth it personally and am not chasing that line as a potential performance improvement, and I would encourage you not to either.
>>
>> Ultimately the DFS speed issues need to be solved by the DFS - HDFS needs more work, but alternatives are already there and are a lot faster.
>>
>> > The problem is if you're going to iterate through an entire block made of 5000 small KV's doing thousands of DBB.get(index) calls. Those are like 10x slower than byte[index] calls. In that case, if it's a DBB, you want to copy the full block on-heap and access it through the byte[] interface. If it's a HeapBB, then you already have access to the underlying byte[].
>>
>> Yes, this is the issue - you have to take an extra copy one way or another. Doing effective prefix compression with DBB is not really feasible imo, and that's another reason why I have given up on DBBs.
>>
>> > So there's possibly value in implementing both methods. The main problem I see is a lack of interfaces in the current code base. I'll throw one suggestion out there as food for thought. Create a new interface:
>> >
>> > interface HCell {
>> >   byte[] getRow();
>> >   byte[] getFamily();
>> >   byte[] getQualifier();
>> >   long getTimestamp();
>> >   byte getType();
>> >   byte[] getValue();
>> >
>> >   //plus an endless list of convenience methods:
>> >   int getKeyLength();
>> >   KeyValue getKeyValue();
>> >   boolean isDelete();
>> >   //etc, etc (or put these in sub-interface)
>> > }
>> >
>> > We could start by making KeyValue implement that interface and then slowly change pieces of the code base to use HCell.
>> > That will allow us to start elegantly working in different implementations. PtKeyValue <https://github.com/hotpads/hbase-prefix-trie/blob/master/src/org/apache/hadoop/hbase/keyvalue/trie/compact/read/PtKeyValue.java> would be one of them. During the transition, you can always call PtKeyValue.getCopiedKeyValue() which will instantiate a new byte[] in the traditional KeyValue format.
>>
>> I am not really super keen here, and while the interface of course makes plenty of sense, the issue is that you will need to turn an array of KeyValues (aka a Result instance) into a bunch of bytes on the wire. So there HAS to be a method that returns a ByteBuffer that the IO layer can then use to write out (via scatter/gather type network APIs) to the wire.
>>
>> A better choice I think would be to remove this method:
>> public byte [] getBuffer()
>>
>> then deal with the places that use this - there are a bunch, but nothing looks super impossible (ie: no interface changes to filters, etc).
>>
>> Once you have that, making multiple implementations of key value is easier. I'd rather that key value becomes the abstract base class, and subclasses implement concrete details.
>>
>> > We'd also want an interface for HFileBlock, and a few others...
>> >
>> > Some of this stuff is overwhelming to think about in parallel with the existing hbase code, but it's actually not very complicated from a standalone perspective. If you can isolate it into a module behind an interface, then it's just a bunch of converting things to bytes and back. There are (hopefully) no exceptions, gc pauses, cascading failures, etc... all the things that are hard to handle to begin with and especially time consuming to debug, emulate, and write tests for. There's not even multi-threading! It's pretty easy to write tests for it and then never look at it again.
>>
>> Yes, unit tests for basic logic are good, but ultimately hbase is an integrated whole, and the concurrency problems have been really tough to crack. Things are better than they have ever been, but still a lot of testing to do.
>>
>> > Matt
>> >
>> > On Fri, Sep 16, 2011 at 6:08 PM, Ryan Rawson <ryano...@gmail.com> wrote:
>> >
>> >> Hey this stuff looks really interesting!
>> >>
>> >> On the ByteBuffer, the 'array' byte[] access to the underlying data is totally incompatible with the 'off heap' features that are implemented by DirectByteBuffer. While people talk about DBB in terms of nio performance, if you have to roundtrip the data thru java code, I'm not sure it buys you much - you still need to move data in and out of the main Java heap. Typically this is geared more towards apps which read and write from/to sockets/files with minimal processing.
>> >>
>> >> While in the past I have been pretty bullish on off-heap caching for HBase, I have since changed my mind due to the poor API (ByteBuffer is a sucky way to access data structures in ram), and other reasons (ping me off list if you want). The KeyValue code pretty much presumes that data is in byte[] anyways, and I had thought that even with off-heap caching, we'd still have to copy KeyValues into main-heap during scanning anyways.
>> >>
>> >> Given the minimal size of the hfile blocks, I really don't see an issue with buffering a block output - especially if the savings is fairly substantial.
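To make the interface idea above concrete, here is a minimal sketch of the kind of refactor being discussed: the HCell-style interface with the simplest possible implementation behind it, so that a flat KeyValue or a trie-backed PtKeyValue could both sit behind the same methods and callers never need getBuffer(). Everything below other than the HCell method names quoted above is hypothetical, not actual HBase code.

    // Minimal sketch only: the proposed HCell interface plus a trivial implementation.
    // Class names other than HCell are made up for illustration.
    public interface HCell {
        byte[] getRow();
        byte[] getFamily();
        byte[] getQualifier();
        long getTimestamp();
        byte getType();
        byte[] getValue();
    }

    // Simplest possible backing implementation: each component held as its own array.
    // A KeyValue (flat byte[] layout) or PtKeyValue (trie-backed) would implement the
    // same interface, so scanning code can be written against HCell instead.
    class SimpleCell implements HCell {
        private final byte[] row, family, qualifier, value;
        private final long timestamp;
        private final byte type;

        SimpleCell(byte[] row, byte[] family, byte[] qualifier,
                   long timestamp, byte type, byte[] value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.type = type;
            this.value = value;
        }

        @Override public byte[] getRow()       { return row; }
        @Override public byte[] getFamily()    { return family; }
        @Override public byte[] getQualifier() { return qualifier; }
        @Override public long getTimestamp()   { return timestamp; }
        @Override public byte getType()        { return type; }
        @Override public byte[] getValue()     { return value; }
    }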
>> >> Thanks,
>> >> -ryan
>> >>
>> >> On Fri, Sep 16, 2011 at 5:48 PM, Matt Corgan <mcor...@hotpads.com> wrote:
>> >> > Jacek,
>> >> >
>> >> > Thanks for helping out with this. I implemented most of the DeltaEncoder and DeltaEncoderSeeker. I haven't taken the time to generate a good set of test data for any of this, but it does pass on some very small input data that aims to cover the edge cases I can think of. Perhaps you have full HFiles you can run through it.
>> >> >
>> >> > https://github.com/hotpads/hbase-prefix-trie/tree/master/src/org/apache/hadoop/hbase/keyvalue/trie/deltaencoder
>> >> >
>> >> > I also put some notes on the PtDeltaEncoder regarding how the prefix trie should be optimally used. I can't think of a situation where you'd want to blow it up into the full uncompressed KeyValue ByteBuffer, so implementing the DeltaEncoder interface is a mismatch, but I realize it's only a starting point.
>> >> >
>> >> > You also would never really have a full ByteBuffer of KeyValues to pass to it for compression. Typically, you'd be passing individual KeyValues from the memstore flush or from a collection of HFiles being merged through a PriorityQueue.
>> >> >
>> >> > The end goal is to operate on the encoded trie without decompressing it. Long term, and in certain circumstances, it may even be possible to pass the compressed trie over the wire to the client who can then decode it.
>> >> >
>> >> > Let me know if I implemented that the way you had in mind. I haven't done the seekTo method yet, but will try to do that next week.
>> >> >
>> >> > Matt
>> >> >
>> >> > On Wed, Sep 14, 2011 at 3:43 PM, Jacek Migdal <ja...@fb.com> wrote:
>> >> >
>> >> >> Matt,
>> >> >>
>> >> >> Thanks a lot for the code. Great job!
>> >> >>
>> >> >> As I mentioned in JIRA, I work full time on the delta encoding [1]. Right now the code and integration is almost done. Most of the parts are under review. Since it is a big change we plan to test it very carefully. After that, it will be ported to trunk and open sourced.
>> >> >>
>> >> >> From a quick glimpse, I have taken a different approach. I implemented a few different algorithms which are simpler. They also aim mostly to save space while having fast decompress/compress code. However, the access is still sequential. The goal of my project is to save some RAM by having a compressed BlockCache in memory.
>> >> >>
>> >> >> On the other hand, it seems that you are most concerned about seeking performance.
>> >> >>
>> >> >> I will read your code more carefully. A quick glimpse: we both implemented some routines (like vint), but I expect that there is no overlap.
>> >> >>
>> >> >> I also saw that you spent some time investigating ByteBuffer vs. byte[]. I experienced a significant negative performance impact when I switched to ByteBuffer. However, I postponed this optimization.
>> >> >>
>> >> >> Right now I think the easiest way to go would be for you to implement the DeltaEncoder interface after my change:
>> >> >> http://pastebin.com/Y8UxUByG
>> >> >> (note, there might be some minor changes)
>> >> >>
>> >> >> That way, you will reuse my integration with existing code for "free".
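For readers unfamiliar with the "vint" routines mentioned above: a vint is just a variable-length integer encoding, so that small values (node pointers, deltas) take one or two bytes instead of a full four. The sketch below shows one common scheme (7 data bits per byte with a high continuation bit); it is illustrative only and not necessarily the format used by either implementation being discussed.

    import java.io.ByteArrayOutputStream;

    // Generic variable-length int ("vint") sketch, for illustration only.
    public final class VarInt {

        /** Writes value using 1-5 bytes; smaller non-negative values take fewer bytes. */
        public static void write(ByteArrayOutputStream out, int value) {
            while ((value & ~0x7F) != 0) {
                out.write((value & 0x7F) | 0x80);  // high bit set: more bytes follow
                value >>>= 7;
            }
            out.write(value);                      // final byte, high bit clear
        }

        /** Reads a value previously written by write(), starting at offset. */
        public static int read(byte[] in, int offset) {
            int result = 0;
            int shift = 0;
            byte b;
            do {
                b = in[offset++];
                result |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            return result;
        }
    }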
>> >> >> Jacek
>> >> >>
>> >> >> [1] - I prefer to call it that way. Prefix is one of the algorithms, but there are also different approaches.
>> >> >>
>> >> >> On 9/13/11 1:36 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>> >> >>
>> >> >> >Matt:
>> >> >> >Thanks for the update.
>> >> >> >Cacheable interface is defined in:
>> >> >> >src/main/java/org/apache/hadoop/hbase/io/hfile/Cacheable.java
>> >> >> >
>> >> >> >You can find the implementation at:
>> >> >> >src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
>> >> >> >
>> >> >> >I will browse your code later.
>> >> >> >
>> >> >> >On Tue, Sep 13, 2011 at 12:44 AM, Matt Corgan <mcor...@hotpads.com> wrote:
>> >> >> >
>> >> >> >> Hi devs,
>> >> >> >>
>> >> >> >> I put a "developer preview" of a prefix compression algorithm on github. It still needs some details worked out, a full set of iterators, about 200 optimizations, and a bunch of other stuff... but, it successfully passes some preliminary tests so I thought I'd get it in front of more eyeballs sooner rather than later.
>> >> >> >>
>> >> >> >> https://github.com/hotpads/hbase-prefix-trie
>> >> >> >>
>> >> >> >> It depends on HBase's Bytes.java and KeyValue.java, which depend on hadoop. Those jars are in there, along with some others for full HFile testing in the near future.
>> >> >> >>
>> >> >> >> A good place to start looking at the code is org.apache.hadoop.hbase.keyvalue.trie.builder.KeyValuePtBuilder. It's the main driver of the compaction side of things, taking KeyValues (in sorted order) and generating a byte[] to be saved as a disk block. Then for reading it back in, there is trie.compact.read.PtIterator which takes the byte[] and spits out KeyValues. The main test class is trie.test.row.RowBuilderTests which round-trips a bunch of KeyValues to make sure the outputs are the same as the inputs. trie.compact.row.node.PtRowNode is the format of a single trie node in the underlying byte[].
>> >> >> >>
>> >> >> >> The current version generates a lot of little objects (garbage) while writing and reading. I plan to optimize it to the point where most variables are primitives on the stack (at least when reading), but I think these intermediate classes are important for debugging. I'll probably try to keep them around going forward and develop a more compact version in parallel.
>> >> >> >>
>> >> >> >> It uses trie style compression for the row keys and column qualifiers, where pointers between nodes are compacted ints. It keeps a list of compacted, de-duped deltas for timestamps, and if they're all the same, it stores only one (long) per block. If all KeyValue.TYPE operations are the same, then it only stores one (byte) per block.
>> >> >> >>
>> >> >> >> It's designed around efficient cpu cache usage and elimination of 8-byte pointers, so should be fast.
>> >> >> >> Get calls can traverse the trie nodes to dig up a row key while barely loading anything from memory to cache, as opposed to current hbase which may load the better part of a block into cache. Scanners/Filters/Comparators can all be designed to be trie-aware so they can iterate through 20 columnQualifiers in the same row without constantly re-scanning the rowKey bytes... etc...
>> >> >> >>
>> >> >> >> Here are a few questions we'll need to answer at some point:
>> >> >> >>
>> >> >> >> * where should this code go?
>> >> >> >>   - I'd vote for keeping it isolated and limiting references back to the main hbase project, sort of like the gzip/lzo algorithms.
>> >> >> >>   - If it's strictly isolated, it'll be easier to keep it well tested for correctness/performance and let more people tinker with it to make it faster. It'll also force us to come up with the right interfaces to allow other compression implementations.
>> >> >> >>
>> >> >> >> * current HFileWriter sends KeyValues to the OutputStream as soon as they're processed, but would it be ok to queue up a whole block in memory and write it all at once?
>> >> >> >>   - I'd vote for yes. It makes it easier to arrange the data to be more read-friendly. Also, we're only talking about one block of data, which is presumably a fraction of the block cache's size.
>> >> >> >>
>> >> >> >> * should the block bytes be accessed as a byte[] or ByteBuffer? I know there's been work on off-heap cache, but I've read the blocks are pulled on-heap before they're parsed (??)
>> >> >> >>   - See org.apache.hadoop.hbase.keyvalue.trie.profile.MemoryAccessProfiler for a comparison of byte[] vs ByteBuffer speed tests. ByteBuffer looks ~2-10x slower, but some people need the off-heap cache.
>> >> >> >>   - I'll propose maintaining separate reading algorithms for each, given that the underlying bytes are in the exact same format. It should be possible to copy/paste the code and replace bytes[i] with ByteBuffer.get(i), and create parallel versions of static util methods.
>> >> >> >>
>> >> >> >> * each block has some metadata wrapped in a class called PtBlockMeta. Does HBase currently have a way to store its values as java primitives on the heap rather than parsing them out of the byte[]/ByteBuffer on every access? If they have to be parsed out on every block access, that could take more time than the Get query it's trying to perform.
>> >> >> >>   - I know there's a new Cacheable interface or something like that. Maybe that already supports it or could be enhanced.
>> >> >> >>
>> >> >> >> What jiras do you think I should make?
>> >> >> >>
>> >> >> >> Look forward to hearing people's feedback,
>> >> >> >> Matt
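On the byte[] vs ByteBuffer question above, the "copy/paste and swap the accessor" idea would look roughly like the sketch below: the same fixed-width read implemented once against an on-heap array and once against a (possibly direct) ByteBuffer using absolute gets. Class and method names here are made up for illustration; they are not from the prefix-trie code.

    import java.nio.ByteBuffer;

    // Illustrative parallel read paths; not actual code from the repository above.
    public final class BlockReadUtil {

        /** Reads a 4-byte big-endian int at offset from an on-heap block. */
        public static int readInt(byte[] block, int offset) {
            return ((block[offset]     & 0xFF) << 24)
                 | ((block[offset + 1] & 0xFF) << 16)
                 | ((block[offset + 2] & 0xFF) << 8)
                 |  (block[offset + 3] & 0xFF);
        }

        /** Same read against a ByteBuffer (heap or direct), using absolute get(int). */
        public static int readInt(ByteBuffer block, int offset) {
            return ((block.get(offset)     & 0xFF) << 24)
                 | ((block.get(offset + 1) & 0xFF) << 16)
                 | ((block.get(offset + 2) & 0xFF) << 8)
                 |  (block.get(offset + 3) & 0xFF);
        }
    }

ByteBuffer also provides getInt(int) directly; the per-byte form above just mirrors the array version so the two code paths stay line-for-line comparable, which is the point of keeping the underlying byte format identical.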