High-level this sounds like a great. Inline below is some feedback and a bit of history on how we got here in case it helps:
On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <mcor...@hotpads.com> wrote: > * refer to prefix compression as "compaction" to avoid interfering with > traditional "compression" > 'Compaction' is already a term in hbase. Your suggestion below may reproduce an issue we currently have where the term 'split' can refer to the splitting of logs on regionserver recovery or spit of a region because its grown too large. Just a headsup. > * main goal is to reduce size of data in memory to increase effective > capacity of block index, block cache, and memstores Ok. At some cost in CPU. > * disk access takes thousands of times longer than memory access, > especially over hdfs/network > Yes. I would underline the clause that starts 'especially' in the above. > * secondary goal is to speed up the processing of in memory data > * memory access takes 200+ cpu cycles, so must use cache friendly > algorithms > * structure trie node sizes around typical 64B cache line size How would you measure our effectiveness here? > * investigate java's memory prefetching capabilities, possibly > increasing to ~256B nodes Do you have a pointer on this phenomeon (I'm ignorant and would learn more)? > * almost always use variable length integer/long encoding > In my limited experience I've found these to be CPU intensive making the parse of a compound byte array of rowkey+family+ etc. such as KV is now complicated. On other hand, our Benoit reports having made significant improvements over the hadoop varlength I was using. > * need separate compacted structures/algorithms for rowKey, colQualifier, > and timestamp Why you think that Matt? Wouldn't rowKey and say colQualifier share an algorithm because both are arbitrary byte arrays? Or, are you talking about context; the fact that the previous KV may have same row and family say? (We've talked of using a code for column family keeping a table of codes to family name...). > * structures will need to interact with varInt pointers to link > rowKey<->colQualifier<->timestamp > > * i think in the medium to long term (and maybe even short term) it would be > best to code this from scratch > I'd suggest some code sketches apart from hbase would be cool illustrating what you are thinking. KV is pervasive; from client API all the ways out to entries in HFile. Changing it is going to be a PITA. > > Static row key compaction for block cache and data blocks: > * avoid schemes where each row is a diff of the previous (expensive cpu) Yes. I think this is likely the case. > * utilize the fact that row keys are always sorted (unlike column > qualifiers) Column qualifiers are also sorted. > * this makes for fast tree construction during flushing and compaction Because you are filling the tree w/ sorted data? > * low hanging fruit: (rowKeyCompaction=simple) > * pull off the common prefix of all rows In a block context? As the context's change, would we have a different means of doing this (Examples of contexts: memstore, block cache, hfile). > * modified to accept full byte range 0-255, store full byte strings in > the tree, remove all pointers, optimize for cache line size, etc, but > similar idea How well can this be done in a language such as java? Does its indirection make it so its not possible to be cache line size consious? > * CSS trees are considered sub-par because they are not easily modified, > but that is ok in hbase's block cache and data blocks And for the memstore? Do you have an idea for what kind of structure we could use here, one that exploits 'compressed' values? > * input is a sorted list of byte[], output is a byte[] containing the entire > prefix trie > * possibly put all keys in the front of the block, and all values at the > end, with value offsets/lengths in the keys > How do you think this would improve things? > Backwards compatibility > * block cache could quickly compute the trie when loaded, therefore working > with HFile v1 I don't think we need to be backward compatible. We just need to be able to migrate from old to the new (in fact, I'd say do not be backward compatible -- having to do so in my experience makes the code more complex and inevitable compromises an implementation) > * eventually want to store the computed trie on disk > * HFile v2 will need to store information about the pluggable data block > compaction schemes > Yes. As has been said v2 has a hierarchical index. That might be hard to make pluggable? What do you think? > > Memstore dynamic row key compaction: > * this will be much more difficult than the static compaction because it > needs to accept unsorted byte[]'s and support row locking Row 'locking' is done using MVCC -- a sequence/edit number is added to KVs in-memory only and only edits with a sequence/edit number older than a current moving point are viewable. This mechanism or one like it could be used going forward/ > * use modified > C-Burstsort<http://ww2.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf> > for > dynamic unsorted data > * fast and compact (best dynamic string sorting algorithm i could find) > * fragmentation friendly (not original goal, but ends up working like MSLAB) > * allocates large (customizable) byte[] for holding multiple leaf values > * "bursts" the byte[] when they fill up, creating multiple child arrays > * will need to add some sort of lockable object to the leaf of the rowKey > trie > Do we need too? (Thanks for the paper pointers) > > Column qualifier compaction: > * config param colQualifierCompaction=(none, lookupTable, cssTree64, etc) > * use lookupTable when column qualifiers are a small bounded set > * store the lookup table in memory (similar to blockIndex or > fixedFileTrailer) > * on second thought, this could be difficult without 2-pass compaction, > but maybe still possible > * use cssTree when a large or unbounded set, like when appending a numerical > suffix in wide rows > > > Timestamp compaction: > * timestampCompaction=(none, hfileDiff, cssTree, flatten) > * each hfile needs a lowestTimestamp stored in the trailer > * hflieDiff would store a varLong relative to the lowestTimestamp > * cssTree would do the same as it does for rowKeys, but the trie overhead > might wipe out the space savings with such small values Yes. Should we be doing 64bit timestamps? Then it might not? > * flatten would not store a timestamp in each record, rather use the hfile's > lowestTimestamp for every record > * many use cases don't care about the timestamp > * i think this works for maxVersions=1. may need more elaborate scheme > for multiple versions > > > I'll stop there... sorry for the huge email! I haven't seen much > discussion on email or jira about how prefix compression should be done, so > I hope this provides somewhat of a starting point. > Thanks for the provocative mail Matt. We arrived at KV because previously we had an Object of key, value and timestamp which we were copying multiple times on the way in and then again on the way. KV was an attempt at cutting down on the copies and object creation. At the time we thought byte arrays raw and fast. We looked at varint'ing it but savings didn't seem worth the bother or the extra CPU and it make parse of the KV byte array harder. St.Ack