Thanks for the feedback, Stack.  Some inline responses:

On Thu, Jun 2, 2011 at 9:48 PM, Stack <st...@duboce.net> wrote:

> High-level, this sounds like a great idea.
>
> Inline below is some feedback and a bit of history on how we got here
> in case it helps:
>
> On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <mcor...@hotpads.com> wrote:
> > * refer to prefix compression as "compaction" to avoid interfering with
> > traditional "compression"
> >
>
> 'Compaction' is already a term in hbase.  Your suggestion below may
> reproduce an issue we currently have where the term 'split' can refer
> to the splitting of logs on regionserver recovery or the split of a region
> because it's grown too large.  Just a heads-up.
>
Ha, yes, not sure how I overlooked that.  Then maybe name the config
params: rowKeyFormat, colQualifierFormat, and timestampFormat.

>
> > * main goal is to reduce size of data in memory to increase effective
> > capacity of block index, block cache, and memstores
>
>
> Ok.  At some cost in CPU.
>
>
> >    * disk access takes thousands of times longer than memory access,
> > especially over hdfs/network
> >
>
> Yes.  I would underline the clause that starts 'especially' in the above.
>
>
> > * secondary goal is to speed up the processing of in memory data
> >    * memory access takes 200+ cpu cycles, so must use cache friendly
> > algorithms
> >    * structure trie node sizes around typical 64B cache line size
>
> How would you measure our effectiveness here?
>

The most important thing is to make the format pluggable; then we can run
HFile performance tests to find the optimal tuning for different
hardware/workloads.

The idea with the 64B is that if you're going to go to memory and get
64B (or 8 words) back, you might as well have designed your data structure
so that you can make use of all 64B.  This requires that you structure the
prefix tries a little differently than older literature recommends.
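
To make that concrete, here's a rough sketch of the kind of node I mean.
Every name and offset below is made up for illustration; nothing here
exists in hbase:

    // Hypothetical trie node packed into one 64B cache line inside a flat
    // byte[], so visiting a node costs exactly one memory fetch.
    final class CacheLineTrieNode {
      static final int NODE_SIZE = 64;  // match the typical cache line

      // Assumed layout of each 64-byte node:
      //   byte  0       numChildren (up to 15)
      //   byte  1       prefixLength
      //   bytes 2..17   up to 16 bytes of shared key prefix
      //   bytes 18..32  first byte of each child's edge
      //   bytes 33..62  fifteen 2-byte offsets to child nodes in the block
      //   byte  63      unused

      // Follow the child whose edge starts with 'next'; -1 if absent.
      static int childNodeOffset(byte[] block, int nodePos, byte next) {
        int numChildren = block[nodePos] & 0xFF;
        for (int i = 0; i < numChildren; i++) {
          if (block[nodePos + 18 + i] == next) {
            int p = nodePos + 33 + 2 * i;
            return ((block[p] & 0xFF) << 8) | (block[p + 1] & 0xFF);
          }
        }
        return -1;
      }
    }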


>
>
> >    * investigate java's memory prefetching capabilities, possibly
> > increasing to ~256B nodes
>
> Do you have a pointer on this phenomenon (I'm ignorant and would learn
> more)?
>
>
Paying attention to prefetching would be a super-optimization, but it is a
very real thing that can buy you another ~20% performance.  64B is the
normal memory fetch size right now, but sometimes the hardware will bring
back even larger chunks.  Some languages offer specific commands to
prefetch... Java doesn't, but the compiler is adding them behind the scenes,
so it could possibly be coerced.  The best reading on the topic is probably
the paper I mentioned earlier
(<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.4872&rep=rep1&type=pdf>),
and then its references if you're still interested.


>
> >    * almost always use variable length integer/long encoding
> >
>
> In my limited experience I've found these to be CPU intensive, and they
> make parsing a compound byte array of rowkey+family+etc., such as KV
> is now, complicated.
>
> On the other hand, our Benoit reports having made significant improvements
> over the hadoop varlength I was using.
>
Pluggable formats would help here so you could tune for mem vs cpu.  My
thought is that most hbase vints would be very short (1-2 byte internal
array offsets), which are faster to decode but, more importantly, save a lot
of memory.  This topic is covered a lot in discussions of inverted index
posting list compression.  Not perfectly relevant, but here's a good
paper <http://www2008.org/papers/pdf/p387-zhangA.pdf> about compressing
integers/longs.  I have no data to back it up, but I feel
like if search engines are primarily storing lists of longs and they decide
it's best to compress them, then that's probably the best method for other
pieces of software as well.
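
To show what I mean by short vints, here's a bare-bones sketch of the
standard 7-bits-per-byte scheme (not the Hadoop WritableUtils code, and the
names are mine):

    final class VInts {
      // Encode an int, 7 bits per byte, high bit set when more bytes follow.
      // Offsets under 128 cost one byte; under 16384, two.
      static int writeVInt(byte[] buf, int pos, int value) {
        while ((value & ~0x7F) != 0) {
          buf[pos++] = (byte) ((value & 0x7F) | 0x80);
          value >>>= 7;
        }
        buf[pos++] = (byte) value;
        return pos;  // position after the last byte written
      }

      // Decode starting at pos; the common 1-2 byte case is branch-cheap.
      static int readVInt(byte[] buf, int pos) {
        int value = 0, shift = 0;
        byte b;
        do {
          b = buf[pos++];
          value |= (b & 0x7F) << shift;
          shift += 7;
        } while ((b & 0x80) != 0);
        return value;
      }
    }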


> > * need separate compacted structures/algorithms for rowKey, colQualifier,
> > and timestamp
>
> Why you think that Matt?  Wouldn't rowKey and say colQualifier share
> an algorithm because both are arbitrary byte arrays?  Or, are you
> talking about context; the fact that the previous KV may have same row
> and family say?  (We've talked of using a code for column family
> keeping a table of codes to family name...).
>

All three parts of the KV could share the same trie algorithm for
compressing arbitrary byte arrays.  However, I think you can sometimes
compress colQualifiers even further (and faster) with a lookup table, and
timestamps even further with a delta scheme or by completely omitting them.
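
A rough sketch of the lookup-table idea for qualifiers (hypothetical class,
just to illustrate):

    import java.util.ArrayList;
    import java.util.HashMap;

    // Hypothetical per-file dictionary: each distinct qualifier gets a small
    // code, so a KV can store a 1-2 byte vint code instead of the full bytes.
    final class QualifierDictionary {
      private final HashMap<String, Integer> codes = new HashMap<String, Integer>();
      private final ArrayList<byte[]> qualifiers = new ArrayList<byte[]>();

      int codeFor(byte[] qualifier) {
        String key = new String(qualifier);  // sketch only; real code would avoid the copy
        Integer code = codes.get(key);
        if (code == null) {
          code = qualifiers.size();
          qualifiers.add(qualifier);
          codes.put(key, code);
        }
        return code;
      }

      byte[] qualifierFor(int code) {
        return qualifiers.get(code);
      }
    }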


>
> >    * structures will need to interact with varInt pointers to link
> > rowKey<->colQualifier<->timestamp
> >
> > * I think in the medium to long term (and maybe even short term) it would
> be
> > best to code this from scratch
> >
>
> I'd suggest some code sketches apart from hbase would be cool
> illustrating what you are thinking.
>

Agreed - I'm working on some simple code examples in my free time.
 Hopefully I can post them in a couple of weeks.


> KV is pervasive; from the client API all the way out to entries in HFile.
>  Changing it is going to be a PITA.
>
> >
> > Static row key compaction for block cache and data blocks:
> > * avoid schemes where each row is a diff of the previous (expensive cpu)
>
> Yes.  I think this is likely the case.
>
>
> > * utilize the fact that row keys are always sorted (unlike column
> > qualifiers)
>
> Column qualifiers are also sorted.
>
Col qualifiers are sorted within a row, but the unit of compression is the
data block.  If you take a data block and split it into 3 parallel lists of
equal length: a list of rowKeys, a list of colQualifiers, and a list of
timestamps, the rowKeys will automatically be sorted, but the other two
will not be (unless the whole block is contained in one row).  Does that
make sense?
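
For example (made-up keys), a block spanning rows r1 and r2 would decompose
as:

    rowKeys:        r1  r1  r1  r2  r2
    colQualifiers:  a   c   e   b   d
    timestamps:     t7  t2  t9  t1  t5

The rowKey list is sorted end to end, but the qualifier list is only sorted
within each row (e comes before b), and the timestamps aren't sorted at all.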


> >    * this makes for fast tree construction during flushing and compaction
>
> Because you are filling the tree w/ sorted data?


> > * low hanging fruit: (rowKeyCompaction=simple)
> >    * pull off the common prefix of all rows
>
> In a block context?  As the contexts change, would we have a
> different means of doing this?  (Examples of contexts: memstore, block
> cache, hfile.)
>

For the memstore and block index, you take the common prefix of the first
and last rowKey.  For a data block, take the common prefix of the first and
last key.  As tables get really large, they are more and more likely to have
longer common prefixes.
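
The mechanics are simple; a sketch:

    final class Prefixes {
      // Number of leading bytes shared by two keys.  Applying this to the
      // first and last key of a block/index gives the prefix to pull off.
      static int commonPrefixLength(byte[] first, byte[] last) {
        int max = Math.min(first.length, last.length);
        int i = 0;
        while (i < max && first[i] == last[i]) {
          i++;
        }
        return i;
      }
    }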

>
>
> >    * modified to accept full byte range 0-255, store full byte strings in
> > the tree, remove all pointers, optimize for cache line size, etc, but
> > similar idea
>
>
> How well can this be done in a language such as java?  Does its
> indirection make it so it's not possible to be cache line size
> conscious?
>

I think it's the hardware that determines the cache line size and does some
of the prefetching, so Java is directly affected by those.  Unfortunately,
I'm not sure it's possible to align the start of an array on a cache line in
Java like it is in C/C++; however, on average you can make use of half the
cache line... and there's always hope for the future.

>
>
> >    * CSS trees are considered sub-par because they are not easily
> modified,
> > but that is ok in hbase's block cache and data blocks
>
> And for the memstore?  Do you have an idea for what kind of structure
> we could use here, one that exploits 'compressed' values?
>
> > * input is a sorted list of byte[], output is a byte[] containing the
> entire
> > prefix trie
> > * possibly put all keys in the front of the block, and all values at the
> > end, with value offsets/lengths in the keys
> >
>
> How do you think this would improve things?
>

Not sure it's always best to do this, but the idea is to not pollute the
cache with value data unless it's actually needed.  Think of the case of a
scanner with a high block cache hit ratio but returning sparse results.
 This may be an improvement even in cases where prefix compression isn't
needed to begin with.  Another case for pluggable formats and testing...
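
Roughly the layout I'm picturing (hypothetical, not an existing HFile block
format):

    +--------------------------------+-------------------+
    | key section                    | value section     |
    | key0 [valueOffset0, valueLen0] | ...value bytes... |
    | key1 [valueOffset1, valueLen1] |                   |
    | ...                            |                   |
    +--------------------------------+-------------------+

A scanner that filters on keys only walks the front of the block, and the
value bytes stay out of cache unless a key actually matches.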

>
>
> > Backwards compatibility
> > * block cache could quickly compute the trie when loaded, therefore
> working
> > with HFile v1
>
>
> I don't think we need to be backward compatible.  We just need to be
> able to migrate from old to new (in fact, I'd say do not be
> backward compatible -- having to do so in my experience makes the code
> more complex and inevitably compromises an implementation).
>

Agree about backwards compatibility being a bad idea (especially in a
pre-1.0 product).  I guess I meant "current compatibility", like it could be
included in 0.90.4 since it doesn't require changes in serialization or file
formats.  Not pushing for that, just noting...


> >    * eventually want to store the computed trie on disk
> > * HFile v2 will need to store information about the pluggable data block
> > compaction schemes
> >
>
> Yes.
>
> As has been said v2 has a hierarchical index.  That might be hard to
> make pluggable?  What do you think?
>
It's not as important for the block index to be pluggable as it is for the
data blocks or memstore.  It could probably be done by having another
columnFamily setting, "blockIndexFormat".
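
For example, something like the following -- "blockIndexFormat" and the
other keys are made-up settings, but HColumnDescriptor already carries
arbitrary key/value metadata:

    HColumnDescriptor family = new HColumnDescriptor("cf");
    family.setValue("blockIndexFormat", "cssTree64");  // hypothetical setting
    family.setValue("rowKeyFormat", "simple");         // hypothetical setting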


> >
> > Memstore dynamic row key compaction:
> > * this will be much more difficult than the static compaction because it
> > needs to accept unsorted byte[]'s and support row locking
>
> Row 'locking' is done using MVCC -- a sequence/edit number is added to
> KVs in-memory only, and only edits with a sequence/edit number older
> than a current moving point are viewable.  This mechanism or one like
> it could be used going forward.


>
> > * use modified C-Burstsort<
> http://ww2.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf>
> > for
> > dynamic unsorted data
> > * fast and compact (best dynamic string sorting algorithm I could find)
> > * fragmentation friendly (not original goal, but ends up working like
> MSLAB)
> >    * allocates large (customizable) byte[] for holding multiple leaf
> values
> >    * "bursts" the byte[] when they fill up, creating multiple child
> arrays
> > * will need to add some sort of lockable object to the leaf of the rowKey
> > trie
> >
>
> Do we need to?
>

Oops, guess not.  But then we'll need some place to store the version of
each KV since each KV is not a standalone thing.  I think this isn't a big
deal... disregard that comment.

>
> (Thanks for the paper pointers)
>
>
> >
> > Column qualifier compaction:
> > * config param colQualifierCompaction=(none, lookupTable, cssTree64, etc)
> > * use lookupTable when column qualifiers are a small bounded set
> >    * store the lookup table in memory (similar to blockIndex or
> > fixedFileTrailer)
> >    * on second thought, this could be difficult without 2-pass
> compaction,
> > but maybe still possible
> > * use cssTree when a large or unbounded set, like when appending a
> numerical
> > suffix in wide rows
> >
> >
> > Timestamp compaction:
> > * timestampCompaction=(none, hfileDiff, cssTree, flatten)
> > * each hfile needs a lowestTimestamp stored in the trailer
> > * hfileDiff would store a varLong relative to the lowestTimestamp
> > * cssTree would do the same as it does for rowKeys, but the trie overhead
> > might wipe out the space savings with such small values
>
> Yes.  Should we be doing 64bit timestamps?  Then it might not?
>

Not sure what you mean here?  Maybe that same trie structure would be fine.
 It does require a few extra bytes per entry though, so even if you compress
to 2 bytes, you're creeping back up towards the original 8 bytes.
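
To make the hfileDiff idea concrete, a sketch (vint encoding as in the
earlier snippet):

    final class TimestampDeltas {
      // Sketch: store each timestamp as a vint delta from the HFile's
      // lowestTimestamp instead of a fixed 8-byte long.  If a file's
      // timestamps cluster together, most deltas fit in 1-4 bytes.
      static int writeTimestampDelta(byte[] buf, int pos, long ts, long lowestTimestamp) {
        long delta = ts - lowestTimestamp;  // >= 0 if lowestTimestamp is the true minimum
        while ((delta & ~0x7FL) != 0) {
          buf[pos++] = (byte) ((delta & 0x7F) | 0x80);
          delta >>>= 7;
        }
        buf[pos++] = (byte) delta;
        return pos;
      }
    }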


>
> > * flatten would not store a timestamp in each record, rather use the
> hfile's
> > lowestTimestamp for every record
> >    * many use cases don't care about the timestamp
> >    * I think this works for maxVersions=1.  May need a more elaborate
> scheme
> > for multiple versions
> >
> >
> > I'll stop there...  sorry for the huge email!  I haven't seen much
> > discussion on email or jira about how prefix compression should be done,
> so
> > I hope this provides somewhat of a starting point.
> >
>
> Thanks for the provocative mail Matt.
>
> We arrived at KV because previously we had an Object of key, value and
> timestamp which we were copying multiple times on the way in and then
> again on the way out.  KV was an attempt at cutting down on the copies and
> object creation.  At the time we thought byte arrays raw and fast.  We
> looked at varint'ing it but the savings didn't seem worth the bother or
> the extra CPU, and it made parsing the KV byte array harder.

> St.Ack
>
