Re: prefix compression

Stack Thu, 02 Jun 2011 21:49:05 -0700

High-level this sounds like a great.

Inline below is some feedback and a bit of history on how we got here
in case it helps:

On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <[email protected]> wrote:
> * refer to prefix compression as "compaction" to avoid interfering with
> traditional "compression"
>

'Compaction' is already a term in hbase.  Your suggestion below may
reproduce an issue we currently have where the term 'split' can refer
to the splitting of logs on regionserver recovery or spit of a region
because its grown too large.  Just a headsup.

> * main goal is to reduce size of data in memory to increase effective
> capacity of block index, block cache, and memstores

Ok.  At some cost in CPU.

>    * disk access takes thousands of times longer than memory access,
> especially over hdfs/network
>

Yes.  I would underline the clause that starts 'especially' in the above.

> * secondary goal is to speed up the processing of in memory data
>    * memory access takes 200+ cpu cycles, so must use cache friendly
> algorithms
>    * structure trie node sizes around typical 64B cache line size

How would you measure our effectiveness here?

>    * investigate java's memory prefetching capabilities, possibly
> increasing to ~256B nodes

Do you have a pointer on this phenomeon (I'm ignorant and would learn more)?

>    * almost always use variable length integer/long encoding
>

In my limited experience I've found these to be CPU intensive making
the parse of a compound byte array of rowkey+family+ etc. such as KV
is now complicated.

On other hand, our Benoit reports having made significant improvements
over the hadoop varlength I was using.

> * need separate compacted structures/algorithms for rowKey, colQualifier,
> and timestamp

Why you think that Matt?  Wouldn't rowKey and say colQualifier share
an algorithm because both are arbitrary byte arrays?  Or, are you
talking about context; the fact that the previous KV may have same row
and family say?  (We've talked of using a code for column family
keeping a table of codes to family name...).

>    * structures will need to interact with varInt pointers to link
> rowKey<->colQualifier<->timestamp
>
> * i think in the medium to long term (and maybe even short term) it would be
> best to code this from scratch
>

I'd suggest some code sketches apart from hbase would be cool
illustrating what you are thinking.

KV is pervasive; from client API all the ways out to entries in HFile.
 Changing it is going to be a PITA.

>
> Static row key compaction for block cache and data blocks:
> * avoid schemes where each row is a diff of the previous (expensive cpu)

Yes.  I think this is likely the case.

> * utilize the fact that row keys are always sorted (unlike column
> qualifiers)

Column qualifiers are also sorted.

>    * this makes for fast tree construction during flushing and compaction

Because you are filling the tree w/ sorted data?

> * low hanging fruit: (rowKeyCompaction=simple)
>    * pull off the common prefix of all rows

In a block context?  As the context's change, would we have a
different means of doing this  (Examples of contexts: memstore, block
cache, hfile).

>    * modified to accept full byte range 0-255, store full byte strings in
> the tree, remove all pointers, optimize for cache line size, etc, but
> similar idea

How well can this be done in a language such as java?  Does its
indirection make it so its not possible to be cache line size
consious?

>    * CSS trees are considered sub-par because they are not easily modified,
> but that is ok in hbase's block cache and data blocks

And for the memstore?  Do you have an idea for what kind of structure
we could use here, one that exploits 'compressed' values?

> * input is a sorted list of byte[], output is a byte[] containing the entire
> prefix trie
> * possibly put all keys in the front of the block, and all values at the
> end, with value offsets/lengths in the keys
>

How do you think this would improve things?

> Backwards compatibility
> * block cache could quickly compute the trie when loaded, therefore working
> with HFile v1

I don't think we need to be backward compatible.  We just need to be
able to migrate from old to the new (in fact, I'd say do not be
backward compatible -- having to do so in my experience makes the code
more complex and inevitable compromises an implementation)

>    * eventually want to store the computed trie on disk
> * HFile v2 will need to store information about the pluggable data block
> compaction schemes
>

Yes.

As has been said v2 has a hierarchical index.  That might be hard to
make pluggable?  What do you think?

>
> Memstore dynamic row key compaction:
> * this will be much more difficult than the static compaction because it
> needs to accept unsorted byte[]'s and support row locking

Row 'locking' is done using MVCC -- a sequence/edit number is added to
KVs in-memory only and only edits with a sequence/edit number older
than a current moving point are viewable.  This mechanism or one like
it could be used going forward/

> * use modified 
> C-Burstsort<http://ww2.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf>
> for
> dynamic unsorted data
> * fast and compact (best dynamic string sorting algorithm i could find)
> * fragmentation friendly (not original goal, but ends up working like MSLAB)
>    * allocates large (customizable) byte[] for holding multiple leaf values
>    * "bursts" the byte[] when they fill up, creating multiple child arrays
> * will need to add some sort of lockable object to the leaf of the rowKey
> trie
>

Do we need too?

(Thanks for the paper pointers)

>
> Column qualifier compaction:
> * config param colQualifierCompaction=(none, lookupTable, cssTree64, etc)
> * use lookupTable when column qualifiers are a small bounded set
>    * store the lookup table in memory (similar to blockIndex or
> fixedFileTrailer)
>    * on second thought, this could be difficult without 2-pass compaction,
> but maybe still possible
> * use cssTree when a large or unbounded set, like when appending a numerical
> suffix in wide rows
>
>
> Timestamp compaction:
> * timestampCompaction=(none, hfileDiff, cssTree, flatten)
> * each hfile needs a lowestTimestamp stored in the trailer
> * hflieDiff would store a varLong relative to the lowestTimestamp
> * cssTree would do the same as it does for rowKeys, but the trie overhead
> might wipe out the space savings with such small values

Yes.  Should we be doing 64bit timestamps?  Then it might not?

> * flatten would not store a timestamp in each record, rather use the hfile's
> lowestTimestamp for every record
>    * many use cases don't care about the timestamp
>    * i think this works for maxVersions=1.  may need more elaborate scheme
> for multiple versions
>
>
> I'll stop there...  sorry for the huge email!  I haven't seen much
> discussion on email or jira about how prefix compression should be done, so
> I hope this provides somewhat of a starting point.
>

Thanks for the provocative mail Matt.

We arrived at KV because previously we had an Object of key, value and
timestamp which we were copying multiple times on the way in and then
again on the way.  KV was an attempt at cutting down on the copies and
object creation.  At the time we thought byte arrays raw and fast.  We
looked at varint'ing it but savings didn't seem worth the bother or
the extra CPU and it make parse of the KV byte array harder.

St.Ack

Re: prefix compression

Reply via email to