[ https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086574#comment-13086574 ]

Matt Corgan commented on HBASE-4218:
------------------------------------

Sorry I haven't chimed in on this in a while, but I've made significant 
progress implementing some of the ideas I mentioned in the discussion you 
linked to.  The code takes a sorted List<KeyValue>, converts it to a 
compressed byte[], and then provides fast mechanisms for reading the byte[] 
back into KeyValues.  It should work for both block indexes and data blocks.
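
As a rough illustration of that round trip (not the actual Pt* code; the class and method names below are made up), the simplest form of prefix compression over sorted keys can be sketched as:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each key is stored as the length of the prefix it
// shares with its predecessor, plus only the differing suffix.  The real
// Pt* trie structure generalizes this and adds random access.
public class PrefixRoundTrip {
    // Encode: [sharedPrefixLen][suffixLen][suffix bytes] per key.
    // Assumes all lengths fit in a single byte (< 128) for simplicity.
    static byte[] encode(List<String> sortedKeys) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] prev = new byte[0];
        for (String key : sortedKeys) {
            byte[] cur = key.getBytes(StandardCharsets.UTF_8);
            int shared = 0;
            int max = Math.min(prev.length, cur.length);
            while (shared < max && prev[shared] == cur[shared]) shared++;
            out.write(shared);
            out.write(cur.length - shared);
            out.write(cur, shared, cur.length - shared);
            prev = cur;
        }
        return out.toByteArray();
    }

    // Decode: rebuild each key from the previous one plus the stored suffix.
    static List<String> decode(byte[] block) {
        List<String> keys = new ArrayList<>();
        byte[] prev = new byte[0];
        int pos = 0;
        while (pos < block.length) {
            int shared = block[pos++];
            int suffixLen = block[pos++];
            byte[] cur = new byte[shared + suffixLen];
            System.arraycopy(prev, 0, cur, 0, shared);
            System.arraycopy(block, pos, cur, shared, suffixLen);
            pos += suffixLen;
            keys.add(new String(cur, StandardCharsets.UTF_8));
            prev = cur;
        }
        return keys;
    }
}
```

On sorted keys with long shared prefixes, the suffix bytes are a small fraction of the raw key bytes, which is where the savings come from.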

I don't think I'll be able to do the full integration into HBase, but I'm 
trying to get the code to a point where it's well designed, tested, and easy 
(or at least possible) to start working into the code base.  I'll try to get 
it on github in the next couple weeks.  I wish I could dedicate more time, but 
it's been a nights/weekends project.

Here's a quick storage format overview.  Class names begin with "Pt" for 
"Prefix Trie".  

A block of KeyValues gets converted to a byte[] composed of 5 sections:

1) PtBlockMeta stores some offsets into the block, the widths of some 
byte-encoded integers, etc.  http://pastebin.com/iizJz3f4

2) PtRowNodes are the bulk of the complexity.  They store a trie structure for 
rebuilding the row keys in the block.  Each "Leaf" node has a list of offsets 
that point to the corresponding columns, timestamps, and data offsets/lengths 
in that row.  The row data is structured for efficient sequential iteration 
and/or individual row lookups.  http://pastebin.com/cb79N0Ge

3) PtColNodes store a trie structure that provides random access to column 
qualifiers.  A PtRowNode points at one of these, and the reader traverses its 
parents backwards through the trie to rebuild the full column qualifier.  
This is important for wide rows.  http://pastebin.com/7rsq7epp

4) TimestampDeltas are byte-encoded deltas from the minimum timestamp in the 
block.  The PtRowNodes contain pointers to these deltas.  The width of all 
deltas is determined by the longest one, so if every timestamp equals the 
minTimestamp, the deltas have zero width and zero storage cost.

5) A data section made of all data values concatenated together.  The 
PtRowNodes contain the offsets/lengths.
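
To illustrate the width rule in section 4 (a hypothetical sketch, not the Pt* implementation), the per-timestamp byte width is picked from the largest delta against the block's minimum, which makes the all-equal case free:

```java
// Illustrative sketch of the timestamp-delta idea: all deltas in a block
// share one fixed byte width, chosen so the largest delta fits.
public class TimestampDeltas {
    // Smallest number of bytes that can hold every delta from the minimum
    // timestamp; 0 when all timestamps are equal (all deltas are zero).
    static int deltaWidth(long[] timestamps) {
        long min = Long.MAX_VALUE;
        for (long t : timestamps) min = Math.min(min, t);
        long maxDelta = 0;
        for (long t : timestamps) maxDelta = Math.max(maxDelta, t - min);
        int width = 0;
        while (maxDelta > 0) { width++; maxDelta >>>= 8; }
        return width;
    }
}
```

So a block of counters all written with the same timestamp spends no bytes at all on timestamps, while a block spanning a few minutes might spend two or three bytes each instead of eight.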


My first priority is getting the storage format right.  Then optimizing the 
read path.  Then the write path.  I'd love to hear any comments, and will 
continue to work on getting the full code ready.

> Delta Encoding of KeyValues  (aka prefix compression)
> -----------------------------------------------------
>
>                 Key: HBASE-4218
>                 URL: https://issues.apache.org/jira/browse/HBASE-4218
>             Project: HBase
>          Issue Type: Improvement
>          Components: io
>            Reporter: Jacek Migdal
>              Labels: compression
>
> A compression for keys.  Keys are sorted in an HFile and are usually very 
> similar.  Because of that, it is possible to design better compression than 
> general-purpose algorithms.
> It is an additional step designed to be used in memory.  It aims to save 
> memory in the cache as well as to speed up seeks within HFileBlocks.  It 
> should improve performance a lot if key lengths are larger than value 
> lengths.  For example, it makes a lot of sense to use it when the value is 
> a counter.
> Initial tests on real data (key length ~ 90 bytes, value length = 8 bytes) 
> show that I could achieve a decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> while having much better performance (20-80% faster decompression than 
> LZO).  Moreover, it should allow far more efficient seeking, which should 
> improve performance a bit.
> It seems that simple compression algorithms are good enough.  Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs, 
> and bitfields to avoid duplication.  That way, comparisons of compressed 
> data can be much faster than a byte comparator (thanks to prefix 
> compression and bitfields).
> In order to implement it in HBase, two important design changes will be 
> needed:
> - solidify the interface to the HFileBlock / HFileReader Scanner to provide 
> seeking and iterating; accessing the uncompressed buffer in HFileBlock will 
> have bad performance
> - extend comparators to support comparison assuming that the first N bytes 
> are equal (or that some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression
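
The comparator extension in the quoted description could look roughly like this (a hypothetical sketch; the name and signature are made up, not an HBase API): since prefix compression already records how many leading bytes two adjacent keys share, a comparison can skip those bytes and start at the first position that might differ:

```java
// Hypothetical sketch of a prefix-aware comparator: the caller guarantees
// the first commonPrefix bytes of a and b are equal, so comparison starts
// at that offset instead of byte 0.
public class PrefixComparator {
    static int compareAssumingCommonPrefix(byte[] a, byte[] b, int commonPrefix) {
        int len = Math.min(a.length, b.length);
        for (int i = commonPrefix; i < len; i++) {
            // Compare as unsigned bytes, matching lexicographic byte order.
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) return diff;
        }
        return a.length - b.length;  // shorter key sorts first
    }
}
```

With ~90-byte keys that differ only in their last few bytes, skipping the known-equal prefix is where comparisons on compressed data can beat a plain byte comparator.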

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
