[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226351#comment-13226351
 ] 

dhruba borthakur commented on HBASE-5313:
-----------------------------------------

My guess is that, initially, we keep this new "columnar encoding" completely 
isolated inside an HFileBlock. At table creation time, one can specify that the 
table be stored in columnar-encoded fashion.

An HFile will have information in the FixedFileTrailer that specifies whether 
the data inside the HFile is in columnar format. A single HFileBlock will have 
six "column-entities": all the rowkeys will be laid out first, followed by all 
the cf, followed by all the "column names", followed by the timestamps, 
followed by the memstoreTS, followed by all the values.
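
To make the proposed block layout concrete, here is a minimal Python sketch 
(not HBase code). It regroups a row-ordered list of key-values into the six 
column-entities in the order listed above. The KeyValue tuple, its field 
names, and to_columnar are hypothetical names chosen for illustration.

```python
from typing import Dict, List, NamedTuple


class KeyValue(NamedTuple):
    # Hypothetical stand-in for an HBase KeyValue; field names are assumptions.
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int
    memstore_ts: int
    value: bytes


def to_columnar(kvs: List[KeyValue]) -> Dict[str, list]:
    """Regroup row-ordered key-values into one list per column-entity.

    The dict preserves insertion order, which mirrors the on-disk order
    proposed in the comment: rowkeys, cf, column names, timestamps,
    memstoreTS, values.
    """
    return {
        "rowkeys": [kv.row for kv in kvs],
        "families": [kv.family for kv in kvs],
        "qualifiers": [kv.qualifier for kv in kvs],
        "timestamps": [kv.timestamp for kv in kvs],
        "memstore_ts": [kv.memstore_ts for kv in kvs],
        "values": [kv.value for kv in kvs],
    }


kvs = [
    KeyValue(b"r1", b"cf", b"a", 10, 1, b"v1"),
    KeyValue(b"r1", b"cf", b"b", 11, 2, b"v2"),
]
columns = to_columnar(kvs)
```

Entry i of every list belongs to the same original key-value, so a reader 
can rebuild the i-th KeyValue by indexing each column-entity at i.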

If 'prefix-encoding' is enabled, then each column-entity will be prefix encoded 
individually. If compression (lzo, gz, etc) is enabled, the entire HFileBlock 
will be compressed accordingly.
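
A small Python sketch of that two-stage idea, under stated assumptions: one 
column-entity (here the rowkeys) is prefix encoded on its own, and the 
serialized result is then compressed with zlib as a stand-in for gz. The 
function names and the toy wire format are hypothetical and are not the 
HBase encoder.

```python
import zlib
from typing import List, Tuple


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b share."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def prefix_encode(items: List[bytes]) -> List[Tuple[int, bytes]]:
    """Encode each item as (bytes shared with previous item, remaining suffix)."""
    out = []
    prev = b""
    for item in items:
        shared = common_prefix_len(prev, item)
        out.append((shared, item[shared:]))
        prev = item
    return out


def prefix_decode(encoded: List[Tuple[int, bytes]]) -> List[bytes]:
    """Inverse of prefix_encode."""
    items = []
    prev = b""
    for shared, suffix in encoded:
        item = prev[:shared] + suffix
        items.append(item)
        prev = item
    return items


def serialize(encoded: List[Tuple[int, bytes]]) -> bytes:
    # Toy wire format (an assumption): 2-byte shared length,
    # 2-byte suffix length, then the suffix bytes.
    buf = bytearray()
    for shared, suffix in encoded:
        buf += shared.to_bytes(2, "big")
        buf += len(suffix).to_bytes(2, "big")
        buf += suffix
    return bytes(buf)


rowkeys = [b"user0001", b"user0001", b"user0002", b"user0010"]
encoded = prefix_encode(rowkeys)          # per-entity prefix encoding
block = zlib.compress(serialize(encoded))  # whole-block compression
```

Sorted rowkeys share long prefixes, so most entries shrink to a short 
suffix before the general-purpose compressor ever sees them.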

Prefix-encoding will work well for the rowkey entity and the column-family 
entity. The column name entity may need to be sorted and then prefix encoded. 
The timestamp entity may need a special kind of encoding. One option (suggested 
by a co-worker, Chip Turner) is to take the first timestamp as the base, xor 
it with each of the following timestamps (thus zeroing out the common bits), 
and then store the results. Another option is to find the minimum timestamp in 
the block and then store diffs from that minimum value. Yet another option is 
to treat Jan-01-2012 as the hbase-epoch and store the difference from that 
epoch.
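
The three timestamp options can be compared with a short Python sketch. It 
assumes timestamps are milliseconds since the Unix epoch and that the 
"hbase-epoch" means 2012-01-01 00:00:00 UTC; both are assumptions, and the 
function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import List, Tuple

# Assumed hbase-epoch: 2012-01-01 00:00:00 UTC, in milliseconds.
HBASE_EPOCH_MS = int(datetime(2012, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)


def xor_with_first(timestamps: List[int]) -> Tuple[int, List[int]]:
    """Option 1: keep the first timestamp, xor it into each later one."""
    base = timestamps[0]
    return base, [t ^ base for t in timestamps[1:]]


def diff_from_min(timestamps: List[int]) -> Tuple[int, List[int]]:
    """Option 2: keep the block minimum, store non-negative diffs from it."""
    base = min(timestamps)
    return base, [t - base for t in timestamps]


def diff_from_epoch(timestamps: List[int], epoch_ms: int = HBASE_EPOCH_MS) -> List[int]:
    """Option 3: store each timestamp as an offset from a fixed epoch."""
    return [t - epoch_ms for t in timestamps]


ts = [1331000000000, 1331000000005, 1331000000003]
xor_base, xored = xor_with_first(ts)
min_base, diffs = diff_from_min(ts)
offsets = diff_from_epoch(ts)
```

Options 1 and 2 need a per-block base but yield small numbers when the 
timestamps in a block are close together. Option 3 needs no per-block 
state, but its offsets grow as time moves away from the chosen epoch.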

                
> Restructure hfiles layout for better compression
> ------------------------------------------------
>
>                 Key: HBASE-5313
>                 URL: https://issues.apache.org/jira/browse/HBASE-5313
>             Project: HBase
>          Issue Type: Improvement
>          Components: io
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>
> An HFile block contains a stream of key-values. Can we organize these kvs 
> on the disk in a better way so that we get much greater compression ratios?
> One option (thanks Prakash) is to store all the keys in the beginning of the 
> block (let's call this the key-section) and then store all their 
> corresponding values towards the end of the block. This will allow us to 
> not even decompress the values when we are scanning and skipping over rows in 
> the block.
> Any other ideas? 

        

Reply via email to