[ https://issues.apache.org/jira/browse/HBASE-26353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432759#comment-17432759 ]

Andrew Kyle Purtell edited comment on HBASE-26353 at 10/22/21, 2:14 AM:
------------------------------------------------------------------------

Reopening. This is not quite ready yet. 

There are some code paths remaining where the store configuration 
(CompoundConfiguration) is not passed into the block decoding context. These 
were found with additional integration tests.

We should also add unit tests for region splitting and region merging with this 
option enabled.

SplitTableRegionProcedure and MergeTableRegionsProcedure must construct a 
suitable CompoundConfiguration as if the HFile manipulations were running in 
the regionserver context. Otherwise we may be unable to read the store files 
if an essential detail from the CF or table schema is missing, such as the 
path to the dictionary file when an external dictionary was used to compress 
the HFile at write time.
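To make the failure mode concrete: a stream compressed against an external dictionary can only be decoded with that same dictionary. The sketch below is a hedged illustration using Python's zlib preset-dictionary support as a stand-in for zstd's trained dictionaries (it is not the HBase code path, and the sample payload and dictionary bytes are made up for the example):

```python
import zlib

# Made-up example data: a shared "dictionary" of common substructure,
# and a small record resembling the values stored in such a table.
dictionary = b'{"user_id": , "event": "click", "ts": }'
record = b'{"user_id": 12345, "event": "click", "ts": 1634870000}'

def deflate(data, zdict=None):
    # Compress with an optional preset dictionary (zlib's analogue of
    # initializing a zstd compressor with a trained dictionary).
    c = zlib.compressobj(level=6, zdict=zdict) if zdict else zlib.compressobj(level=6)
    return c.compress(data) + c.flush()

plain = deflate(record)
primed = deflate(record, dictionary)
print(len(record), len(plain), len(primed))  # dictionary-primed output is smaller

# Decoding requires the same dictionary the writer used -- analogous to
# reading back an HFile written with an external zstd dictionary.
d = zlib.decompressobj(zdict=dictionary)
assert d.decompress(primed) + d.flush() == record
```

The point of the sketch: the decoder must be handed the dictionary out of band, which is exactly the schema detail the split/merge procedures need in their configuration.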

Longer term, this information could be moved into HFile metadata, to avoid the 
need for this additional plumbing (which could then be removed if desired). 


was (Author: apurtell):
Reopening. This is not quite ready yet. 

There are some code paths remaining where store configuration 
(CompoundConfiguration) is not passed into the block decoding context. Found 
with additional integration tests.

Add unit tests for region splitting with this option enabled.

> Support loadable dictionaries in hbase-compression-zstd
> -------------------------------------------------------
>
>                 Key: HBASE-26353
>                 URL: https://issues.apache.org/jira/browse/HBASE-26353
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Minor
>             Fix For: 2.5.0, 3.0.0-alpha-2
>
>
> ZStandard supports initialization of compressors and decompressors with a 
> precomputed dictionary, which can dramatically improve and speed up 
> compression of tables with small values. For more details, please see [The 
> Case For Small Data 
> Compression|https://github.com/facebook/zstd#the-case-for-small-data-compression].
>  
> If a table is going to have a lot of small values and the user can put 
> together a representative set of files that can be used to train a dictionary 
> for compressing those values, a dictionary can be trained with the {{zstd}} 
> command line utility, available in any zstandard package for your favorite OS:
> Training:
> {noformat}
> $ zstd --maxdict=1126400 --train-fastcover=shrink \
>     -o mytable.dict training_files/*
> Trying 82 different sets of parameters
> ...
> k=674                                      
> d=8
> f=20
> steps=40
> split=75
> accel=1
> Save dictionary of size 1126400 into file mytable.dict
> {noformat}
> Deploy the dictionary file to HDFS or S3, etc.
> Create the table:
> {noformat}
> hbase> create "mytable", 
>   ... ,
>   CONFIGURATION => {
>     'hbase.io.compress.zstd.level' => '6',
>     'hbase.io.compress.zstd.dictionary' => true,
>     'hbase.io.compress.zstd.dictionary.file' => 'hdfs://nn/zdicts/mytable.dict'
>   }
> {noformat}
> Now start storing data. Compression results even for small values will be 
> excellent.
> Note: if the dictionary is lost, the data will not be decompressible.


