apurtell commented on pull request #3748:
URL: https://github.com/apache/hbase/pull/3748#issuecomment-944805106


   Here are the performance test results.
   
   I wrote an integration test that simulates a location data tracking use case. It writes 10 million rows, each with a random 64-bit row key (not important) and a single column family with four qualifiers: first name, last name, latitude (encoded as an integer with a scale of 3), and longitude (also encoded as an integer with a scale of 3). The details aren't really important except to say that the character strings are short, corresponding to typical lengths for English first and last names, and that the two 32-bit integer values are generated with a zipfian distribution to reduce entropy and allow for potentially successful dictionary compression, but are also short. When creating the table the IT specified a block size of 1K, which is perhaps not unreasonable for a heavily indexed use case with short values. I could have achieved a higher compression ratio if the row keys were sequential instead of completely random, but that is not really important either.
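   
   Roughly, the table setup looks like the following sketch. This is not the actual IT code; the table name, family name, and zipfian parameters are illustrative placeholders, not the exact values the IT uses.
   
```java
import java.io.IOException;

import org.apache.commons.math3.distribution.ZipfDistribution;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateTrackingTable {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // One column family with a 1K block size and ZSTD compression,
      // matching the table settings described above. Names are placeholders.
      admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("tracking"))
        .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("d"))
          .setBlocksize(1024)
          .setCompressionType(Compression.Algorithm.ZSTD)
          .build())
        .build());
    }
    // Low-entropy 32-bit values for the scaled latitude/longitude columns,
    // drawn from a zipfian distribution. The range and exponent here are
    // arbitrary; they are not the IT's exact parameters.
    ZipfDistribution zipf = new ZipfDistribution(100_000, 1.1);
    System.out.println("sample scaled coordinate: " + zipf.sample());
  }
}
```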
   
   I also wrote a simple utility that iterates over an HFile and saves the block data of each DATA or ENCODED_DATA block as a separate file elsewhere. These files were used as the training set for `zstd`: I extracted 20,000 blocks and trained a 1MB dictionary. The parameters I used for training were basic and not especially tuned. I am not deeply expert in this aspect of ZStandard, so I can't estimate how much additional gain is possible.
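   
   The equivalent training step could also be done programmatically. The sketch below uses zstd-jni's `ZstdDictTrainer` rather than the `zstd` command-line tool; the sample buffer cap, paths, and output name are illustrative only and not the exact procedure used here.
   
```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.github.luben.zstd.ZstdDictTrainer;

public class TrainBlockDictionary {
  public static void main(String[] args) throws IOException {
    // Directory containing the extracted DATA/ENCODED_DATA block files.
    Path blocksDir = Paths.get(args[0]);
    // Target a 1MB dictionary, as in the test described above.
    int dictSize = 1024 * 1024;

    // First argument caps the total size of buffered samples, the second is
    // the desired dictionary size.
    ZstdDictTrainer trainer = new ZstdDictTrainer(256 * 1024 * 1024, dictSize);
    try (DirectoryStream<Path> blocks = Files.newDirectoryStream(blocksDir)) {
      for (Path block : blocks) {
        trainer.addSample(Files.readAllBytes(block));
      }
    }
    // Train and write the dictionary to a file for later use by the codec.
    Files.write(Paths.get("blocks.dict"), trainer.trainSamples());
  }
}
```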
   
   The results demonstrate the compression speed improvements (22-33%) expected from the ZStandard documentation. They also demonstrate space efficiency gains (a modest 6-8%), especially in combination with higher levels, and higher levels become more affordable because of the relative speedup at each level. Even this simple case shows meaningful gains, with potential for more when the dictionary training is done by someone with expert knowledge. It seems reasonable to support this feature.
   
   **No Dictionary**
   
   |Level|On Disk Size (bytes)|Space Saved|Compaction Time (sec)|
   |--|--|--|--|
   |none (uncompressed)|1,686,075,803|-|-|
   |1|767,926,618|54.5%|42|
   |3|756,427,617|55.1%|37|
   |5|746,302,550|55.7%|48|
   |6|744,741,449|55.8%|50|
   |7|744,701,778|55.8%|54|
   |12|731,150,341|56.6%|115|
   
   **With Dictionary**
   
   |Level|On Disk Size (bytes)|Space Saved|Compaction Time (sec)|
   |--|--|--|--|
   |1|679,408,139|59.7%|28|
   |3|652,587,956|61.3%|31|
   |5|630,927,508|62.6%|37|
   |6|632,251,996|62.5%|39|
   |7|625,972,642|62.9%|56|
   |12|626,293,580|62.9%|89|
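   
   To actually use the dictionary in practice, the codec would be pointed at the trained dictionary file through configuration. A minimal sketch follows, assuming property keys along the lines of `hbase.io.compress.zstd.level` and `hbase.io.compress.zstd.dictionary`; the exact key names are defined by this PR series, so treat them as illustrative.
   
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ZstdCodecSettingsSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Illustrative keys and path only: the actual property names are
    // established by the ZSTD codec changes in this PR series.
    conf.setInt("hbase.io.compress.zstd.level", 5);
    conf.set("hbase.io.compress.zstd.dictionary",
      "hdfs:///dictionaries/tracking-blocks.dict");
    System.out.println(conf.get("hbase.io.compress.zstd.dictionary"));
  }
}
```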
   
   Let me clean up checkstyle and address the other review feedback, then merge this once the prerequisite PRs are merged.

