apurtell commented on pull request #3748:
URL: https://github.com/apache/hbase/pull/3748#issuecomment-945176138


   Just to double-check, I re-ran the test described earlier, except that this 
time the generated test data contained only:
   - 10 million rows
   - A 64-bit monotonically increasing row key
   - Two values, both 32-bit integers, drawn from random number generators 
obeying a Zipfian distribution (using our RandomDistribution.Zipf with a sigma 
of 1.2) 
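
For anyone who wants to reproduce data of roughly this shape outside HBase, here is a minimal Python sketch. It is not the generator used for the test: `zipf_sampler` and `generate_rows` are hypothetical helpers, the sampler is a plain inverse-CDF stand-in for RandomDistribution.Zipf, and the value range `n=100_000` is an assumption, since the range used in the test is not stated here.

```python
import bisect
import itertools
import random
import struct


def zipf_sampler(rng, n, sigma):
    """Return a function drawing ranks 1..n with P(rank) proportional to rank**-sigma."""
    weights = [rank ** -sigma for rank in range(1, n + 1)]
    cdf = list(itertools.accumulate(weights))
    total = cdf[-1]

    def draw():
        # Inverse-CDF lookup: index of the first cumulative weight above u.
        u = rng.random() * total
        return min(bisect.bisect_right(cdf, u), n - 1) + 1

    return draw


def generate_rows(num_rows, seed=0, n=100_000, sigma=1.2):
    """Yield (row_key, value_a, value_b) as big-endian byte strings.

    n (the number of distinct ranks) is an assumed value, not taken from the test.
    """
    rng = random.Random(seed)
    draw = zipf_sampler(rng, n, sigma)
    for key in range(num_rows):
        # An 8-byte big-endian key sorts in the same order as the integer it encodes.
        yield struct.pack(">Q", key), struct.pack(">i", draw()), struct.pack(">i", draw())


if __name__ == "__main__":
    for row_key, a, b in generate_rows(5):
        print(row_key.hex(), struct.unpack(">i", a)[0], struct.unpack(">i", b)[0])
```

Small ranks dominate the output, so the two value columns are highly repetitive, which is the property a trained dictionary can exploit.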
   
   When training the dictionary I gave the trainer the parameters k=32 (bit 
width to enter into the dictionary) and d=8 (stride for walking over content, 
in bits). This is a good approximation of choosing these parameters with 
intent in a real use case. The results show the advertised speedup in 
compression. They also show that a dictionary achieves better overall 
compression, because a higher compression level fits within the same time 
budget as the no-dictionary case.
   
   **Integers Only, No Dictionary**
   
   |Level|On Disk Size|Compression|Compaction Time (sec)|
   |---|---|---|---|
   |1|261,658,729|68.3%|21|
   |3|251,343,431|69.6%|22|
   |5|251,968,603|69.5%|25|
   |6|251,467,677|69.5%|26|
   |7|251,509,580|69.5%|27|
   |12|235,410,126|71.5%|51|
   
   **Integers Only, With Dictionary (k=32,d=8)**
   
   |Level|On Disk Size|Compression|Compaction Time (sec)|
   |---|---|---|---|
   |1|248,971,553|69.8%|13|
   |3|248,528,035|69.9%|14|
   |5|245,846,087|70.2%|16|
   |6|245,705,224|70.2%|17|
   |7|226,998,954|72.5%|25|
   |12|226,796,109|72.5%|39|
   |15|226,553,944|72.6%|44|
   |~~18~~|~~216,373,878~~|~~73.8%~~|~~153~~|
   |~~22~~|~~216,373,736~~|~~73.8%~~|~~165~~|
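
To put numbers on the comparison, the following Python sketch computes the per-level reduction in compaction time and on-disk size from the two tables above. The figures are copied from the tables; the struck-through rows for levels 18 and 22 are left out, and the `compare` helper is only illustrative.

```python
# level -> (on-disk size in bytes, compaction time in seconds), copied from the tables
no_dict = {
    1: (261_658_729, 21),
    3: (251_343_431, 22),
    5: (251_968_603, 25),
    6: (251_467_677, 26),
    7: (251_509_580, 27),
    12: (235_410_126, 51),
}
with_dict = {
    1: (248_971_553, 13),
    3: (248_528_035, 14),
    5: (245_846_087, 16),
    6: (245_705_224, 17),
    7: (226_998_954, 25),
    12: (226_796_109, 39),
    15: (226_553_944, 44),
}


def compare(baseline, candidate):
    """Return (level, % time saved, % size saved) for levels present in both tables."""
    rows = []
    for level in sorted(baseline.keys() & candidate.keys()):
        size_a, time_a = baseline[level]
        size_b, time_b = candidate[level]
        rows.append((
            level,
            100 * (time_a - time_b) / time_a,
            100 * (size_a - size_b) / size_a,
        ))
    return rows


if __name__ == "__main__":
    for level, time_saved, size_saved in compare(no_dict, with_dict):
        print(f"level {level:>2}: {time_saved:4.1f}% less time, {size_saved:4.1f}% smaller")
```

The time-budget point can also be read straight off the tables: level 15 with the dictionary takes 44 seconds and produces 226,553,944 bytes, while level 12 without a dictionary takes 51 seconds and produces 235,410,126 bytes.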



