apurtell opened a new pull request #3691:
URL: https://github.com/apache/hbase/pull/3691


   This change introduces provided compression codecs to HBase as new Maven 
modules. Each module supplies compression codec support without requiring new 
external dependencies; the codec implementations are shaded into the module 
artifacts. 
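
   For illustration, once one of these modules is on the classpath, 
compression is selected per column family through the existing client API; a 
minimal sketch (the ZSTD choice is just an example):

```java
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class CompressionExample {
  public static ColumnFamilyDescriptor zstdFamily() {
    // Pick a compression algorithm for a column family; the backing codec
    // implementation is resolved from whichever provider module is deployed.
    return ColumnFamilyDescriptorBuilder
        .newBuilder(Bytes.toBytes("cf"))
        .setCompressionType(Compression.Algorithm.ZSTD)
        .build();
  }
}
```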
   
   Previously we relied entirely on Hadoop’s hadoop-common for compression 
support, which in turn depends on native code integration that may or may not 
be available on a given hardware platform or in an operational environment. It 
would be good to have alternative codecs, provided in the HBase distribution, 
for users who for whatever reason cannot or do not wish to deploy the Hadoop 
native codecs. We therefore introduce a few new options offering a mix of pure 
Java and native accelerated implementations. Users can continue to use the 
Hadoop native codecs, or choose whichever of these new options best meets 
their requirements. 

A JMH harness for measuring codec performance, along with microbenchmark 
results, is attached to the JIRA. The test cases are realistic scenarios based 
on how we use compression in HFile.
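
The attached harness is the reference; purely as an illustration of its shape, 
a JMH microbenchmark over one of the pure Java compressors (Aircompressor's 
LZ4, introduced below) might look like this sketch, where the block size and 
synthetic data are placeholder assumptions:

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

import io.airlift.compress.lz4.Lz4Compressor;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class BlockCompressBench {
  final Lz4Compressor compressor = new Lz4Compressor();
  byte[] block; // one HFile-sized block of input
  byte[] out;   // preallocated worst-case output buffer

  @Setup
  public void setup() {
    block = new byte[64 * 1024];
    new Random(42).nextBytes(block); // placeholder data; the real harness uses realistic blocks
    out = new byte[compressor.maxCompressedLength(block.length)];
  }

  @Benchmark
  public int compressOneBlock() {
    return compressor.compress(block, 0, block.length, out, 0, out.length);
  }
}
```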
   
There is some future opportunity to optimize the block codec compatibility 
shims in the various modules: output might not need to be unconditionally 
buffered; buffer allocation time was included in the microbenchmark, so 
buffers could perhaps be pooled and reused (a sketch of that idea follows 
below); and data movement into and out of buffers might be done with Unsafe. 
That is follow-up work. For the initial version we want an implementation 
known to operate correctly under all tested conditions.
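
To make the buffer pooling idea concrete, one hypothetical shape of it, with 
names invented purely for illustration (nothing like this is in the change):

```java
// Hypothetical per-thread buffer reuse for the compatibility shims. The
// initial version deliberately allocates per call instead; this only
// illustrates where allocation cost could be removed later.
final class ReusableBuffers {
  private static final ThreadLocal<byte[]> CACHED =
      ThreadLocal.withInitial(() -> new byte[0]);

  // Return a cached buffer if it is large enough, growing it otherwise, so
  // steady-state block compression stops paying per-call allocation cost.
  static byte[] acquire(int size) {
    byte[] buf = CACHED.get();
    if (buf.length < size) {
      buf = new byte[size];
      CACHED.set(buf);
    }
    return buf;
  }
}
```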
   
   - hbase-compression: Parent module

   
   - hbase-compression-aircompressor: Provides codecs for LZO, LZ4, Snappy, and 
ZStandard, using Aircompressor. These are all pure Java implementations. All 
except the ZStandard codec are data compatible with the Hadoop codecs; the 
ZStandard codec is limited by the algorithm implementation in Aircompressor, 
which is at an early stage and somewhat specific to Trino’s needs. These 
codecs, with the exception of ZStandard, offer reasonable performance relative 
to the Hadoop native codecs and can serve as alternatives: up to 30% slower 
than Hadoop native on compression, up to 40% slower on decompression. This 
module is required to complete this work. 
   
   - hbase-compression-lz4: Provides a codec for LZ4, using lz4-java. The codec 
is data compatible with the Hadoop native codec and up to 40% faster. This is 
the fastest compression codec of all the alternatives, including the Hadoop 
native codecs. This module is optional but very nice to have.
   
   - hbase-compression-snappy: Provides a codec for Snappy, using the Xerial 
wrapper. The codec is data compatible with the Hadoop native codec. The 
advantage here is the integration: Xerial bundles native support into the jar, 
so it does not require deployment and management of separate native link 
libraries. It is slower than the Hadoop native Snappy codec by up to 75%, but 
faster than most other algorithms regardless of implementation. This module is 
optional. 
   
   - hbase-compression-xz: At my employer we have a private patch to Hadoop 
that provides a native codec for LZMA (XZ), and support for same in HBase. 
LZMA is an extremely aggressive and slow compression algorithm that is only 
suitable for certain specialized use cases, but the algorithm’s developer 
provides a pure Java implementation as “XZ For Java”, so we can offer support 
for the LZMA algorithm now. This module is optional. (A minimal usage sketch 
follows below.)
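
   As a picture of what the pure Java integration looks like underneath one of 
these modules, a one-shot XZ (LZMA) block compression with XZ For Java might 
be sketched as follows; the compatibility shim in the module adapts this kind 
of stream to Hadoop's compression interfaces, and the preset level here is 
just an assumption:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.tukaani.xz.LZMA2Options;
import org.tukaani.xz.XZOutputStream;

public class XzExample {
  // Compress one block with XZ For Java. LZMA2Options(6) is the library's
  // default preset, chosen here only for illustration.
  public static byte[] compress(byte[] block) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (XZOutputStream xz = new XZOutputStream(baos, new LZMA2Options(6))) {
      xz.write(block);
    }
    return baos.toByteArray();
  }
}
```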

   
   The modular structure will enable us to offer additional alternatives at 
some future time, as desired.
   

