apurtell opened a new pull request #3691: URL: https://github.com/apache/hbase/pull/3691
This change introduces provided compression codecs to HBase as new Maven modules. Each module provides compression codec support without requiring new external dependencies; the codec implementations are shaded into the module artifacts. Previously we relied entirely on Hadoop's hadoop-common to provide compression support, which in turn relies on native code integration that may or may not be available on a given hardware platform or in an operational environment. It would be good to have alternative codecs, provided in the HBase distribution, for users who for whatever reason cannot or do not wish to deploy the Hadoop native codecs. We therefore introduce a few new options offering a mix of pure Java and natively accelerated implementations. Users can continue to use the Hadoop native codecs, or choose whichever of these new options best meets their requirements.

A JMH harness for measuring codec performance and microbenchmark results are attached to the JIRA. The test cases are realistic scenarios based on how we use compression in HFile.

There is some future opportunity to optimize the block codec compatibility shims in the various modules: output might not need to be unconditionally buffered; buffer allocation time was included in the microbenchmarks, so buffers could perhaps be pooled and reused; and data movement into and out of buffers might be done with Unsafe. That is follow-up work. For the initial version we want an implementation known to operate correctly under all tested conditions.

- hbase-compression: Parent module.
- hbase-compression-aircompressor: Provides codecs for LZO, LZ4, Snappy, and ZStandard, using Aircompressor. These are all pure Java implementations. All except the ZStandard codec are data compatible with the Hadoop codecs. The ZStandard codec is limited by the algorithm implementation in Aircompressor, which is at an early stage and somewhat specific to Trino's needs.
These codecs, with the exception of ZStandard, offer reasonable performance relative to the Hadoop native codecs, and so can serve as alternatives: up to 30% slower than Hadoop native on compression, up to 40% slower on decompression. This module is required to complete this work.
- hbase-compression-lz4: Provides a codec for LZ4, using lz4-java. The codec is data compatible with the Hadoop native codec and is faster, up to 40% faster than Hadoop native. This is the fastest compression codec of all the alternatives, including the Hadoop native codecs. This module is optional but is very nice to have.
- hbase-compression-snappy: Provides a codec for Snappy, using the Xerial wrapper. The codec is data compatible with the Hadoop native codec; the integration is different, and that is the advantage here. Xerial bundles native support into the jar, so it does not require deployment and management of separate native link libraries. It is slower than the Hadoop native Snappy codec by up to 75%, but faster than most other algorithms regardless of implementation. This module is optional.
- hbase-compression-xz: At my employer we have a private patch to Hadoop that provides a native codec for LZMA (XZ), and support for same in HBase. LZMA is an extremely aggressive and slow compression algorithm that is only suitable for certain specialized use cases, but a pure Java implementation is provided by the algorithm's developer as "XZ For Java", so we can offer support for the LZMA algorithm now. This module is optional.

The modular structure will enable us to offer additional alternatives at some future time, as desired.
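To make the per-block compress/decompress pattern behind the compatibility shims concrete, here is a minimal sketch. It uses the JDK's `Deflater`/`Inflater` purely as a stand-in for a shaded codec (the real modules wrap Aircompressor, lz4-java, Xerial Snappy, or XZ For Java); the class and method names are illustrative and not from the patch.

```java
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative one-shot block codec, in the spirit of the per-module
// compatibility shims. The JDK's Deflater/Inflater stand in for the
// shaded codec implementation; names here are hypothetical.
public final class BlockCodecSketch {

    // Compress a whole block into a freshly allocated buffer. The real shims
    // could avoid this unconditional buffering, or pool and reuse buffers,
    // per the follow-up opportunities noted above.
    public static byte[] compressBlock(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[Math.max(64, input.length * 2)];
            int len = 0;
            while (!deflater.finished()) {
                len += deflater.deflate(buf, len, buf.length - len);
                if (len == buf.length && !deflater.finished()) {
                    byte[] bigger = new byte[buf.length * 2];
                    System.arraycopy(buf, 0, bigger, 0, len);
                    buf = bigger;
                }
            }
            byte[] out = new byte[len];
            System.arraycopy(buf, 0, out, 0, len);
            return out;
        } finally {
            deflater.end();
        }
    }

    // Decompress a block whose uncompressed size is known up front,
    // as it is for HFile blocks.
    public static byte[] decompressBlock(byte[] compressed, int uncompressedLen)
            throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[uncompressedLen];
            int len = 0;
            while (len < uncompressedLen && !inflater.finished()) {
                len += inflater.inflate(out, len, out.length - len);
            }
            return out;
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] block = "row1/cf:q/value ".repeat(256).getBytes("UTF-8");
        byte[] compressed = compressBlock(block);
        byte[] restored = decompressBlock(compressed, block.length);
        System.out.println("original=" + block.length
                + " compressed=" + compressed.length
                + " roundTripOk=" + java.util.Arrays.equals(block, restored));
    }
}
```

The one-shot shape (known uncompressed length, whole-block input and output) is what makes buffer pooling and Unsafe-based data movement plausible later optimizations.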
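For users, once one of these module artifacts is on the classpath, a codec is enabled per column family in the usual way. A sketch from the HBase shell, with hypothetical table and family names:

```ruby
# HBase shell: table/family names are examples only.
create 'mytable', {NAME => 'cf', COMPRESSION => 'LZ4'}
# Or switch an existing family to another algorithm:
alter 'mytable', {NAME => 'cf', COMPRESSION => 'ZSTD'}
```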
