[ https://issues.apache.org/jira/browse/HBASE-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417853#comment-17417853 ]

Andrew Kyle Purtell edited comment on HBASE-26259 at 9/21/21, 10:22 PM:
------------------------------------------------------------------------

This change introduces provided compression codecs to HBase as new Maven 
modules. Each module provides compression codec support without requiring new 
external dependencies. The codec implementations are shaded into the module 
artifacts.

Previously we relied entirely on Hadoop’s hadoop-common to provide compression 
support, which in turn relies on native code integration, which may or may not 
be available on a given hardware platform or in an operational environment. It 
would be good to have alternative codecs, provided in the HBase distribution, 
for users who for whatever reason cannot or do not wish to deploy the Hadoop 
native codecs. Therefore we introduce a few new options offering a mix of pure 
Java and native-accelerated implementations. Users can continue to use the 
Hadoop native codecs or choose whichever of these new options best meets their 
requirements.



A JMH harness for measuring codec performance and microbenchmark results are 
attached to this JIRA. The test cases are realistic scenarios based on how we 
use compression in HFile.
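The attached harness uses JMH; as a rough illustration of the kind of scenario 
it measures (compressing a single HFile-sized block of low-entropy data), a 
minimal stdlib-only timing sketch might look like the following. The 64 KB 
block size and the use of java.util.zip's Deflater here are illustrative 
assumptions, not the attached benchmark code.

```java
import java.util.Random;
import java.util.zip.Deflater;

public class BlockCompressSketch {
    public static void main(String[] args) {
        // Roughly HFile-block-sized input (64 KB is the default block size),
        // filled with a small alphabet so it is compressible.
        byte[] block = new byte[64 * 1024];
        Random rnd = new Random(42);
        for (int i = 0; i < block.length; i++) {
            block[i] = (byte) ('a' + rnd.nextInt(4)); // low-entropy data
        }
        byte[] out = new byte[block.length * 2];

        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        long start = System.nanoTime();
        deflater.setInput(block);
        deflater.finish();
        int compressedLen = deflater.deflate(out);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        deflater.end();

        System.out.println("compressed " + block.length + " -> "
            + compressedLen + " bytes in " + elapsedMicros + " us");
    }
}
```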


There is some future opportunity to optimize the block codec compatibility 
shims in the various modules: output might not need to be unconditionally 
buffered; buffers, whose allocation time was included in the microbenchmark, 
could be pooled and reused; and data movement into and out of buffers might be 
done with Unsafe. That is follow-up work. For the initial version we want an 
implementation known to operate correctly under all tested conditions.
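For instance, the buffer pooling mentioned above could be sketched roughly as 
follows. This is a hypothetical illustration of the idea, not code from the 
patch: a tiny pool that hands out same-sized byte arrays and takes them back 
instead of allocating per block.

```java
import java.util.ArrayDeque;

/**
 * Hypothetical sketch of the buffer reuse idea: instead of allocating a new
 * byte[] per compressed block, borrow one from a small pool and return it
 * after use. Allocation only happens when the pool is empty.
 */
public class BufferPool {
    private final ArrayDeque<byte[]> pool = new ArrayDeque<>();
    private final int bufferSize;

    public BufferPool(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    /** Reuse a pooled buffer if one is available, else allocate a fresh one. */
    public synchronized byte[] acquire() {
        byte[] buf = pool.pollFirst();
        return buf != null ? buf : new byte[bufferSize];
    }

    /** Return a buffer to the pool for reuse by a later compression call. */
    public synchronized void release(byte[] buf) {
        if (buf.length == bufferSize) {
            pool.offerFirst(buf);
        }
    }
}
```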

- *hbase-compression*: Parent module


- *hbase-compression-aircompressor*: Provides codecs for LZO, LZ4, Snappy, and 
ZStandard, using Aircompressor. These are all pure Java implementations. All 
except the ZStandard codec are data compatible with the Hadoop codecs. The 
ZStandard codec is limited by the algorithm implementation in Aircompressor, 
which is at an early stage and somewhat specific to Trino’s needs. With the 
exception of ZStandard, these codecs offer reasonable performance as 
alternatives to the Hadoop native codecs: roughly on par, although depending on 
parameters and codec the results might vary by +/- 50%. This module is required 
to complete this work.
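Selecting one of these codecs would presumably be a matter of site 
configuration; assuming a per-algorithm property naming the implementation 
class (both the property name and class name below are my guesses, so check 
the final patch), it might look like:

```xml
<!-- Hypothetical hbase-site.xml fragment: property and class names are
     assumptions about what this change introduces, not confirmed names. -->
<property>
  <name>hbase.io.compress.lz4.codec</name>
  <value>org.apache.hadoop.hbase.io.compress.aircompressor.Lz4Codec</value>
</property>
```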

- *hbase-compression-lz4*: Provides a codec for LZ4, using lz4-java. The codec 
is data compatible with the Hadoop native codec and up to 40% faster, making it 
the fastest compression codec of all the alternatives, including the Hadoop 
native codecs. This module is optional but very nice to have.

- *hbase-compression-snappy*: Provides a codec for Snappy, using the Xerial 
wrapper. The codec is data compatible with the Hadoop native codec. The 
advantage here is the integration: Xerial bundles native support into the jar, 
so it does not require deployment and management of separate native link 
libraries. It is slower than the Hadoop native Snappy codec by up to 75%, but 
faster than most other algorithms regardless of implementation. This module is 
optional.

- *hbase-compression-xz*: At my employer we have a private patch to Hadoop that 
provides a native codec for LZMA (XZ), and support for same in HBase. LZMA is 
an extremely aggressive and slow compression algorithm that is only suitable 
for certain specialized use cases, but a pure Java implementation is provided 
by the algorithm’s developer as “XZ For Java”, so we can offer support for the 
LZMA algorithm now. This module is optional.


The modular structure will enable us to offer additional alternatives at some 
future time, as desired.


> Fallback support to pure Java compression
> -----------------------------------------
>
>                 Key: HBASE-26259
>                 URL: https://issues.apache.org/jira/browse/HBASE-26259
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 2.5.0, 3.0.0-alpha-2
>
>         Attachments: BenchmarkCodec.java, BenchmarksMain.java, 
> RandomDistribution.java, ac_lz4_results.pdf, ac_snappy_results.pdf, 
> ac_zstd_results.pdf, lz4_lz4-java_result.pdf, xerial_snappy_results.pdf
>
>
> Airlift’s aircompressor 
> (https://search.maven.org/artifact/io.airlift/aircompressor) is an Apache 2 
> licensed library, for Java 8 and up, available in Maven central, which 
> provides pure Java implementations of gzip, lz4, lzo, snappy, and zstd and 
> Hadoop compression codecs for same, claiming “_they are typically 300% faster 
> than the JNI wrappers_.” (https://github.com/airlift/aircompressor). This 
> library is under active development and up to date releases because it is 
> used by Trino.
> Proposed changes:
> * Modify Compression.java such that compression codec implementation classes 
> can be specified by configuration. Currently they are hardcoded as strings.
> * Pull in aircompressor as a ‘compile’ time dependency so it will be bundled 
> into our build and made available on the server classpath.
> * Modify Compression.java to fall back to an aircompressor pure Java 
> implementation if schema specifies a compression algorithm, a Hadoop native 
> codec was specified as desired implementation, but the requisite native 
> support is somehow not available.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
