[
https://issues.apache.org/jira/browse/HBASE-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417853#comment-17417853
]
Andrew Kyle Purtell edited comment on HBASE-26259 at 9/21/21, 11:50 PM:
------------------------------------------------------------------------
This change introduces provided compression codecs to HBase as new Maven
modules. Each module provides compression codec support without requiring new
external dependencies. The codec implementations are shaded into the module
artifacts.
Previously we relied entirely on Hadoop’s hadoop-common to provide compression
support, which in turn relies on native code integration, which may or may not
be available on a given hardware platform or in an operational environment. It
would be good to have alternative codecs, provided in the HBase distribution,
for users who for whatever reason cannot or do not wish to deploy the Hadoop
native codecs. Therefore we introduce a few new options offering a mix of pure
Java and natively accelerated implementations. Users can continue to use the
Hadoop native codecs, or choose whichever of these new options best meets their
requirements.
A JMH harness for measuring codec performance and microbenchmark results are
attached to this JIRA. The test cases are realistic scenarios based on how we
use compression in HFile.
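The attached harness uses JMH; as a rough sketch of the measurement shape only
(using java.util.zip's Deflater as a stand-in codec, since the actual codec
classes are outside this snippet, and with a made-up key/value pattern), a
single-block compression measurement looks roughly like:

```java
import java.util.zip.Deflater;

public class BlockCompressBench {
    // Compress one HFile-block-sized byte[] and return the compressed size.
    // Deflater here is only a stand-in for the codecs under test.
    static int compressedSize(byte[] block) {
        Deflater deflater = new Deflater();
        deflater.setInput(block);
        deflater.finish();
        byte[] out = new byte[block.length + 64];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(out, 0, out.length);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] block = new byte[64 * 1024]; // typical HFile block size
        String pattern = "row-key-0001-value"; // hypothetical KV-like content
        for (int i = 0; i < block.length; i++) {
            block[i] = (byte) pattern.charAt(i % pattern.length());
        }
        long start = System.nanoTime();
        int size = compressedSize(block);
        long micros = (System.nanoTime() - start) / 1000;
        System.out.println("in=" + block.length + " out=" + size + " time_us=" + micros);
    }
}
```

The real benchmark additionally varies block content distributions, which is
why the attached RandomDistribution.java matters: compression ratio and speed
depend heavily on the entropy of the input.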
There is some future opportunity to optimize the block codec compatibility
shims in the various modules. Output might not need to be unconditionally
buffered. Buffer allocation time was included in the microbenchmarks; buffers
could perhaps be pooled and reused. Data movement into and out of buffers might
be done with Unsafe. That is follow-up work; for the initial version we want an
implementation known to operate correctly under all tested conditions.
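The buffer pooling idea mentioned above could take a shape like the following
(a minimal illustrative sketch, not the HBase implementation; the class and
method names are hypothetical):

```java
import java.util.ArrayDeque;

// Bounded pool of fixed-size byte[] buffers, so the compatibility shims
// could reuse buffers instead of allocating one per block.
public class BufferPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int maxPooled;
    private final int bufferSize;

    public BufferPool(int maxPooled, int bufferSize) {
        this.maxPooled = maxPooled;
        this.bufferSize = bufferSize;
    }

    // Hand out a pooled buffer if one is available, else allocate.
    public synchronized byte[] acquire() {
        byte[] b = free.poll();
        return b != null ? b : new byte[bufferSize];
    }

    // Return a buffer to the pool; drop it if the pool is full or the
    // buffer is the wrong size.
    public synchronized void release(byte[] b) {
        if (b.length == bufferSize && free.size() < maxPooled) {
            free.push(b);
        }
    }

    public synchronized int pooledCount() {
        return free.size();
    }
}
```

Whether pooling actually pays off depends on allocation pressure in the block
encoding path, which is exactly what the microbenchmarks would need to show
before committing to it.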
- *hbase-compression*: Parent module
- *hbase-compression-aircompressor*: Provides codecs for LZO, LZ4, Snappy, and
ZStandard, using Aircompressor. These are all pure Java implementations. All
except the ZStandard codec are data compatible with the Hadoop codecs. The
ZStandard codec is limited by the algorithm implementation in Aircompressor,
which is at an early stage and somewhat specific to Trino’s needs. These
codecs, with the exception of ZStandard, offer reasonable performance relative
to the Hadoop native codecs and can serve as alternatives: performance is on
par, although depending on parameters and codec the results may vary by ±50%.
This module is required to complete this work.
- *hbase-compression-lz4*: Provides a codec for LZ4, using lz4-java. The codec
is data compatible with the Hadoop native codec and up to 50% faster than it,
making it the fastest compression codec of all the alternatives, including the
Hadoop native codecs. This module is optional but very nice to have.
- *hbase-compression-snappy*: Provides a codec for Snappy, using the Xerial
wrapper. The codec is data compatible with the Hadoop native codec; the
advantage here is the integration: Xerial bundles native support into the jar,
so it does not require deployment and management of separate native link
libraries. It is slower than the Hadoop native Snappy codec by up to 75%, but
faster than most other algorithms regardless of implementation. This module is
optional.
- *hbase-compression-xz*: At my employer we have a private patch to Hadoop that
provides a native codec for LZMA (XZ), and support for same in HBase. LZMA is an
extremely aggressive and slow compression algorithm that is only suitable for
certain specialized use cases, but a pure Java option is provided by the
algorithm’s developer as “XZ For Java”, so we can offer support for the LZMA
algorithm now. This module is optional.
The modular structure will enable us to offer additional alternatives at some
future time, as desired.
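As a usage sketch, a deployment that wants one of these codecs would point the
relevant codec configuration key at the module's implementation class in
hbase-site.xml. The key and class name below are illustrative and should be
checked against the shipped modules:

```xml
<!-- Hypothetical hbase-site.xml fragment: select the lz4-java codec.
     Key and class names are illustrative, not confirmed by this JIRA. -->
<property>
  <name>hbase.io.compress.lz4.codec</name>
  <value>org.apache.hadoop.hbase.io.compress.lz4.Lz4Codec</value>
</property>
```

Table schemas continue to name only the algorithm (e.g. COMPRESSION => 'LZ4');
the configuration decides which implementation backs it.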
> Fallback support to pure Java compression
> -----------------------------------------
>
> Key: HBASE-26259
> URL: https://issues.apache.org/jira/browse/HBASE-26259
> Project: HBase
> Issue Type: Sub-task
> Reporter: Andrew Kyle Purtell
> Assignee: Andrew Kyle Purtell
> Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-2
>
> Attachments: BenchmarkCodec.java, BenchmarksMain.java,
> RandomDistribution.java, ac_lz4_results.pdf, ac_snappy_results.pdf,
> ac_zstd_results.pdf, lz4_lz4-java_result.pdf, xerial_snappy_results.pdf
>
>
> Airlift’s aircompressor
> (https://search.maven.org/artifact/io.airlift/aircompressor) is an Apache 2
> licensed library, for Java 8 and up, available in Maven central, which
> provides pure Java implementations of gzip, lz4, lzo, snappy, and zstd and
> Hadoop compression codecs for same, claiming “_they are typically 300% faster
> than the JNI wrappers_.” (https://github.com/airlift/aircompressor). This
> library is under active development with up-to-date releases because it is
> used by Trino.
> Proposed changes:
> * Modify Compression.java such that compression codec implementation classes
> can be specified by configuration. Currently they are hardcoded as strings.
> * Pull in aircompressor as a ‘compile’ time dependency so it will be bundled
> into our build and made available on the server classpath.
> * Modify Compression.java to fall back to an aircompressor pure Java
> implementation if schema specifies a compression algorithm, a Hadoop native
> codec was specified as desired implementation, but the requisite native
> support is somehow not available.
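The fallback proposed in the third bullet can be sketched as follows. This is a
minimal illustration that uses class loading as the availability check; the
real check would have to consult Hadoop's native code loading, and all class
names here are placeholders:

```java
// Sketch of the proposed fallback: prefer the configured (e.g. Hadoop
// native) codec class, but fall back to a pure Java implementation if
// the preferred class cannot be loaded or linked.
public class CodecFallback {
    public static String resolveCodecClassName(String preferred, String pureJavaFallback) {
        try {
            Class.forName(preferred);
            return preferred;
        } catch (ClassNotFoundException | LinkageError e) {
            // Native or preferred implementation unavailable: use pure Java.
            return pureJavaFallback;
        }
    }
}
```

In the real change this decision would live in Compression.java, where the
codec implementation class becomes configurable rather than hardcoded.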