apache.org created ORC-817:
------------------------------
Summary: Replace aircompressor ZStandard compression with zstd-jni
Key: ORC-817
URL: https://issues.apache.org/jira/browse/ORC-817
Project: ORC
Issue Type: Improvement
Components: Java
Affects Versions: 1.7.0
Reporter: apache.org
Fix For: 1.7.0
This issue tracks the replacement of the `aircompressor` dependency for
ZStandard compression with `zstd-jni`.
ORC's Java ZStandard compression codec currently uses the `aircompressor`
dependency. This implementation is in pure Java, which provides all the
niceties of not using an additional language, but over time, it has become less
ideal:
* Multiple other projects in the big data processing ecosystem, such as `spark`,
`parquet`, and `avro`, rely on `zstd-jni`, which is a Java Native Interface
wrapper over the core `zstd` C library. Relying on the same dependency as
other projects in our realm will let us track the same improvements and
stay on the ZStandard implementation the community has already converged on.
* ORC C++ uses the `zstd` library directly, while ORC Java relies on
`aircompressor`. Since these versions do not have feature parity, it is
theoretically possible to modify ORC C++ to produce a file that ORC Java cannot
read. Maintaining compatibility between C++ and Java ORC means restricting the
available features to those supported by both, which is limiting when relying
on `aircompressor`. It is also conceivable that unintended incompatibilities
between implementations could silently arise.
* `aircompressor` implements a very limited set of ZStandard compression
modes. In
[https://github.com/airlift/aircompressor/blob/495bae80ac7487d2efa1bba437d04e8a2a42bb7b/src/main/java/io/airlift/compress/zstd/CompressionParameters.java#L143]
it can be seen that only the `DoubleFastBlockCompressor` strategy of ZStandard
(out of the eight possible strategies) is actually implemented. This is a
fast, lower-compression-ratio strategy. It is suitable for things like shuffle
data, but the slower, higher-compression-ratio levels, which could be used to
store "write-once-read-many" or backup data in ORC, aren't possible with
`aircompressor`.
* `aircompressor` currently suffers from a bug, originally discovered in the
`presto` community, that prevents ORC from upgrading to the most recent
`aircompressor` version, lest we introduce the same bug into ORC:
[https://github.com/airlift/aircompressor/issues/122]. Moving to `zstd-jni`
could let `presto-orc` move to `zstd-jni` as well.
* Besides bug and performance fixes, `zstd-jni` supports newer functionality
like `--long` mode that `aircompressor` doesn't. This mode uses longer distance
windows to achieve materially higher compression ratios at the same speeds as
earlier ZStandard versions, and has been available for more than two years:
[https://github.com/facebook/zstd/releases/tag/v1.3.2]
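
To illustrate why the match window matters for compression ratio, here is a
minimal sketch that uses the JDK's built-in DEFLATE codec as a stand-in. It
does not use ZStandard, `zstd-jni`, or `aircompressor`; DEFLATE is chosen only
because it ships with the JDK and has a fixed 32 KiB window. The class and
method names (`WindowDemo`, `deflatedSize`, `repeatedBlock`) are hypothetical.
Both inputs are 128 KiB of the same kind of data, but only the one whose
repeats fall inside the window can be compressed:

```java
import java.util.Random;
import java.util.zip.Deflater;

// Stand-in demo: DEFLATE (32 KiB window), NOT ZStandard.
final class WindowDemo {

    // Compress the input with DEFLATE and return the compressed size in bytes.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    // Build `copies` back-to-back copies of one random block of `blockSize` bytes.
    static byte[] repeatedBlock(int blockSize, int copies, long seed) {
        byte[] block = new byte[blockSize];
        new Random(seed).nextBytes(block);
        byte[] out = new byte[blockSize * copies];
        for (int i = 0; i < copies; i++) {
            System.arraycopy(block, 0, out, i * blockSize, blockSize);
        }
        return out;
    }

    public static void main(String[] args) {
        // Repeats 16 KiB apart: within DEFLATE's 32 KiB window.
        byte[] near = repeatedBlock(16 * 1024, 8, 42L);
        // Repeats 64 KiB apart: beyond the window, so they cannot be matched.
        byte[] far = repeatedBlock(64 * 1024, 2, 42L);
        System.out.println("near: " + deflatedSize(near) + " bytes");
        System.out.println("far:  " + deflatedSize(far) + " bytes");
    }
}
```

The `far` input should come out several times larger than `near`, because its
only redundancy lies farther back than the compressor can look. ZStandard's
`--long` mode addresses the same limitation at a much larger scale by widening
the window (the `zstd` command-line tool defaults to a window log of 27, i.e.
128 MiB, in this mode).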
--
This message was sent by Atlassian Jira
(v8.3.4#803005)