[GitHub] [orc] dchristle opened a new pull request #988: ORC-817: Replace aircompressor ZStandard compression with zstd-jni

GitBox Sun, 02 Jan 2022 17:01:28 -0800


dchristle opened a new pull request #988:
URL: https://github.com/apache/orc/pull/988

### What changes were proposed in this pull request?
This PR proposes to replace the
[`aircompressor`](https://github.com/airlift/aircompressor) library for ORC's
ZStandard compression with [`zstd-jni`](https://github.com/luben/zstd-jni),
which is a set of JNI bindings around the [official `zstd`
library](https://github.com/facebook/zstd). In addition to switching the
underlying library, this PR also exposes the compression level and "long mode"
settings to ORC users. These settings allow user choice around different
speed/compression tradeoffs, rather than the current approach that primarily
uses a default setting.
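
As a rough illustration of the three settings described above, here is a minimal sketch of an options holder with range checks. The class name `ZstdOptions` and the `example.zstd.*` property keys are hypothetical and are not the names used in this PR. The bounds (levels 1 to 22, window log 10 to 31, `0` meaning "library default") are my reading of upstream `zstd`'s documented limits for 64-bit builds and should be checked against the bundled version; negative "fast" levels are deliberately left out.

```java
import java.util.Properties;

/** Hypothetical holder for zstd tuning knobs; NOT the class added by this PR. */
final class ZstdOptions {
    // Assumed bounds, taken from upstream zstd's documented limits (64-bit builds).
    static final int MIN_LEVEL = 1;
    static final int MAX_LEVEL = 22;
    static final int DEFAULT_LEVEL = 3;
    static final int MIN_WINDOW_LOG = 10;
    static final int MAX_WINDOW_LOG = 31;
    /** Sentinel meaning "let the library pick the window size". */
    static final int LIBRARY_DEFAULT_WINDOW_LOG = 0;

    final int level;
    final int windowLog;
    final boolean longMode;

    ZstdOptions(int level, int windowLog, boolean longMode) {
        if (level < MIN_LEVEL || level > MAX_LEVEL) {
            throw new IllegalArgumentException("zstd level out of range: " + level);
        }
        if (windowLog != LIBRARY_DEFAULT_WINDOW_LOG
                && (windowLog < MIN_WINDOW_LOG || windowLog > MAX_WINDOW_LOG)) {
            throw new IllegalArgumentException("zstd window log out of range: " + windowLog);
        }
        this.level = level;
        this.windowLog = windowLog;
        this.longMode = longMode;
    }

    /** Reads the options from hypothetical property keys, falling back to defaults. */
    static ZstdOptions fromProperties(Properties conf) {
        int level = Integer.parseInt(
            conf.getProperty("example.zstd.level", String.valueOf(DEFAULT_LEVEL)));
        int windowLog = Integer.parseInt(
            conf.getProperty("example.zstd.windowlog",
                String.valueOf(LIBRARY_DEFAULT_WINDOW_LOG)));
        boolean longMode = Boolean.parseBoolean(
            conf.getProperty("example.zstd.long.enable", "false"));
        return new ZstdOptions(level, windowLog, longMode);
    }
}
```

Validating at construction time means a bad value fails when the writer is configured rather than deep inside the native call.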

### Why are the changes needed?
These changes make sense for a few reasons:

* ORC users will gain all the improvements from the main `zstd` library. It
is under active development and receives regular speed and compression
improvements. In contrast, `aircompressor`'s zstd implementation is older and
stale.
* ORC users will be able to use the entire speed/compression tradeoff space.
Today, `aircompressor`'s implementation supports only one of eight compression
strategies
([link](https://github.com/airlift/aircompressor/blob/c5e6972bd37e1d3834514957447028060a268eea/src/main/java/io/airlift/compress/zstd/CompressionParameters.java#L143)).
This means only a small range of faster but less compressive strategies can be
exposed to ORC users. ORC storage with high compression (e.g. for
large-but-infrequently-used data) is a clear use case that this PR would unlock.
* It will harmonize the Java ORC implementation with other projects in the
Hadoop ecosystem. Parquet, Spark, and even the C++ ORC reader/writer all rely
on the official `zstd` implementation either via `zstd-jni` or directly. In
this way, the Java reader/writer code is an outlier.
* Bugs and regressions will generally be detected and fixed much faster,
given the larger user base and active developer community of `zstd` and
`zstd-jni`.

The largest tradeoff is that `zstd-jni` wraps compiled native code, so it only
works on platforms for which a native library is available. That said,
`zstd-jni` already bundles binaries for many microprocessor architectures, so
this should be a rare hurdle.

### Open issues:

* What is the best way to expose codec-specific options to users? In this
PR, we add the compression level, window log size, and a boolean for enabling
long mode, as new conf settings. But the `CompressionCodec` interface seems
limited to exposing an enum with 3 options for speed, e.g. `FAST` or `DEFAULT`,
and other codec-specific configs don't have a clear way to make it down into
the codec implementation itself. I think we want to allow users to set the
actual level as an integer, and to specify the window log size & long mode
boolean as they wish. It wasn't clear how to pass these confs down
to the lower-level `ZstdCodec` within the bounds of the existing
`CompressionCodec` interface. For now, the codec reads the conf options
through a hack.
* I still need to implement the `DirectByteBuffer` handling case. Right now,
each call is treated the same way and will incur unnecessary copying if the
input `ByteBuffer` is direct.
* The existing code has a loop structure to repeatedly decompress. I wasn't
sure why this exists, but we should mimic it in this PR, and I haven't done
that yet.
* Benchmarks should be added to this PR description. Although `zstd-jni`
should have superior performance across the board, it's important to actually
measure that with the benchmark suite.
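
To make the `DirectByteBuffer` item above more concrete, here is a minimal sketch of the dispatch I have in mind. It assumes the binding offers one entry point that works on direct buffers and one that works on `byte[]` arrays; the class `BufferPaths`, the `Path` enum, and both methods are hypothetical names for illustration and are not code from this PR. Only standard `java.nio` calls are used.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper that picks a code path per buffer pair; NOT code from this PR. */
final class BufferPaths {
    enum Path {
        DIRECT, // both buffers are off-heap: hand them to the native call as-is
        ARRAY,  // both buffers expose a backing array: use the byte[] entry point
        COPY    // mixed or read-only buffers: copy into a temporary array first
    }

    private BufferPaths() {}

    static Path choose(ByteBuffer in, ByteBuffer out) {
        if (in.isDirect() && out.isDirect()) {
            return Path.DIRECT;
        }
        if (in.hasArray() && out.hasArray()) {
            return Path.ARRAY;
        }
        return Path.COPY;
    }

    /** Copies the remaining bytes without moving the caller's position. */
    static byte[] copyRemaining(ByteBuffer in) {
        byte[] copy = new byte[in.remaining()];
        in.duplicate().get(copy); // duplicate() shares content but has its own position
        return copy;
    }
}
```

With this split, the extra copy is paid only on the `COPY` path instead of on every call.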

**List of changes:**

* Add zstd-jni dependency, and add a new CompressionCodec ZstdCodec that
uses it. Add ORC conf to set compression level.

* Add ORC conf to use long mode, and add configuration setters for windowLog
and longModeEnable.

* Add tests that verify the correctness of writing and reading across
compression levels, window sizes, and long mode use.

* Add test for compatibility between Zstd aircompressor and zstd-jni
implementations.

* Fix filterWithSeek test with a smaller percentage.

* Minor formatting and spelling fixes.

### How was this patch tested?
* Unit tests that read and write ORC files using a variety of
compression levels, window logs, and long mode booleans all pass.
* Unit test to compress and decompress between `aircompressor` and
`zstd-jni` passes. Note that the current `aircompressor` implementation uses a
small subset of levels, so the test only compares data using the default
compression settings.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
