dchristle opened a new pull request #988: URL: https://github.com/apache/orc/pull/988
### What changes were proposed in this pull request? This PR proposes to replace the [`aircompressor`](https://github.com/airlift/aircompressor) library for ORC's ZStandard compression with [`zstd-jni`](https://github.com/luben/zstd-jni), which is a set of JNI bindings around the [official `zstd` library](https://github.com/facebook/zstd). In addition to switching the underlying library, this PR also exposes the compression level and "long mode" settings to ORC users. These settings allow user choice around different speed/compression tradeoffs, rather than the current approach that primarily uses a default setting. ### Why are the changes needed? These change makes sense for a few reasons: * ORC users will gain all the improvements from the main `zstd` library. It is under active development and receives regular speed and compression improvements. In contrast, `aircompressor`'s zstd implementation is older and stale. * ORC users will be able to use the entire speed/compression tradeoff space. Today, `aircompressor`'s implementation has only one of eight compression strategies ([link](https://github.com/airlift/aircompressor/blob/c5e6972bd37e1d3834514957447028060a268eea/src/main/java/io/airlift/compress/zstd/CompressionParameters.java#L143)). This means only a small range of faster but less compressive strategies can be exposed to ORC users. ORC storage with high compression (e.g. for large-but-infrequently-used data) is a clear use case that this PR would unlock. * It will harmonize the Java ORC implementation with other projects in the Hadoop ecosystem. Parquet, Spark, and even the C++ ORC reader/writers all rely on the official `zstd` implementation either via `zstd-jni` or directly. In this way, the Java reader/writer code is an outlier. * Detection and fixing any bugs or regressions will generally happen much faster, given the larger number of users and active developer community of `zstd` and `zstd-jni`. The largest tradeoff is that `zstd-jni` wraps compiled code. That said, many microprocessor architectures are already targeted & bundled into `zstd-jni`, so this should be a rare hurdle. ### Open issues: * What is the best way to expose codec-specific options to users? In this PR, we add the compression level, window log size, and a boolean for enabling long mode, as new conf settings. But the `CompressionCodec` interface seems limited to exposing an enum with 3 options for speed, e.g. `FAST` or `DEFAULT`, and other codec-specific configs don't have a clear way to make it down into the codec implementation itself. I think we want to allow users to set the actual level as an integer, and to specify the window log size & long mode boolean as they wish. It wasn't clear how I could communicate these confs down to the lower level `ZstdCodec` within the bounds of the existing `CompressionCodec` interface. Right now, I used a hack to get the codec to read the conf options. * I still need to implement the `DirectByteBuffer` handling case. Right now, each call is treated the same way and will incur unnecessary copying if the input `ByteBuffer` is direct. * The existing code has a loop structure to repeatedly decompress. I wasn't sure why this exists, but we should mimic it in this PR, and I haven't done that yet. * Benchmarks should be added to this PR description. Although `zstd-jni` should have superior performance across the board, it's important to actually measure that with the benchmark suite. **List of changes:** * Add zstd-jni dependency, and add a new CompressionCodec ZstdCodec that uses it. Add ORC conf to set compression level. * Add ORC conf to use long mode, and add configuration setters for windowLog and longModeEnable. * Add tests that verify the correctness of writing and reading across compression levels, window sizes, and long mode use. * Add test for compatibility between Zstd aircompressor and zstd-jni implementations. * Fix filterWithSeek test with a smaller percentage. * Minor formatting and spelling fixes. ### How was this patch tested? * Unit tests for reading and writing ORC files using a variety of compression levels, window logs, and long mode booleans, all pass. * Unit test to compress and decompress between `aircompressor` and `zstd-jni` passes. Note that the current `aircompressor` implementation uses a small subset of levels, so the test only compares data using the default compression settings. <!-- Thanks for sending a pull request! Here are some tips for you: 1. File a JIRA issue first and use it as a prefix of your PR title, e.g., `ORC-001: Fix ABC`. 2. Use your PR title to summarize what this PR proposes instead of describing the problem. 3. Make PR title and description complete because these will be the permanent commit log. 4. If possible, provide a concise and reproducible example to reproduce the issue for a faster review. 5. If the PR is unfinished, use GitHub PR Draft feature. --> <!-- Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below. 1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers. 2. If there is a discussion in the mailing list, please add the link. --> <!-- Please clarify why the changes are needed. For instance, 1. If you propose a new API, clarify the use case for a new API. 2. If you fix a bug, you can clarify why it is a bug. --> <!-- If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org