dchristle opened a new pull request #988:
URL: https://github.com/apache/orc/pull/988


   ### What changes were proposed in this pull request?
   This PR proposes to replace the 
[`aircompressor`](https://github.com/airlift/aircompressor) library for ORC's 
ZStandard compression with [`zstd-jni`](https://github.com/luben/zstd-jni), 
which is a set of JNI bindings around the [official `zstd` 
library](https://github.com/facebook/zstd). In addition to switching the 
underlying library, this PR also exposes the compression level and "long mode" 
settings to ORC users. These settings allow user choice around different 
speed/compression tradeoffs, rather than the current approach that primarily 
uses a default setting.
   
   ### Why are the changes needed?
   These change makes sense for a few reasons:
   
   * ORC users will gain all the improvements from the main `zstd` library. It 
is under active development and receives regular speed and compression 
improvements. In contrast, `aircompressor`'s zstd implementation is older and 
stale.
   * ORC users will be able to use the entire speed/compression tradeoff space. 
Today, `aircompressor`'s implementation has only one of eight compression 
strategies 
([link](https://github.com/airlift/aircompressor/blob/c5e6972bd37e1d3834514957447028060a268eea/src/main/java/io/airlift/compress/zstd/CompressionParameters.java#L143)).
 This means only a small range of faster but less compressive strategies can be 
exposed to ORC users. ORC storage with high compression (e.g. for 
large-but-infrequently-used data) is a clear use case that this PR would unlock.
   * It will harmonize the Java ORC implementation with other projects in the 
Hadoop ecosystem. Parquet, Spark, and even the C++ ORC reader/writers all rely 
on the official `zstd` implementation either via `zstd-jni` or directly. In 
this way, the Java reader/writer code is an outlier.
   * Detection and fixing any bugs or regressions will generally happen much 
faster, given the larger number of users and active developer community of 
`zstd` and `zstd-jni`.
   
   The largest tradeoff is that `zstd-jni` wraps compiled code. That said, many 
microprocessor architectures are already targeted & bundled into `zstd-jni`, so 
this should be a rare hurdle.
   
   ### Open issues:
   
   * What is the best way to expose codec-specific options to users? In this 
PR, we add the compression level, window log size, and a boolean for enabling 
long mode, as new conf settings. But the `CompressionCodec` interface seems 
limited to exposing an enum with 3 options for speed, e.g. `FAST` or `DEFAULT`, 
and other codec-specific configs don't have a clear way to make it down into 
the codec implementation itself. I think we want to allow users to set the 
actual level as an integer, and to specify the window log size & long mode 
boolean as they wish. It wasn't clear how I could communicate these confs down 
to the lower level `ZstdCodec` within the bounds of the existing 
`CompressionCodec` interface. Right now, I used a hack to get the codec to read 
the conf options.
   * I still need to implement the `DirectByteBuffer` handling case. Right now, 
each call is treated the same way and will incur unnecessary copying if the 
input `ByteBuffer` is direct.
   * The existing code has a loop structure to repeatedly decompress. I wasn't 
sure why this exists, but we should mimic it in this PR, and I haven't done 
that yet.
   * Benchmarks should be added to this PR description. Although `zstd-jni` 
should have superior performance across the board, it's important to actually 
measure that with the benchmark suite.
   
   **List of changes:**
   
   * Add zstd-jni dependency, and add a new CompressionCodec ZstdCodec that 
uses it. Add ORC conf to set compression level.
   
   * Add ORC conf to use long mode, and add configuration setters for windowLog 
and longModeEnable.
   
   * Add tests that verify the correctness of writing and reading across 
compression levels, window sizes, and long mode use.
   
   * Add test for compatibility between Zstd aircompressor and zstd-jni 
implementations.
   
   * Fix filterWithSeek test with a smaller percentage.
   
   * Minor formatting and spelling fixes.
   
   ### How was this patch tested?
   * Unit tests for reading and writing ORC files using a variety of 
compression levels, window logs, and long mode booleans, all pass.
   * Unit test to compress and decompress between `aircompressor` and 
`zstd-jni` passes. Note that the current `aircompressor` implementation uses a 
small subset of levels, so the test only compares data using the default 
compression settings.
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. File a JIRA issue first and use it as a prefix of your PR title, e.g., 
`ORC-001: Fix ABC`.
     2. Use your PR title to summarize what this PR proposes instead of 
describing the problem.
     3. Make PR title and description complete because these will be the 
permanent commit log.
     4. If possible, provide a concise and reproducible example to reproduce 
the issue for a faster review.
     5. If the PR is unfinished, use GitHub PR Draft feature.
   -->
   
   
   <!--
   Please clarify what changes you are proposing. The purpose of this section 
is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster 
reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class 
hierarchy will help reviewers.
     2. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   
   <!--
   If tests were added, say they were added here. Please make sure to add some 
test cases that check the changes thoroughly including negative and positive 
cases if possible.
   If it was tested in a way different from regular unit tests, please clarify 
how you tested step by step, ideally copy and paste-able, so that other 
reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why 
it was difficult to add.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to