[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

Timothy Miller (Jira) Thu, 16 Jun 2022 07:18:04 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555099#comment-17555099
 ]


Timothy Miller commented on PARQUET-2159:
-----------------------------------------

I frequently wish Java had a preprocessor like C++ does, since that would solve 
your problem. Currently, we have to build ParquetMR with Java 8, and plenty of 
things that depend on it (like Trino and Presto) use Java 11 at the latest. 
There are some solutions involving runtime loading of class files (e.g. 
https://stackoverflow.com/questions/4526113/java-conditional-compilation-how-to-prevent-code-chunks-from-being-compiled), 
but there's already enough weirdness in the ParquetMR build process (e.g. 
compile-time generated code that makes debugging a huge pain) that I hesitate 
to suggest making it even more challenging.

Where does the code you're working on actually come from? Is it part of 
ParquetMR or some other library? I seem to recall that when debugging Trino, 
the code you're working on has to go through the decompiler in IntelliJ. So if 
it's already external in some way, then it probably wouldn't hurt to make it 
just a bit more dynamic, where ParquetMR loads different versions of the 
bit-packing library depending on the Java version.
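
To make the "load different versions depending on the Java version" idea a bit 
more concrete, here is a minimal sketch. Every name in it (BitUnpacker, 
ScalarBitUnpacker, UnpackerLoader, org.example.VectorBitUnpacker) is 
hypothetical and not part of ParquetMR. It assumes the vectorized class would 
be compiled separately with JDK 16+ and simply be absent or unloadable on older 
runtimes, in which case the loader falls back to a scalar implementation. The 
scalar class only handles bit width 8 and is a stand-in, not the real packer.

```java
/** Hypothetical interface; the real ParquetMR packer classes are shaped differently. */
interface BitUnpacker {
    void unpack8Values(byte[] in, int inPos, int[] out, int outPos);
}

/** Stand-in scalar fallback for bit width 8 only: each input byte is one value. */
final class ScalarBitUnpacker implements BitUnpacker {
    @Override
    public void unpack8Values(byte[] in, int inPos, int[] out, int outPos) {
        for (int i = 0; i < 8; i++) {
            out[outPos + i] = in[inPos + i] & 0xFF; // mask to treat the byte as unsigned
        }
    }
}

final class UnpackerLoader {
    // Hypothetical class name. The idea is that this class is built with JDK 16+
    // and uses jdk.incubator.vector, so it cannot be loaded on Java 8 or 11.
    static final String VECTOR_IMPL = "org.example.VectorBitUnpacker";

    /** Parses java.specification.version: "1.8" gives 8, "11" gives 11, "17" gives 17. */
    static int parseMajor(String specVersion) {
        return specVersion.startsWith("1.")
                ? Integer.parseInt(specVersion.substring(2))
                : Integer.parseInt(specVersion);
    }

    /** Returns the vectorized implementation if it can be loaded, else the scalar one. */
    static BitUnpacker load() {
        int major = parseMajor(System.getProperty("java.specification.version"));
        if (major >= 16) {
            try {
                Class<?> c = Class.forName(VECTOR_IMPL);
                return (BitUnpacker) c.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
                // Class missing, built for a newer JVM, or of the wrong type:
                // fall through to the scalar implementation.
            }
        }
        return new ScalarBitUnpacker();
    }
}
```

Callers would only ever see the BitUnpacker interface, so code compiled with 
Java 8 never references the Vector API classes directly.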

> Parquet bit-packing de/encode optimization
> ------------------------------------------
>
>                 Key: PARQUET-2159
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2159
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.13.0
>            Reporter: Fang-Xie
>            Priority: Major
>             Fix For: 1.13.0
>
>         Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but the 
> built-in bit-packing encode/decode is not efficient enough. 
> Our optimization of Parquet bit-packing encode/decode with jdk.incubator.vector 
> in OpenJDK 18 brings a significant performance improvement.
> Because the Vector API was added to OpenJDK in version 16, this optimization 
> requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos), 
> compared with our Vector API implementation public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos).
> We tested 10 pairs of decode functions (open-source parquet bit unpacking vs. 
> our optimized vectorized SIMD implementation) with bit 
> width={1,2,3,4,5,6,7,8,9,10}; the test results are below:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested 
> the Parquet batch-reading path of Spark's VectorizedParquetRecordReader, 
> which reads Parquet column data in batches. We constructed Parquet files 
> with different row and column counts. The column data type is Int32 and the 
> maximum int value is 127, which allows bit-packed encoding with bit width=7. 
> The row count ranges from 10k to 100 million and the column count 
> ranges from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
