In my own profiling of ParquetMR (as it is used by Trino), I have also found 
these bit-packing methods to be a performance bottleneck. Of the existing ones, 
the ones that take an array are faster than the ones that take a ByteBuffer. It 
sure would be nice to have even faster ones!

From: "Xie, Fang" <>
Reply-To: "" <>
Date: Thursday, May 26, 2022 at 12:46 AM
To: "" <>
Subject: [EXTERNAL] Bit-packing decode optimization on Parquet-mr


Hi dear Parquet team,
I am an Intel software engineer. We optimized Parquet bit-packing encode/decode 
using jdk.incubator.vector in OpenJDK 18, which brings a prominent performance 
improvement. Could we contribute our optimization to the parquet-mr community?
Because the Vector API has been part of OpenJDK since JDK 16, this optimization 
requires JDK 16 or higher.

Below are our test results.
The functional test is based on the open-source parquet-mr bit-pack decoding 
function:
public final void unpack8Values(final byte[] in, final int inPos, final int[] 
out, final int outPos)
compared with our Vector API implementation:
public final void unpack8Values_vec(final byte[] in, final int inPos, final 
int[] out, final int outPos)
We tested 10 pairs of decode functions (open-source Parquet bit unpacking vs. 
our optimized vectorized SIMD implementation) with bit 
widths {1,2,3,4,5,6,7,8,9,10}; test results below:
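To make the shape of these decode functions concrete, here is a minimal scalar sketch that unpacks 8 values at bit width 7, mirroring the unpack8Values signature above. This is an illustrative reimplementation, not the parquet-mr code itself; the byte/bit order here assumes Parquet's little-endian, LSB-first bit packing.

```java
// Illustrative sketch: scalar bit-unpacking at bit width 7 (not parquet-mr code).
// 8 values x 7 bits = 56 bits = 7 input bytes per call.
public class BitUnpack7 {
    static final int BIT_WIDTH = 7;
    static final int MASK = (1 << BIT_WIDTH) - 1; // 0x7F

    // Unpacks 8 seven-bit values from in[inPos..inPos+6] into out[outPos..outPos+7],
    // assuming little-endian (least-significant-bit-first) packing.
    public static void unpack8Values(byte[] in, int inPos, int[] out, int outPos) {
        // Gather the 7 input bytes into one 56-bit buffer.
        long buf = 0;
        for (int i = 0; i < BIT_WIDTH; i++) {
            buf |= (long) (in[inPos + i] & 0xFF) << (8 * i);
        }
        // Slice out eight 7-bit fields.
        for (int i = 0; i < 8; i++) {
            out[outPos + i] = (int) ((buf >>> (i * BIT_WIDTH)) & MASK);
        }
    }
}
```

The vectorized variant replaces the inner shift-and-mask loop with SIMD lane operations; the per-call contract (8 values in, 8 values out) stays the same.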

We integrated our bit-packing decode implementation into parquet-mr and tested 
the Parquet batch-reading ability of Spark's VectorizedParquetRecordReader, 
which reads Parquet column data in batches. We constructed Parquet files with 
varying row and column counts; the column data type is Int32 with a maximum 
value of 127, which satisfies bit-pack encoding with bit width 7. The row count 
ranges from 10k to 100 million and the column count from 1 to 4.
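The choice of 127 as the maximum value pins the encoded bit width at 7, since 127 fits in exactly 7 bits. A one-line sketch of that relationship (a hypothetical helper, not a parquet-mr API):

```java
// Hypothetical helper: bit width needed to pack non-negative ints up to maxValue.
// 127 -> 7, 128 -> 8, so capping values at 127 forces bit width 7.
public class BitWidth {
    public static int bitWidth(int maxValue) {
        return maxValue == 0 ? 0 : 32 - Integer.numberOfLeadingZeros(maxValue);
    }
}
```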
