Re: Bit-packing decode optimization on Parquet-mr

Driesprong, Fokko Thu, 26 May 2022 07:09:41 -0700

Dear Xie,

Thanks for reaching out. All contributions to the Parquet projects are
welcome! Feel free to open a PR on GitHub.


Unfortunately, I'm unable to see the images in the mail. Do you have a
blog post or something similar where I can view them? Thanks!

Kind regards,
Fokko Driesprong


On Thu, 26 May 2022 at 06:45, Xie, Fang <fang....@intel.com> wrote:

> Dear Parquet team,
>
> I am a software engineer at Intel. We optimized Parquet bit-packing
> encoding and decoding with jdk.incubator.vector in OpenJDK 18, which brings
> a significant performance improvement.
>
> Could we contribute this optimization to the parquet-mr community?
>
> Because the Vector API was first added in OpenJDK 16, this optimization
> requires JDK 16 or higher.
>
>
>
> *Below are our test results*
>
> The functional test is based on the open-source parquet-mr bit-packing
> decode function: *public final void unpack8Values(final byte[] in, final int
> inPos, final int[] out, final int outPos)*
>
> compared with our Vector API implementation: *public final void
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final
> int outPos)*
>
> We tested 10 pairs of decode functions (the open-source Parquet bit
> unpacking vs. our optimized vectorized SIMD implementation) with bit
> widths {1,2,3,4,5,6,7,8,9,10}. The test results are below:
>
>
>
> We integrated our bit-packing decode implementation into parquet-mr and
> tested Parquet batch reading through Spark's VectorizedParquetRecordReader,
> which reads Parquet column data in batches. We constructed Parquet files with
> different row and column counts. The column data type is Int32 and the
> maximum int value is 127, which allows bit-packed encoding with bit width 7.
> The row count ranges from 10k to 100 million and the column count ranges
> from 1 to 4.
>
>
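
For readers following the thread without the images, here is a minimal scalar
sketch of what unpacking 8 bit-packed values means, and why a maximum value of
127 fits in bit width 7. It is not the parquet-mr generated code and not the
Vector API implementation described above (which is not shown in this mail).
The class name BitUnpackSketch, the extra width parameter, the pack8Values
helper, and the LSB-first bit order are all assumptions made for illustration.

```java
// Hypothetical illustration only; not taken from parquet-mr.
final class BitUnpackSketch {

    // Smallest bit width that can represent a non-negative maxValue.
    // For maxValue = 127 this returns 7.
    public static int bitWidth(int maxValue) {
        return 32 - Integer.numberOfLeadingZeros(maxValue);
    }

    // Packs 8 ints into `width` bytes, least significant bit first
    // (assumed bit order). `out` must be zero-initialized in the target range.
    public static void pack8Values(int[] in, int inPos, byte[] out, int outPos, int width) {
        int bitPos = 0;
        for (int i = 0; i < 8; i++) {
            int v = in[inPos + i];
            for (int b = 0; b < width; b++) {
                if (((v >>> b) & 1) != 0) {
                    out[outPos + (bitPos >>> 3)] |= (byte) (1 << (bitPos & 7));
                }
                bitPos++;
            }
        }
    }

    // Scalar unpack of 8 values of `width` bits each (1..32), reading
    // exactly `width` bytes starting at inPos.
    public static void unpack8Values(byte[] in, int inPos, int[] out, int outPos, int width) {
        long buffer = 0L;          // holds bits not yet consumed
        int bitsInBuffer = 0;      // number of valid bits in buffer
        int bytePos = inPos;
        long mask = (1L << width) - 1;
        for (int i = 0; i < 8; i++) {
            // Refill one byte at a time until a full value is available.
            while (bitsInBuffer < width) {
                buffer |= (in[bytePos++] & 0xFFL) << bitsInBuffer;
                bitsInBuffer += 8;
            }
            out[outPos + i] = (int) (buffer & mask);
            buffer >>>= width;
            bitsInBuffer -= width;
        }
    }
}
```

With width 7, eight values occupy exactly 7 bytes, so a column whose values
never exceed 127 needs 7 bits per value instead of 32. A SIMD version would
replace the per-value loop with wide loads, shifts, and masks over many values
at once.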
