Dear Xie,

Thanks for reaching out. All contributions to the Parquet projects are welcome! Feel free to open a PR on GitHub.
Unfortunately, I'm unable to see the images in the mail. Do you have a blog post or something?

Thanks!

Kind regards,
Fokko Driesprong

On Thu, 26 May 2022 at 06:45, Xie, Fang <fang....@intel.com> wrote:

> Dear Parquet team,
>
> I am a software engineer at Intel. We optimized Parquet bit-packing
> encoding and decoding with jdk.incubator.vector in OpenJDK 18, which
> brings a prominent performance improvement.
>
> Would it be possible to contribute our optimization to the parquet-mr
> community?
>
> Because the Vector API was added to OpenJDK in version 16, this
> optimization requires JDK 16 or higher.
>
> *Below are our test results*
>
> The functional test is based on the open-source parquet-mr bit-pack
> decoding function:
> *public final void unpack8Values(final byte[] in, final int inPos, final int[] out, final int outPos)*
>
> compared with our implementation using the Vector API:
> *public final void unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final int outPos)*
>
> We tested 10 pairs of decode functions (open-source Parquet bit
> unpacking vs. our optimized vectorized SIMD implementation) with bit
> width={1,2,3,4,5,6,7,8,9,10}. Below are the test results:
>
>
> We integrated our bit-packing decode implementation into parquet-mr and
> tested the Parquet batch-reading path through Spark's
> VectorizedParquetRecordReader, which reads Parquet column data in
> batches. We constructed Parquet files with different row and column
> counts. The column data type is Int32 and the maximum int value is 127,
> which fits bit-packed encoding with bit width=7. The row count ranges
> from 10k to 100 million and the column count from 1 to 4.
>
>