[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17681958#comment-17681958
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

jatin-bhateja commented on PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#issuecomment-1408198456

   > > Sorry for the delay. I have left some comments and the implementation is 
overall looking good. Thanks @jiangjiguang for your effort!
   > > My main concern is the extensibility to support other instruction sets. 
In addition, it seems to me that the java vector api is still incubating. As I 
am not a java expert, do we have the risk of unstable API?
   > 
   > @wgtmac Jatin is a java expert, @jatin-bhateja Can you help give an 
answer? thanks.
   
   Hi @wgtmac, our patch vectorizes the unpacking algorithm for various decode 
bit sizes. The entire new functionality is exposed through a plugin interface, 
**ParquetReadRouter**. To prevent performance regressions on other targets, we 
have enabled the new functionality only for x86 targets with the required CPU 
features; this limitation can be removed over time.
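   
   To illustrate the kind of gating described above, here is a minimal sketch. 
It is not the code in the PR: the class `VectorPathGate` and its methods are 
hypothetical names, and it only checks the architecture string and whether the 
`jdk.incubator.vector` module was resolved at startup. It does not probe CPU 
features, which `os.arch` cannot report.

```java
import java.util.Locale;

// Hypothetical helper, not part of parquet-mr: decides whether a vectorized
// decode path may be used; callers fall back to the scalar path otherwise.
final class VectorPathGate {
    private VectorPathGate() {}

    // True for the common os.arch spellings of 32-bit and 64-bit x86.
    static boolean isX86(String osArch) {
        if (osArch == null) {
            return false;
        }
        String arch = osArch.toLowerCase(Locale.ROOT);
        return arch.equals("amd64") || arch.equals("x86_64")
            || arch.equals("x86") || arch.equals("i386");
    }

    // True only when the incubator module is in the boot layer, for example
    // after starting the JVM with --add-modules jdk.incubator.vector.
    static boolean vectorModulePresent() {
        return ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent();
    }

    // Assumption: both conditions are required before trying the vector path.
    static boolean useVectorPath() {
        return isX86(System.getProperty("os.arch")) && vectorModulePresent();
    }
}
```

   A reader could call `VectorPathGate.useVectorPath()` once at startup and 
keep the existing scalar decoder whenever it returns false, so other targets 
see no change in behavior.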
   
   The Vector API first appeared in JDK 16 and has been maturing with each 
successive release. I do not have a firm timeline at this point for when it 
will exit incubation and be exposed as a preview feature. The intent here is 
to enable parquet-mr community developers to use the plugin in the Parquet 
reader and provide us with early feedback. We are also in the process of 
vectorizing the packer algorithm.
   
   As this is a large project, we plan to do it incrementally, and we seek your 
guidance in pushing this patch through. 




> Parquet bit-packing de/encode optimization
> ------------------------------------------
>
>                 Key: PARQUET-2159
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2159
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.13.0
>            Reporter: Fang-Xie
>            Assignee: Fang-Xie
>            Priority: Major
>             Fix For: 1.13.0
>
>         Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but the 
> built-in bit-packing encode/decode is not efficient enough. 
> Our optimization of Parquet bit-packing encode/decode with jdk.incubator.vector 
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API was added to OpenJDK in version 16, this optimization 
> requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_*, 
> compared with our Vector API implementation *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*.
> We tested 10 pairs of decode functions (open-source parquet-mr bit unpacking 
> vs. our optimized vectorized SIMD implementation) with bit 
> width={1,2,3,4,5,6,7,8,9,10}. The test results are below:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested 
> the Parquet batch reading path through Spark's VectorizedParquetRecordReader, 
> which reads Parquet column data in batches. We constructed Parquet files 
> with different row and column counts. The column data type is Int32 and the 
> maximum int value is 127, which can be bit-packed with bit width=7. 
> The row count ranges from 10k to 100 million and the column count 
> ranges from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!
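
For orientation, the scalar operation that the issue description benchmarks 
against can be sketched as below. This is not the parquet-mr implementation 
(which generates one specialized method per bit width); the class name 
`ScalarUnpackSketch` and the extra `bitWidth` parameter are assumptions made 
for illustration, and the sketch assumes LSB-first packing as used by Parquet's 
RLE/bit-packed hybrid encoding.

```java
// Hypothetical scalar reference, not the parquet-mr implementation.
final class ScalarUnpackSketch {
    private ScalarUnpackSketch() {}

    // Unpacks 8 values of bitWidth bits each (1..32) from in, starting at
    // inPos, into out starting at outPos. Reads exactly bitWidth bytes.
    // Assumption: values are packed LSB-first within the byte stream.
    static void unpack8Values(byte[] in, int inPos, int[] out, int outPos, int bitWidth) {
        long mask = (1L << bitWidth) - 1;
        long buffer = 0;        // bits read from the input but not yet consumed
        int bitsInBuffer = 0;   // number of valid low-order bits in buffer
        int next = inPos;
        for (int i = 0; i < 8; i++) {
            // Refill one byte at a time until a full value is available.
            while (bitsInBuffer < bitWidth) {
                buffer |= (in[next++] & 0xFFL) << bitsInBuffer;
                bitsInBuffer += 8;
            }
            out[outPos + i] = (int) (buffer & mask);
            buffer >>>= bitWidth;
            bitsInBuffer -= bitWidth;
        }
    }
}
```

With bit width 3, the three bytes 0x88 0xC6 0xFA decode to the values 0 through 
7. With bit width 7, as in the Spark test above, each group of 8 values occupies 
7 input bytes. A vectorized version processes several such groups per 
instruction instead of looping value by value.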



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
