[GitHub] [spark] jiangjiguang opened a new pull request, #40719: [WIP]Speed up parquet reading with Java Vector API

via GitHub Sun, 09 Apr 2023 20:26:19 -0700


jiangjiguang opened a new pull request, #40719:
URL: https://github.com/apache/spark/pull/40719


   ### What changes were proposed in this pull request?
   Parquet added a vectorized read speed-up in this PR: 
https://github.com/apache/parquet-mr/pull/1011
   According to the Parquet microbenchmark, the performance gain is 4x to 8x.
   With the Parquet vector optimization integrated into Apache Spark, 
TPC-H (SF100) Q6 shows an 11% performance increase.
   
   ### Why are the changes needed?
   This PR is needed to support the Parquet vector optimization in Spark.
   
   ### Does this PR introduce _any_ user-facing change?
   Adds the configuration spark.sql.parquet.vector512.read.enabled. If it is 
true and the CPU supports the avx512vbmi and avx512_vbmi2 instruction sets, 
Parquet decodes using the Java Vector API. Among Intel CPUs, Ice Lake and 
newer provide the required instruction sets.
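
As a rough illustration of the CPU requirement above, the following sketch checks whether a Linux machine reports both flags in /proc/cpuinfo. It is not how this PR or Parquet detects the instruction sets; the function names are hypothetical, and it assumes the flags appear in /proc/cpuinfo under the same names used above (avx512vbmi and avx512_vbmi2).

```python
# Hypothetical helper: not part of Spark or Parquet.
# Assumes a Linux host whose /proc/cpuinfo lists CPU features on "flags" lines.
REQUIRED_FLAGS = {"avx512vbmi", "avx512_vbmi2"}


def cpu_flags(cpuinfo_text):
    """Collect the flags from every 'flags' line of /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_vector512(cpuinfo_text):
    """Return True only if every required flag is reported."""
    return REQUIRED_FLAGS <= cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or the file is unreadable
    print("required flags present:", supports_vector512(text))
```

If the check passes, the configuration could then be switched on per job in the usual way, for example with --conf spark.sql.parquet.vector512.read.enabled=true, assuming a Spark build that includes this change.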
   
   ### How was this patch tested?
   For the test case, there are some problems to fix:
   1. The Parquet-mr community needs to publish a new Java release before the 
Parquet vector optimization can be used.
   2. The Parquet vector optimization is not released by default, so users 
have to build Parquet manually with mvn clean install -P vector-plugins to get 
parquet-encoding-vector-{VERSION}.jar and put it on the {SPARK_HOME}/jars path.
   3. GitHub does not support selecting runners with a specific instruction 
set, so it is impossible to verify the optimization on GitHub-hosted runner 
machines (a self-hosted runner can do it).
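
For the manual step in item 2, a small check like the one below can confirm that the built jar actually landed in the Spark jars directory. This is only a sketch: the helper names are hypothetical, and it assumes the jar follows the parquet-encoding-vector-{VERSION}.jar naming given above and that SPARK_HOME is set in the environment.

```python
# Hypothetical helper: not part of Spark or Parquet.
import fnmatch
import os

# Assumed naming, taken from parquet-encoding-vector-{VERSION}.jar above.
JAR_PATTERN = "parquet-encoding-vector-*.jar"


def vector_jars(filenames):
    """Return the file names that look like the vector encoding jar."""
    return sorted(n for n in filenames if fnmatch.fnmatch(n, JAR_PATTERN))


def find_vector_jars(spark_home):
    """List matching jars under {spark_home}/jars, or [] if the directory is missing."""
    jars_dir = os.path.join(spark_home, "jars")
    if not os.path.isdir(jars_dir):
        return []
    return vector_jars(os.listdir(jars_dir))


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME", "")
    if not spark_home:
        print("SPARK_HOME is not set")
    else:
        print(find_vector_jars(spark_home))
```

An empty list means the jar is not on the {SPARK_HOME}/jars path yet.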
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

