jiangjiguang opened a new pull request, #40719: URL: https://github.com/apache/spark/pull/40719
### What changes were proposed in this pull request?

Parquet now supports vectorized decoding through https://github.com/apache/parquet-mr/pull/1011. According to the Parquet microbenchmark, the performance gain is 4x ~ 8x. With Apache Spark integrating the Parquet vector optimization, TPC-H (SF100) Q6 shows an 11% performance increase.

### Why are the changes needed?

This PR adds support for the Parquet vector optimization in Spark.

### Does this PR introduce _any_ user-facing change?

Yes. It adds the configuration `spark.sql.parquet.vector512.read.enabled`. If it is true and the CPU supports the avx512vbmi and avx512_vbmi2 instruction sets, Parquet decodes using the Java Vector API. Among Intel CPUs, Ice Lake and newer provide the required instruction sets.

### How was this patch tested?

Some problems need to be fixed before test cases can be added:

1. The parquet-mr community needs to release a new Java version before the Parquet vector optimization can be used.
2. The Parquet vector optimization is not released by default, so users have to build Parquet manually with `mvn clean install -P vector-plugins` to get `parquet-encoding-vector-{VERSION}.jar` and put it on the `{SPARK_HOME}/jars` path.
3. GitHub does not support selecting runners with a specific instruction set, so the optimization cannot be verified on GitHub-hosted runner machines (a self-hosted runner could do it).
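The CPU requirement above (both avx512vbmi and avx512_vbmi2 present) can be checked by hand before enabling `spark.sql.parquet.vector512.read.enabled`. The following is a minimal sketch, assuming a Linux host where `/proc/cpuinfo` lists feature flags on `flags` lines. The helper names `cpu_flags` and `supports_vector512` are hypothetical and are not part of Spark or Parquet; this is not how the PR itself detects the instruction sets.

```python
from pathlib import Path

# Flag names as spelled in the PR description; assumed to match the
# tokens Linux reports on the "flags" lines of /proc/cpuinfo.
REQUIRED_FLAGS = frozenset({"avx512vbmi", "avx512_vbmi2"})


def cpu_flags(cpuinfo_text):
    """Collect every feature flag found on 'flags' lines of cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_vector512(cpuinfo_text):
    """Return True only if all required AVX-512 flags are present."""
    return REQUIRED_FLAGS <= cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        ok = supports_vector512(cpuinfo.read_text())
        print("required AVX-512 flags present:", ok)
    else:
        print("/proc/cpuinfo not found; this sketch only covers Linux")
```

If the check passes and the vector jar is on the `{SPARK_HOME}/jars` path, the setting could then be supplied like any other Spark SQL configuration, for example with `--conf spark.sql.parquet.vector512.read.enabled=true`.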
