kazuyukitanimura opened a new pull request #34611:
URL: https://github.com/apache/spark/pull/34611


   ### What changes were proposed in this pull request?
   This PR proposes to enable vectorized read for `VectorizedPlainValuesReader.readBooleans`. Currently, `readBooleans` unpacks encoded boolean values bit by bit:
   ```
     public final void readBooleans(int total, WritableColumnVector c, int rowId) {
       // TODO: properly vectorize this
       for (int i = 0; i < total; i++) {
         c.putBoolean(rowId + i, readBoolean());
       }
     }
   ```
   The main idea is to unpack 8 bits (one byte) at a time, instead of bit by bit, whenever the current read position is byte-aligned.
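
A minimal sketch of the byte-at-a-time idea, not the code in this PR: it assumes the Parquet PLAIN boolean layout (bit-packed, least significant bit first) and uses a plain `boolean[]` as a stand-in for `WritableColumnVector`. The class name `BooleanUnpackSketch` and its fields are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the reader; boolean[] replaces WritableColumnVector.
final class BooleanUnpackSketch {
  private final ByteBuffer in;
  private byte currentByte = 0; // byte currently being consumed bit by bit
  private int bitOffset = 0;    // next bit to read in currentByte (0 means "fetch a new byte")

  public BooleanUnpackSketch(byte[] data) {
    this.in = ByteBuffer.wrap(data);
  }

  // Slow path: read a single bit, LSB first.
  public boolean readBoolean() {
    if (bitOffset == 0) {
      currentByte = in.get();
    }
    boolean v = (currentByte & (1 << bitOffset)) != 0;
    bitOffset = (bitOffset + 1) & 7;
    return v;
  }

  public void readBooleans(int total, boolean[] out, int rowId) {
    int i = 0;
    // 1. Drain the partially consumed byte until the position is byte-aligned.
    while (i < total && bitOffset != 0) {
      out[rowId + i++] = readBoolean();
    }
    // 2. Aligned fast path: fetch one byte and unpack all 8 bits at once.
    for (; i + 8 <= total; i += 8) {
      byte b = in.get();
      for (int j = 0; j < 8; j++) {
        out[rowId + i + j] = ((b >>> j) & 1) != 0;
      }
    }
    // 3. Fewer than 8 values left: fall back to the bit-by-bit path.
    while (i < total) {
      out[rowId + i++] = readBoolean();
    }
  }
}
```

The fast path performs one buffer read and no per-value offset bookkeeping for every 8 values, which is where a gain of this kind would be expected to come from.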
   
   
   ### Why are the changes needed?
   With this PR, we observed up to a 100% gain in scan performance in the benchmarks below.
   #### JDK11 (Main Branch)
   ```
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   SQL Single BOOLEAN Column Scan:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)
   ---------------------------------------------------------------------------------------------------------------
   SQL Parquet Vectorized                                161            189          24         97.6          10.2

   Parquet Reader Single BOOLEAN Column Scan:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)
   ---------------------------------------------------------------------------------------------------------------
   ParquetReader Vectorized                              140            147           9        112.7           8.9
   ParquetReader Vectorized -> Row                        85             88           3        184.4           5.4
   ```
   #### JDK11 (This PR)
   ```
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   SQL Single BOOLEAN Column Scan:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)
   ---------------------------------------------------------------------------------------------------------------
   SQL Parquet Vectorized                                124            149          21        127.2           7.9

   Parquet Reader Single BOOLEAN Column Scan:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)
   ---------------------------------------------------------------------------------------------------------------
   ParquetReader Vectorized                              100            107          13        157.1           6.4
   ParquetReader Vectorized -> Row                        52             54           3        303.1           3.3
   ```
   
   #### JDK8 (Main Branch)
   ```
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   SQL Single BOOLEAN Column Scan:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)
   ---------------------------------------------------------------------------------------------------------------
   SQL Parquet Vectorized                                189            210          20         83.1          12.0

   Parquet Reader Single BOOLEAN Column Scan:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)
   ---------------------------------------------------------------------------------------------------------------
   ParquetReader Vectorized                              178            179           3         88.5          11.3
   ParquetReader Vectorized -> Row                       125            126           2        126.1           7.9
   ```
   #### JDK8 (This PR)
   ```
   Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
   SQL Single BOOLEAN Column Scan:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)
   ---------------------------------------------------------------------------------------------------------------
   SQL Parquet Vectorized                                144            167          12        109.2           9.2

   Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
   Parquet Reader Single BOOLEAN Column Scan:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)
   ---------------------------------------------------------------------------------------------------------------
   ParquetReader Vectorized                              119            125           8        131.9           7.6
   ParquetReader Vectorized -> Row                        60             63           2        260.2           3.8
   ```
   
   The benchmarks were run on GitHub Actions, and the results are attached to this PR:
   ```
   sql/core/benchmarks/DataSourceReadBenchmark-jdk11-results.txt
   sql/core/benchmarks/DataSourceReadBenchmark-results.txt
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Updated unit tests
   ```
   build/sbt "testOnly *ParquetEncodingSuite"
   build/sbt "testOnly *ColumnarBatchSuite"
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


