kazuyukitanimura opened a new pull request #34611:
URL: https://github.com/apache/spark/pull/34611
### What changes were proposed in this pull request?
This PR proposes to enable vectorized read for
`VectorizedPlainValuesReader.readBooleans`. Currently, `readBooleans` unpacks
encoded boolean values bit by bit:
```
public final void readBooleans(int total, WritableColumnVector c, int rowId) {
  // TODO: properly vectorize this
  for (int i = 0; i < total; i++) {
    c.putBoolean(rowId + i, readBoolean());
  }
}
```
The main idea is to unpack 8 bits (a byte) at once instead of bit by bit
whenever the alignment allows.
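To illustrate the idea, here is a minimal, self-contained sketch. It is not the code from this PR: the class `BooleanUnpackSketch`, its fields, and the use of a plain `boolean[]` in place of `WritableColumnVector` are all hypothetical. It assumes the Parquet PLAIN layout for booleans (bit-packed, least significant bit first). The reader first drains any bits left in a partially consumed byte, then unpacks whole bytes while at least 8 values remain, and finally reads the tail bit by bit.

```java
import java.util.Arrays;

// Hypothetical stand-in for a plain-encoded boolean reader; not Spark's class.
final class BooleanUnpackSketch {
  private final byte[] data;    // bit-packed booleans, least significant bit first
  private int byteOffset = 0;   // index of the next unread byte
  private int bitOffset = 0;    // next bit within currentByte; 0 means byte-aligned
  private byte currentByte = 0;

  BooleanUnpackSketch(byte[] data) {
    this.data = data;
  }

  // Baseline path: read a single bit, loading a new byte when aligned.
  boolean readBoolean() {
    if (bitOffset == 0) {
      currentByte = data[byteOffset++];
    }
    boolean value = (currentByte & (1 << bitOffset)) != 0;
    bitOffset = (bitOffset + 1) & 7;
    return value;
  }

  // Batched path: unpack a whole byte whenever the read position is aligned.
  void readBooleans(int total, boolean[] out, int rowId) {
    int i = 0;
    // 1. Drain the bits left over in a partially consumed byte.
    while (i < total && bitOffset != 0) {
      out[rowId + i++] = readBoolean();
    }
    // 2. Aligned: consume one byte (8 values) per iteration.
    while (total - i >= 8) {
      putByte(out, rowId + i, data[byteOffset++]);
      i += 8;
    }
    // 3. Fewer than 8 values remain: fall back to bit-by-bit reads.
    while (i < total) {
      out[rowId + i++] = readBoolean();
    }
  }

  // Writes the 8 bits of b into out[pos .. pos + 7], lowest bit first.
  private static void putByte(boolean[] out, int pos, byte b) {
    out[pos] = (b & 1) != 0;
    out[pos + 1] = (b & 2) != 0;
    out[pos + 2] = (b & 4) != 0;
    out[pos + 3] = (b & 8) != 0;
    out[pos + 4] = (b & 16) != 0;
    out[pos + 5] = (b & 32) != 0;
    out[pos + 6] = (b & 64) != 0;
    out[pos + 7] = (b & 128) != 0;
  }

  public static void main(String[] args) {
    byte[] packed = {(byte) 0b10110101, (byte) 0b00000011};
    boolean[] out = new boolean[16];
    new BooleanUnpackSketch(packed).readBooleans(16, out, 0);
    System.out.println(Arrays.toString(out));
  }
}
```

The aligned loop avoids the per-value alignment check and byte reload of the bit-by-bit path, which is presumably where most of the savings come from. The unaligned head and tail still go through `readBoolean()`, so the result matches the original loop for any split of reads.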
### Why are the changes needed?
With this PR, we observed up to a 100% gain in scan performance.
#### JDK11 (Main Branch)
```
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single BOOLEAN Column Scan:             Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)
--------------------------------------------------------------------------------------------------------------
SQL Parquet Vectorized                                161            189          24        97.6          10.2
Parquet Reader Single BOOLEAN Column Scan:  Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)
--------------------------------------------------------------------------------------------------------------
ParquetReader Vectorized                              140            147           9       112.7           8.9
ParquetReader Vectorized -> Row                        85             88           3       184.4           5.4
```
#### JDK11 (This PR)
```
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single BOOLEAN Column Scan:             Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)
--------------------------------------------------------------------------------------------------------------
SQL Parquet Vectorized                                124            149          21       127.2           7.9
Parquet Reader Single BOOLEAN Column Scan:  Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)
--------------------------------------------------------------------------------------------------------------
ParquetReader Vectorized                              100            107          13       157.1           6.4
ParquetReader Vectorized -> Row                        52             54           3       303.1           3.3
```
#### JDK8 (Main Branch)
```
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single BOOLEAN Column Scan:             Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)
--------------------------------------------------------------------------------------------------------------
SQL Parquet Vectorized                                189            210          20        83.1          12.0
Parquet Reader Single BOOLEAN Column Scan:  Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)
--------------------------------------------------------------------------------------------------------------
ParquetReader Vectorized                              178            179           3        88.5          11.3
ParquetReader Vectorized -> Row                       125            126           2       126.1           7.9
```
#### JDK8 (This PR)
```
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
SQL Single BOOLEAN Column Scan:             Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)
--------------------------------------------------------------------------------------------------------------
SQL Parquet Vectorized                                144            167          12       109.2           9.2
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Parquet Reader Single BOOLEAN Column Scan:  Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)
--------------------------------------------------------------------------------------------------------------
ParquetReader Vectorized                              119            125           8       131.9           7.6
ParquetReader Vectorized -> Row                        60             63           2       260.2           3.8
```
The benchmarks were run on GitHub Actions. The results are included in this
PR:
```
sql/core/benchmarks/DataSourceReadBenchmark-jdk11-results.txt
sql/core/benchmarks/DataSourceReadBenchmark-results.txt
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Updated unit tests
```
build/sbt "testOnly *ParquetEncodingSuite"
build/sbt "testOnly *ColumnarBatchSuite"
```