parthchandra commented on pull request #34471:
URL: https://github.com/apache/spark/pull/34471#issuecomment-972526301
> Looks much better now @parthchandra ! do you have any benchmark result to
share, i.e., comparing the vectorized reader versus non-vectorized reader from
`parquet-mr`? you can use `DataSourceReadBenchmark` for this.
I've added a couple of cases to the benchmark for this. Because Delta
encoding is more suited for sorted data, I also ran a modified version (not
checked in) on sorted data.
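To illustrate why sorted input suits Delta encoding, here is a rough Python sketch. It is not part of the PR and is not Parquet's actual `DELTA_BINARY_PACKED` layout: it ignores miniblocks and headers, and the function name `packed_bits_per_value` and the block size of 128 are assumptions for illustration. It only estimates how many bits each delta needs when a block stores `delta - min_delta` at the width of its largest entry.

```python
import random

def packed_bits_per_value(values, block_size=128):
    """Estimate bits per delta for a simplified delta + bit-packing scheme.

    Each block stores (delta - min_delta) using the bit width of its largest
    entry. Miniblocks and block headers are ignored (simplifying assumption).
    """
    deltas = [b - a for a, b in zip(values, values[1:])]
    total_bits = 0
    for start in range(0, len(deltas), block_size):
        block = deltas[start:start + block_size]
        min_delta = min(block)
        # Width needed for the largest adjusted delta in this block.
        width = max(d - min_delta for d in block).bit_length()
        total_bits += width * len(block)
    return total_bits / len(deltas)

random.seed(0)
unsorted_vals = [random.randrange(1 << 31) for _ in range(100_000)]
sorted_vals = sorted(unsorted_vals)

unsorted_bits = packed_bits_per_value(unsorted_vals)
sorted_bits = packed_bits_per_value(sorted_vals)
print(f"unsorted: {unsorted_bits:.1f} bits/value")
print(f"sorted:   {sorted_bits:.1f} bits/value")
```

Sorted input yields small, similar deltas, so each block packs into fewer bits; random input yields deltas that span nearly the full 32-bit range, so there is little to gain.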
TL;DR: For unsorted data, the Delta vectorized reader is about 2.2x slower
than the default (RLE) vectorized reader (232 ms vs. 106 ms best time). For
sorted data, the Delta vectorized reader is much faster than on unsorted data
(130 ms vs. 232 ms), but is still about 1.3x slower than the default (RLE)
vectorized reader (102 ms). On Delta-encoded data, the vectorized reader is
~6x (unsorted) and ~10x (sorted) faster than the Parquet-MR reader.
Here is the summary for INT data
UNSORTED
```
OpenJDK 64-Bit Server VM 11.0.6+10-LTS on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
SQL Single INT Column Scan:               Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------
SQL CSV                                           12625         12720        135        1.2        802.7      1.0X
SQL Json                                           6784          6844         85        2.3        431.3      1.9X
SQL Parquet Vectorized                              106           112          7      148.5          6.7    119.2X
SQL Parquet MR                                     1594          1601          9        9.9        101.4      7.9X
SQL ORC Vectorized                                  218           227          5       72.0         13.9     57.8X
SQL ORC MR                                         1289          1320         44       12.2         81.9      9.8X
SQL Parquet Vectorized (Delta Binary)               232           243         17       67.9         14.7     54.5X
SQL Parquet MR (Delta Binary)                      1462          1478         24       10.8         92.9      8.6X
```
SORTED
```
OpenJDK 64-Bit Server VM 11.0.6+10-LTS on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
SQL Single INT Column Scan:               Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------
SQL CSV                                           12135         12150         22        1.3        771.5      1.0X
SQL Json                                           6138          6177         54        2.6        390.3      2.0X
SQL Parquet Vectorized                              102           107          7      153.5          6.5    118.4X
SQL Parquet MR                                     1489          1496          9       10.6         94.7      8.1X
SQL ORC Vectorized                                  166           171          3       94.9         10.5     73.2X
SQL ORC MR                                         1184          1188          5       13.3         75.3     10.3X
SQL Parquet Vectorized (Delta Binary)               130           138          8      120.7          8.3     93.1X
SQL Parquet MR (Delta Binary)                      1287          1306         26       12.2         81.8      9.4X
```
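The headline ratios can be rechecked from the Best Time(ms) columns of the two tables above. The short Python sketch below does only that arithmetic; the dictionary keys are labels chosen here for illustration, not names from the benchmark.

```python
# Best Time (ms) per reader, copied from the UNSORTED and SORTED tables above.
best_ms = {
    "unsorted": {"vectorized_rle": 106, "mr_rle": 1594,
                 "vectorized_delta": 232, "mr_delta": 1462},
    "sorted": {"vectorized_rle": 102, "mr_rle": 1489,
               "vectorized_delta": 130, "mr_delta": 1287},
}

def ratios(times):
    """Return the two comparisons discussed in the TL;DR for one data set."""
    return {
        # How much faster the Delta vectorized reader is than Parquet-MR on Delta data.
        "delta_vectorized_vs_mr": times["mr_delta"] / times["vectorized_delta"],
        # How much slower the Delta vectorized reader is than the default (RLE) one.
        "delta_vs_rle_vectorized": times["vectorized_delta"] / times["vectorized_rle"],
    }

summary = {name: ratios(times) for name, times in best_ms.items()}
for name, r in summary.items():
    print(f"{name}: Delta vectorized is {r['delta_vectorized_vs_mr']:.1f}x faster than MR, "
          f"{r['delta_vs_rle_vectorized']:.1f}x slower than RLE vectorized")
```

These are single-machine best-of-run timings, so the ratios are indicative rather than exact.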