sadikovi commented on PR #37485:
URL: https://github.com/apache/spark/pull/37485#issuecomment-1212765192
I reran the benchmarks on a 4x larger dataset (I changed the size in
DataSourceReadBenchmark). The numbers are still very similar, with the patch
performing slightly better than the current code. I don't quite understand how
that is possible, unless the benchmark does not exercise the encoding.
### Before
```
OpenJDK 64-Bit Server VM 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 on Linux 5.4.0-1071-aws
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Parquet Reader Single INT Column Scan:         Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
ParquetReader Vectorized: DataPageV1                     672           707         45       93.6         10.7      1.0X
ParquetReader Vectorized: DataPageV2                     945          1012         95       66.6         15.0      0.7X
ParquetReader Vectorized -> Row: DataPageV1              383           432         28      164.4          6.1      1.8X
ParquetReader Vectorized -> Row: DataPageV2              670           678          8       93.9         10.6      1.0X

OpenJDK 64-Bit Server VM 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 on Linux 5.4.0-1071-aws
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Parquet Reader Single BIGINT Column Scan:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
ParquetReader Vectorized: DataPageV1                     931           935          4       67.6         14.8      1.0X
ParquetReader Vectorized: DataPageV2                    1475          1477          4       42.7         23.4      0.6X
ParquetReader Vectorized -> Row: DataPageV1              638           650         14       98.5         10.1      1.5X
ParquetReader Vectorized -> Row: DataPageV2             1172          1173          2       53.7         18.6      0.8X
```
### After
```
OpenJDK 64-Bit Server VM 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 on Linux 5.4.0-1071-aws
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Parquet Reader Single INT Column Scan:         Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
ParquetReader Vectorized: DataPageV1                     656           704         60       95.9         10.4      1.0X
ParquetReader Vectorized: DataPageV2                     888           898         12       70.9         14.1      0.7X
ParquetReader Vectorized -> Row: DataPageV1              393           435         24      160.2          6.2      1.7X
ParquetReader Vectorized -> Row: DataPageV2              667           681         12       94.3         10.6      1.0X

OpenJDK 64-Bit Server VM 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 on Linux 5.4.0-1071-aws
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Parquet Reader Single BIGINT Column Scan:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
ParquetReader Vectorized: DataPageV1                     935           953         16       67.3         14.9      1.0X
ParquetReader Vectorized: DataPageV2                    1437          1440          4       43.8         22.8      0.7X
ParquetReader Vectorized -> Row: DataPageV1              717           731         12       87.7         11.4      1.3X
ParquetReader Vectorized -> Row: DataPageV2             1176          1185         13       53.5         18.7      0.8X
```
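
For a quick side-by-side of the two runs, here is a minimal sketch that computes the signed percentage change in Best Time(ms) per case. The numbers are copied from the Before and After output in this comment; the dictionary layout and the `pct_change` helper are only illustrative and are not part of `DataSourceReadBenchmark`.

```python
# Best Time(ms) copied from the "Before" and "After" benchmark output in this comment.
# The data layout and helper name are illustrative assumptions, not Spark code.
BEFORE = {
    ("INT", "Vectorized: DataPageV1"): 672,
    ("INT", "Vectorized: DataPageV2"): 945,
    ("INT", "Vectorized -> Row: DataPageV1"): 383,
    ("INT", "Vectorized -> Row: DataPageV2"): 670,
    ("BIGINT", "Vectorized: DataPageV1"): 931,
    ("BIGINT", "Vectorized: DataPageV2"): 1475,
    ("BIGINT", "Vectorized -> Row: DataPageV1"): 638,
    ("BIGINT", "Vectorized -> Row: DataPageV2"): 1172,
}
AFTER = {
    ("INT", "Vectorized: DataPageV1"): 656,
    ("INT", "Vectorized: DataPageV2"): 888,
    ("INT", "Vectorized -> Row: DataPageV1"): 393,
    ("INT", "Vectorized -> Row: DataPageV2"): 667,
    ("BIGINT", "Vectorized: DataPageV1"): 935,
    ("BIGINT", "Vectorized: DataPageV2"): 1437,
    ("BIGINT", "Vectorized -> Row: DataPageV1"): 717,
    ("BIGINT", "Vectorized -> Row: DataPageV2"): 1176,
}


def pct_change(before_ms, after_ms):
    """Signed change in percent; a negative value means the 'after' run was faster."""
    return (after_ms - before_ms) / before_ms * 100.0


# One entry per benchmark case, keyed the same way as BEFORE and AFTER.
changes = {case: pct_change(BEFORE[case], AFTER[case]) for case in BEFORE}

for (column, reader), delta in changes.items():
    before_ms = BEFORE[(column, reader)]
    after_ms = AFTER[(column, reader)]
    print(f"{column:<7}{reader:<31}{before_ms:>5} -> {after_ms:>5} ms  {delta:+6.1f}%")
```

Keep in mind that the reported Stdev(ms) for several cases is of the same order as the difference in best time, so small deltas may well be run-to-run noise rather than an effect of the patch.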