LuciferYang commented on pull request #30663:
URL: https://github.com/apache/spark/pull/30663#issuecomment-748988593
Sorry for the delay, I've been busy with internal company meetings over the
past few days :(
One of the cases I tried is as follows:
1. Create a temp table from a Parquet or ORC table with 150 files, where each
file has 1000 columns.
2. Run SQL with a count on one column, like
`select count(columnX) FROM parquetTable`.
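The two steps above could be reproduced roughly as follows. This is only a sketch in PySpark, not the benchmark code that produced the numbers below: the helper names (`wide_column_names`, `count_query`, `run_case`), the column names `c0`, `c1`, ..., the default row count, and the output path are all assumptions made for illustration.

```python
def wide_column_names(num_columns):
    """Return hypothetical column names c0, c1, ... for a wide table."""
    return [f"c{i}" for i in range(num_columns)]


def count_query(table, column):
    """Build the single-column count query from step 2."""
    return f"SELECT count({column}) FROM {table}"


def run_case(spark, fmt, path, num_rows=1_000_000, num_columns=1000, num_files=150):
    """Write a wide table in `fmt` ("parquet" or "orc") and count one column.

    `spark` is an existing SparkSession. pyspark is imported inside the
    function so the two helpers above stay usable without it.
    The num_rows default is an assumption, not a value from the comment.
    """
    from pyspark.sql import functions as F

    # Every column is a copy of `id`; the actual values do not matter here.
    cols = [F.col("id").alias(name) for name in wide_column_names(num_columns)]
    df = spark.range(num_rows).select(*cols)

    # repartition(num_files) typically yields one output file per partition.
    df.repartition(num_files).write.mode("overwrite").format(fmt).save(path)

    # Step 1: expose the files as a temp table. Step 2: count one column.
    table = f"{fmt}Table"
    spark.read.format(fmt).load(path).createOrReplaceTempView(table)
    return spark.sql(count_query(table, f"c{num_columns // 2}")).collect()[0][0]
```

Calling `run_case(spark, "parquet", "/tmp/wide_parquet")` and `run_case(spark, "orc", "/tmp/wide_orc")` would cover both formats; the paths are placeholders.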
However, there was no significant performance difference when tested locally.
**Without this pr**
Running benchmark: Single Column Scan from 1000 columns, 150 files
Running case: SQL Parquet Vectorized
Stopped after 2 iterations, 3615 ms
Running case: SQL Parquet MR
Stopped after 2 iterations, 4285 ms
Running case: SQL ORC Vectorized
Stopped after 2 iterations, 3544 ms
Running case: SQL ORC MR
Stopped after 2 iterations, 3382 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
Single Column Scan from 1000 columns, 150 files:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------------------------------
SQL Parquet Vectorized                                     1773          1808         49        0.6       1691.2      1.0X
SQL Parquet MR                                             2048          2143        133        0.5       1953.5      0.9X
SQL ORC Vectorized                                         1708          1772         90        0.6       1629.1      1.0X
SQL ORC MR                                                 1665          1691         38        0.6       1587.7      1.1X
**With this pr**
Running benchmark: Single Column Scan from 1000 columns, 150 files
Running case: SQL Parquet Vectorized
Stopped after 2 iterations, 3595 ms
Running case: SQL Parquet MR
Stopped after 2 iterations, 3462 ms
Running case: SQL ORC Vectorized
Stopped after 2 iterations, 2568 ms
Running case: SQL ORC MR
Stopped after 2 iterations, 3314 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
Single Column Scan from 1000 columns, 150 files:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------------------------------
SQL Parquet Vectorized                                     1788          1798         14        0.6       1705.4      1.0X
SQL Parquet MR                                             1672          1731         84        0.6       1594.4      1.1X
SQL ORC Vectorized                                         1026          1284        366        1.0        978.2      1.7X
SQL ORC MR                                                 1521          1657        192        0.7       1450.9      1.2X
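
To compare the two runs directly, the Best Time(ms) values from both tables can be divided case by case. The snippet below is a small sketch of that arithmetic; the `speedups` helper and the dictionary names are introduced here for illustration only, and the numbers are copied from the tables above.

```python
# Best Time(ms) per case, copied from the two benchmark tables above.
without_pr = {
    "SQL Parquet Vectorized": 1773,
    "SQL Parquet MR": 2048,
    "SQL ORC Vectorized": 1708,
    "SQL ORC MR": 1665,
}
with_pr = {
    "SQL Parquet Vectorized": 1788,
    "SQL Parquet MR": 1672,
    "SQL ORC Vectorized": 1026,
    "SQL ORC MR": 1521,
}


def speedups(before, after):
    """Return before/after best-time ratios; values above 1 mean `after` was faster."""
    return {case: round(before[case] / after[case], 2) for case in before}


for case, ratio in speedups(without_pr, with_pr).items():
    print(f"{case}: {ratio}x")
```

The largest ratio is for SQL ORC Vectorized, but that case also shows a Stdev of 366 ms in the run with this PR, and every case stopped after only 2 iterations, so these ratios should be read with caution.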