[GitHub] [spark] LuciferYang edited a comment on pull request #30663: [SPARK-33700][SQL] Avoid file meta reading when enableFilterPushDown is true and filters is empty for Parquet and Orc

GitBox Mon, 21 Dec 2020 06:02:24 -0800


LuciferYang edited a comment on pull request #30663:
URL: https://github.com/apache/spark/pull/30663#issuecomment-748988593



   Sorry, I've been busy with internal company meetings for the past few days :(
   
   One of the cases I tried is as follows:
   
   1. Create a temp table backed by Parquet or ORC with 150 files, each file having 1000 columns
   2. Run a SQL count on a single column, e.g. `select count(columnX) FROM parquetTable`
   
   However, there was no significant performance difference when tested locally.
   
   Without this PR:
   
   Running benchmark: Single Column Scan from 1000 columns, 150 files
     Running case: SQL Parquet Vectorized
     Stopped after 2 iterations, 3615 ms
     Running case: SQL Parquet MR
     Stopped after 2 iterations, 4285 ms
     Running case: SQL ORC Vectorized
     Stopped after 2 iterations, 3544 ms
     Running case: SQL ORC MR
     Stopped after 2 iterations, 3382 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
   Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
   Single Column Scan from 1000 columns, 150 files:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------------
   SQL Parquet Vectorized                                    1773           1808          49          0.6        1691.2       1.0X
   SQL Parquet MR                                            2048           2143         133          0.5        1953.5       0.9X
   SQL ORC Vectorized                                        1708           1772          90          0.6        1629.1       1.0X
   SQL ORC MR                                                1665           1691          38          0.6        1587.7       1.1X
   
   With this PR:
   
   Running benchmark: Single Column Scan from 1000 columns, 150 files
     Running case: SQL Parquet Vectorized
     Stopped after 2 iterations, 3595 ms
     Running case: SQL Parquet MR
     Stopped after 2 iterations, 3462 ms
     Running case: SQL ORC Vectorized
     Stopped after 2 iterations, 2568 ms
     Running case: SQL ORC MR
     Stopped after 2 iterations, 3314 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.15.7
   Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
   Single Column Scan from 1000 columns, 150 files:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------------
   SQL Parquet Vectorized                                    1788           1798          14          0.6        1705.4       1.0X
   SQL Parquet MR                                            1672           1731          84          0.6        1594.4       1.1X
   SQL ORC Vectorized                                        1026           1284         366          1.0         978.2       1.7X
   SQL ORC MR                                                1521           1657         192          0.7        1450.9       1.2X
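
The `Relative` column in each table is computed against the first row of the same run, so it does not compare the two runs with each other. As a rough cross-check, the small Python sketch below divides the "without" times by the "with" times for each case. The numbers are copied from the two tables; the dictionary and function names are made up for illustration and are not part of the benchmark code.

```python
# (best_ms, avg_ms) per case, copied from the two benchmark tables above.
without_pr = {
    "SQL Parquet Vectorized": (1773, 1808),
    "SQL Parquet MR": (2048, 2143),
    "SQL ORC Vectorized": (1708, 1772),
    "SQL ORC MR": (1665, 1691),
}
with_pr = {
    "SQL Parquet Vectorized": (1788, 1798),
    "SQL Parquet MR": (1672, 1731),
    "SQL ORC Vectorized": (1026, 1284),
    "SQL ORC MR": (1521, 1657),
}


def speedup(before_ms, after_ms):
    """Ratio of times; a value above 1 means the run with the PR was faster."""
    return before_ms / after_ms


# Cross-run speedup per case, for best time and for average time.
speedups = {
    case: (
        speedup(without_pr[case][0], with_pr[case][0]),
        speedup(without_pr[case][1], with_pr[case][1]),
    )
    for case in without_pr
}

for case, (best, avg) in speedups.items():
    print(f"{case:<24} best {best:.2f}x   avg {avg:.2f}x")
```

Each case stopped after only 2 iterations, and the ORC Vectorized run with the PR has a standard deviation of 366 ms, so any ratio computed this way should be read as indicative at most.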
   
   


