parthchandra commented on pull request #34471:
URL: https://github.com/apache/spark/pull/34471#issuecomment-972526301


   > Looks much better now @parthchandra! Do you have any benchmark results to share, i.e., comparing the vectorized reader with the non-vectorized reader from `parquet-mr`? You can use `DataSourceReadBenchmark` for this.
   
   I've added a couple of cases to the benchmark for this. Because Delta 
encoding is better suited to sorted data, I also ran a modified version (not 
checked in) on sorted data. 
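
To illustrate why sorted data suits Delta encoding, here is a minimal sketch in Python. It is not Parquet's or Spark's implementation: it only mimics the core idea of `DELTA_BINARY_PACKED` (store consecutive differences relative to the block's minimum delta, then bit-pack them) for a single block. The helper name `delta_bit_width` and the block size of 129 values are assumptions made for this example.

```python
import random


def delta_bit_width(values):
    """Bits per delta needed to bit-pack one block (simplified, hypothetical helper)."""
    # Consecutive differences between neighbouring values.
    deltas = [b - a for a, b in zip(values, values[1:])]
    # Store each delta relative to the block minimum so every stored value is >= 0.
    min_delta = min(deltas)
    adjusted = [d - min_delta for d in deltas]
    # The widest adjusted delta decides the bit width for the whole block.
    return max(adjusted).bit_length()


random.seed(0)
unsorted_vals = [random.randrange(1 << 31) for _ in range(129)]  # 129 values -> 128 deltas
sorted_vals = sorted(unsorted_vals)

print("unsorted:", delta_bit_width(unsorted_vals), "bits per delta")
print("sorted:  ", delta_bit_width(sorted_vals), "bits per delta")
```

Sorted input keeps the deltas small and non-negative, so each block packs into fewer bits and there is less data to unpack while reading.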
   
   TL;DR: On unsorted data, the Delta vectorized reader is about 2.2x slower 
than the default (RLE) vectorized reader (232 ms vs. 106 ms best time). On 
sorted data, the Delta vectorized reader is much faster than it is on unsorted 
data (130 ms vs. 232 ms), but is still slightly slower than the default (RLE) 
vectorized reader (102 ms). For both unsorted/sorted data the vectorized reader 
is ~7x/~10x faster than the Parquet-MR reader.
   
   Here is the summary for INT data
   
   UNSORTED
   ```
   OpenJDK 64-Bit Server VM 11.0.6+10-LTS on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   
   SQL Single INT Column Scan:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   SQL CSV                                           12625          12720         135          1.2         802.7       1.0X
   SQL Json                                           6784           6844          85          2.3         431.3       1.9X
   SQL Parquet Vectorized                              106            112           7        148.5           6.7     119.2X
   SQL Parquet MR                                     1594           1601           9          9.9         101.4       7.9X
   SQL ORC Vectorized                                  218            227           5         72.0          13.9      57.8X
   SQL ORC MR                                         1289           1320          44         12.2          81.9       9.8X
   SQL Parquet Vectorized (Delta Binary)               232            243          17         67.9          14.7      54.5X
   SQL Parquet MR (Delta Binary)                      1462           1478          24         10.8          92.9       8.6X
   ```
   
   SORTED 
   ```
   OpenJDK 64-Bit Server VM 11.0.6+10-LTS on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   SQL Single INT Column Scan:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   SQL CSV                                           12135          12150          22          1.3         771.5       1.0X
   SQL Json                                           6138           6177          54          2.6         390.3       2.0X
   SQL Parquet Vectorized                              102            107           7        153.5           6.5     118.4X
   SQL Parquet MR                                     1489           1496           9         10.6          94.7       8.1X
   SQL ORC Vectorized                                  166            171           3         94.9          10.5      73.2X
   SQL ORC MR                                         1184           1188           5         13.3          75.3      10.3X
   SQL Parquet Vectorized (Delta Binary)               130            138           8        120.7           8.3      93.1X
   SQL Parquet MR (Delta Binary)                      1287           1306          26         12.2          81.8       9.4X
   ```
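
For anyone who wants to recheck the comparisons, the ratios can be recomputed from the Best Time column of the two tables above. This is a small Python sketch; the dictionary keys and the `summarize` helper are names made up for this example, and pairing the Delta vectorized rows with the Delta MR rows is an assumption about which rows to compare.

```python
# Best Time (ms) copied from the UNSORTED and SORTED tables above.
best_ms = {
    "unsorted": {"vectorized": 106, "delta_vectorized": 232, "delta_mr": 1462},
    "sorted": {"vectorized": 102, "delta_vectorized": 130, "delta_mr": 1287},
}


def summarize(times):
    """Return (slowdown vs. default vectorized, speedup vs. Delta MR) for one table."""
    slowdown = times["delta_vectorized"] / times["vectorized"]
    speedup = times["delta_mr"] / times["delta_vectorized"]
    return round(slowdown, 1), round(speedup, 1)


for name, times in best_ms.items():
    slowdown, speedup = summarize(times)
    print(f"{name}: Delta vectorized is {slowdown}x slower than default vectorized, "
          f"{speedup}x faster than Delta MR")
```

Using Avg Time instead of Best Time shifts these ratios slightly, so rounded figures quoted elsewhere may differ by a small amount.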

