sunchao commented on pull request #32753:
URL: https://github.com/apache/spark/pull/32753#issuecomment-867163244


   @viirya I ran the benchmark added by @lxian in #31998, and here are the 
numbers:
   ```
   [info] Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
   [info] simple filters:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] Parquet Vectorized                                  726            740          15         21.7          46.1       1.0X
   [info] Parquet Vectorized (columnIndex)                     30             34           4        533.1           1.9      24.6X
   [info] Running benchmark: range filters
   [info]   Running case: Parquet Vectorized
   [info]   Stopped after 5 iterations, 4215 ms
   [info]   Running case: Parquet Vectorized (columnIndex)
   [info]   Stopped after 11 iterations, 2055 ms
   [info] OpenJDK 64-Bit Server VM 1.8.0_282-b08 on Mac OS X 10.16
   [info] Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
   [info] range filters:                            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] Parquet Vectorized                                  800            843          38         19.7          50.8       1.0X
   [info] Parquet Vectorized (columnIndex)                    165            187          34         95.4          10.5       4.9X
   [info] Running benchmark: multi range filters
   [info]   Running case: Parquet Vectorized
   [info]   Stopped after 5 iterations, 4466 ms
   [info]   Running case: Parquet Vectorized (columnIndex)
   [info]   Stopped after 7 iterations, 2311 ms
   [info] OpenJDK 64-Bit Server VM 1.8.0_282-b08 on Mac OS X 10.16
   [info] Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
   [info] multi range filters:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] Parquet Vectorized                                  822            893          85         19.1          52.3       1.0X
   [info] Parquet Vectorized (columnIndex)                    305            330          16         51.5          19.4       2.7X
   ```
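
As a rough reading aid for the tables above, the sketch below recomputes the `Relative` column from the `Best Time(ms)` values. It assumes `Relative` is the baseline's best time divided by each case's best time; that definition, and the names `BEST_TIME_MS`, `speedup`, and `summarize`, are illustrative assumptions and not taken from the benchmark code. Because the inputs are the rounded millisecond figures as printed, the ratios come out close to, but not always identical to, the printed values (for example about 24.2X instead of 24.6X for simple filters).

```python
# Best Time(ms) values copied from the three tables above, as
# (baseline "Parquet Vectorized", "Parquet Vectorized (columnIndex)").
BEST_TIME_MS = {
    "simple filters": (726, 30),
    "range filters": (800, 165),
    "multi range filters": (822, 305),
}


def speedup(baseline_ms, candidate_ms):
    """Assumed definition of Relative: baseline best time / candidate best time."""
    return baseline_ms / candidate_ms


def summarize(best_times):
    """Return the speedup per benchmark, rounded to one decimal like the tables."""
    return {
        name: round(speedup(base, cand), 1)
        for name, (base, cand) in best_times.items()
    }


if __name__ == "__main__":
    for name, ratio in summarize(BEST_TIME_MS).items():
        print(f"{name}: {ratio}X")
```

The ordering is the same as in the tables: the column index helps most for the simple filters and least for the multi range filters.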
   
   I also ran `DataSourceReadBenchmark` with and without the PR and don't see 
much difference in Parquet vectorized read performance (although, for some 
reason, CSV read performance is quite different). The results are 
[here](https://gist.github.com/sunchao/e00947713dae790c6761ea860637d811). 




