sunchao commented on pull request #32753: URL: https://github.com/apache/spark/pull/32753#issuecomment-867163244
@viirya I ran the benchmark added by @lxian in #31998 and here're the numbers:

```
[info] Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
[info] simple filters:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Parquet Vectorized                                  726            740          15         21.7          46.1       1.0X
[info] Parquet Vectorized (columnIndex)                     30             34           4        533.1           1.9      24.6X

[info] Running benchmark: range filters
[info]   Running case: Parquet Vectorized
[info]   Stopped after 5 iterations, 4215 ms
[info]   Running case: Parquet Vectorized (columnIndex)
[info]   Stopped after 11 iterations, 2055 ms
[info] OpenJDK 64-Bit Server VM 1.8.0_282-b08 on Mac OS X 10.16
[info] Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
[info] range filters:                            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Parquet Vectorized                                  800            843          38         19.7          50.8       1.0X
[info] Parquet Vectorized (columnIndex)                    165            187          34         95.4          10.5       4.9X

[info] Running benchmark: multi range filters
[info]   Running case: Parquet Vectorized
[info]   Stopped after 5 iterations, 4466 ms
[info]   Running case: Parquet Vectorized (columnIndex)
[info]   Stopped after 7 iterations, 2311 ms
[info] OpenJDK 64-Bit Server VM 1.8.0_282-b08 on Mac OS X 10.16
[info] Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
[info] multi range filters:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Parquet Vectorized                                  822            893          85         19.1          52.3       1.0X
[info] Parquet Vectorized (columnIndex)                    305            330          16         51.5          19.4       2.7X
```

I also ran `DataSourceReadBenchmark` with and without the PR, and don't see much difference w.r.t Parquet vectorized read performance (although for some reason CSV read performance
is quite different). The result is [here](https://gist.github.com/sunchao/e00947713dae790c6761ea860637d811).
