c21 commented on pull request #31958: URL: https://github.com/apache/spark/pull/31958#issuecomment-810970640
> For orc files without nested schema, do we observe perf regression after this PR?

@cloud-fan - in theory, the difference for a non-nested schema is as follows.

Before this PR:

```
OrcColumnVector -> org.apache.spark.sql.vectorized.ColumnVector
```

After this PR:

```
OrcAtomicColumnVector -> OrcColumnVector -> org.apache.spark.sql.vectorized.ColumnVector
```

The only overhead introduced here is one more layer in the class hierarchy, which might add some cost for virtual function calls and class loading.

I ran all ORC reader benchmarks, [OrcReadBenchmark-results.txt](https://github.com/apache/spark/blob/master/sql/hive/benchmarks/OrcReadBenchmark-results.txt) and [DataSourceReadBenchmark-results.txt](https://github.com/apache/spark/blob/master/sql/core/benchmarks/DataSourceReadBenchmark-results.txt), on the same type of AWS EC2 machine (Java 8 here). Comparing the ORC vectorized reader with the other ORC readers, I do not see a regression.

Results:

- [OrcReadBenchmark-results.txt](https://gist.github.com/c21/978bd38abb781af83ec19cec80000759)
- [DataSourceReadBenchmark-results.txt](https://gist.github.com/c21/62391f8cfdaf4f15acac0ee168e3b86b)

Machine:

```
Amazon AWS EC2
type: r3.xlarge
region: us-west-2 (Oregon)
OS: Linux
```
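To make the "one more layer" point concrete, here is a minimal sketch of the two class layouts described in the comment. Every class name below (`SketchColumnVector`, `SketchFlatVector`, `SketchOrcVector`, `SketchAtomicVector`, `LayeringSketch`) is a hypothetical stand-in invented for illustration; none of this is Spark's actual code, and the real classes carry many more methods.

```java
// Hypothetical, simplified stand-ins for illustration only; not Spark's classes.
class LayeringSketch {
    // Reads every row through the base type, the way a generic reader loop would.
    static long sum(SketchColumnVector vector, int rowCount) {
        long total = 0;
        for (int row = 0; row < rowCount; row++) {
            total += vector.getLong(row); // virtual call in both layouts
        }
        return total;
    }

    public static void main(String[] args) {
        long[] data = {1L, 2L, 3L, 4L};
        System.out.println(sum(new SketchFlatVector(data), data.length));
        System.out.println(sum(new SketchAtomicVector(data), data.length));
    }
}

// Plays the role of the common base class that readers program against.
abstract class SketchColumnVector {
    abstract long getLong(int rowId);
}

// "Before" layout: one concrete class directly under the base class.
final class SketchFlatVector extends SketchColumnVector {
    private final long[] data;

    SketchFlatVector(long[] data) {
        this.data = data;
    }

    @Override
    long getLong(int rowId) {
        return data[rowId];
    }
}

// "After" layout: an intermediate abstract layer shared by all vector kinds.
abstract class SketchOrcVector extends SketchColumnVector {
}

// "After" layout: the concrete class for non-nested (atomic) columns.
final class SketchAtomicVector extends SketchOrcVector {
    private final long[] data;

    SketchAtomicVector(long[] data) {
        this.data = data;
    }

    @Override
    long getLong(int rowId) {
        return data[rowId];
    }
}
```

In this sketch both layouts reach `getLong` through a single overridden method, so the extra abstract class adds no per-row work of its own. Whether that also holds for the real reader depends on the JVM and the actual call sites, which is why the benchmark results are the evidence to rely on.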
