[GitHub] [spark] c21 commented on pull request #31958: [SPARK-34862][SQL] Support nested column in ORC vectorized reader

GitBox Wed, 31 Mar 2021 03:51:17 -0700


c21 commented on pull request #31958:
URL: https://github.com/apache/spark/pull/31958#issuecomment-810970640



   > For orc files without nested schema, do we observe perf regression after this PR?
   
   @cloud-fan - in theory, here is the difference for a non-nested schema:
   
   Before this PR:
   
   ```
   OrcColumnVector -> org.apache.spark.sql.vectorized.ColumnVector
   ```
   
   After this PR:
   
   ```
   OrcAtomicColumnVector -> OrcColumnVector -> org.apache.spark.sql.vectorized.ColumnVector
   ```
   
   The only overhead introduced here is one more layer in the class hierarchy, which might add some overhead for virtual function calls and class loading.
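
To make the extra layer concrete, here is a minimal Java sketch. The classes `BaseVector`, `OrcBaseVector`, `OrcAtomicVector`, and `HierarchySketch` are hypothetical, simplified stand-ins for `ColumnVector`, `OrcColumnVector`, and `OrcAtomicColumnVector`; they are not Spark's actual code, and the `batchSize` field is only an assumed example of shared state held by the middle layer.

```java
// Hypothetical stand-in for org.apache.spark.sql.vectorized.ColumnVector.
abstract class BaseVector {
    abstract int getInt(int rowId);
}

// Hypothetical stand-in for the new intermediate layer (OrcColumnVector after the PR).
// It only holds state shared by all ORC vectors; it does not implement getInt.
abstract class OrcBaseVector extends BaseVector {
    protected int batchSize;

    void setBatchSize(int batchSize) {
        this.batchSize = batchSize;
    }
}

// Hypothetical stand-in for OrcAtomicColumnVector: the leaf for non-nested types.
final class OrcAtomicVector extends OrcBaseVector {
    private final int[] data;

    OrcAtomicVector(int[] data) {
        this.data = data;
    }

    @Override
    int getInt(int rowId) {
        return data[rowId];
    }
}

class HierarchySketch {
    // The caller only sees the top-level type, so each getInt call is a
    // virtual call no matter how many layers sit between base and leaf.
    static long sum(BaseVector vector, int numRows) {
        long total = 0;
        for (int i = 0; i < numRows; i++) {
            total += vector.getInt(i);
        }
        return total;
    }

    public static void main(String[] args) {
        BaseVector vector = new OrcAtomicVector(new int[] {1, 2, 3});
        System.out.println(sum(vector, 3)); // prints 6
    }
}
```

In this sketch the read path still goes through a single virtual `getInt` call per value, and the intermediate class adds one more class to load, which matches the kind of overhead described above. Whether that is measurable in practice is what the benchmarks are meant to check.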
   
   I ran all ORC reader benchmarks on the same type of AWS EC2 machine: [OrcReadBenchmark-results.txt](https://github.com/apache/spark/blob/master/sql/hive/benchmarks/OrcReadBenchmark-results.txt) and [DataSourceReadBenchmark-results.txt](https://github.com/apache/spark/blob/master/sql/core/benchmarks/DataSourceReadBenchmark-results.txt) (Java 8 here). I do not see a regression for the ORC vectorized reader relative to the other ORC readers.
   
   Results:
   
   
[OrcReadBenchmark-results.txt](https://gist.github.com/c21/978bd38abb781af83ec19cec80000759)
   
[DataSourceReadBenchmark-results.txt](https://gist.github.com/c21/62391f8cfdaf4f15acac0ee168e3b86b)
   
   Machine:
   
   ```
   Amazon AWS EC2
   type: r3.xlarge
   region: us-west-2 (Oregon)
   OS: Linux
   ```


