shardulm94 opened a new pull request #826: Fix Spark ORC Reader creating new RowWriter for every row
URL: https://github.com/apache/incubator-iceberg/pull/826

The Spark ORC reader was creating a new RowWriter for every row, leading to sluggish performance. RowWriter is meant to be reused, and the reader code was already performing the necessary reset operations after every row, so the only change required was to reuse the RowWriter object.

Tested with a local JMH test that I have written for ORC. The benchmarks are essentially copies of the corresponding JMH tests for Parquet. I will send a separate PR for the JMH tests, since they require ORC writers to be enabled.

The Iceberg read paths (readIceberg and readWithProjectionIceberg) show the improvement; for example, readIceberg on the flat dataset drops from 128.416 s/op to 8.576 s/op.

Without Fix:
```
Benchmark                                                                          Mode  Cnt    Score   Error  Units
IcebergSourceFlatORCDataReadBenchmark.readFileSourceNonVectorized                    ss    5   10.805 ± 6.819  s/op
IcebergSourceFlatORCDataReadBenchmark.readFileSourceVectorized                       ss    5    4.185 ± 0.713  s/op
IcebergSourceFlatORCDataReadBenchmark.readIceberg                                    ss    5  128.416 ± 5.686  s/op
IcebergSourceFlatORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized      ss    5    2.256 ± 0.222  s/op
IcebergSourceFlatORCDataReadBenchmark.readWithProjectionFileSourceVectorized         ss    5    0.594 ± 0.132  s/op
IcebergSourceFlatORCDataReadBenchmark.readWithProjectionIceberg                      ss    5  116.871 ± 7.665  s/op
IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5   13.563 ± 1.074  s/op
IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5   13.631 ± 1.244  s/op
IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5  118.870 ± 1.925  s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5   16.093 ± 1.089  s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5   16.923 ± 4.898  s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5  118.069 ± 2.229  s/op
```

With Fix:
```
Benchmark                                                                          Mode  Cnt    Score   Error  Units
IcebergSourceFlatORCDataReadBenchmark.readFileSourceNonVectorized                    ss    5    9.220 ± 0.828  s/op
IcebergSourceFlatORCDataReadBenchmark.readFileSourceVectorized                       ss    5    4.015 ± 0.502  s/op
IcebergSourceFlatORCDataReadBenchmark.readIceberg                                    ss    5    8.576 ± 1.504  s/op
IcebergSourceFlatORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized      ss    5    2.292 ± 0.391  s/op
IcebergSourceFlatORCDataReadBenchmark.readWithProjectionFileSourceVectorized         ss    5    0.664 ± 0.138  s/op
IcebergSourceFlatORCDataReadBenchmark.readWithProjectionIceberg                      ss    5    1.245 ± 0.848  s/op
IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5   15.806 ± 1.035  s/op
IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5   17.490 ± 0.926  s/op
IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5    3.238 ± 0.821  s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5   22.043 ± 9.899  s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5   17.451 ± 9.970  s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5    2.885 ± 0.764  s/op
```
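
For readers unfamiliar with the pattern, here is a minimal sketch of the difference between allocating a writer per row and reusing one with a reset. It is not the code from this PR: the `RowWriter` class, its `reset`/`write`/`checksum` methods, and the two `read...` helpers are hypothetical stand-ins, assumed only for illustration, and the real writer and reader classes are not shown.

```java
import java.util.Arrays;

public class RowWriterReuseSketch {

  /** Hypothetical stand-in for a row writer that owns a reusable buffer. */
  static final class RowWriter {
    static int instancesCreated = 0;       // counts constructor calls for the demo
    private final long[] buffer;

    RowWriter(int numFields) {
      this.buffer = new long[numFields];   // allocation paid on every construction
      instancesCreated++;
    }

    /** Clears state left over from the previous row. */
    void reset() {
      Arrays.fill(buffer, 0L);
    }

    void write(int ordinal, long value) {
      buffer[ordinal] = value;
    }

    /** Stands in for handing the finished row to a consumer. */
    long checksum() {
      long sum = 0;
      for (long v : buffer) {
        sum += v;
      }
      return sum;
    }
  }

  /** Slow shape: a fresh writer (and buffer) is created for every row. */
  static long readAllocatingPerRow(long[][] rows, int numFields) {
    long total = 0;
    for (long[] row : rows) {
      RowWriter writer = new RowWriter(numFields);
      writer.reset();
      for (int i = 0; i < numFields; i++) {
        writer.write(i, row[i]);
      }
      total += writer.checksum();
    }
    return total;
  }

  /** Fixed shape: one writer is created up front and reset before each row. */
  static long readReusingWriter(long[][] rows, int numFields) {
    RowWriter writer = new RowWriter(numFields);
    long total = 0;
    for (long[] row : rows) {
      writer.reset();
      for (int i = 0; i < numFields; i++) {
        writer.write(i, row[i]);
      }
      total += writer.checksum();
    }
    return total;
  }

  public static void main(String[] args) {
    long[][] rows = {{1, 2}, {3, 4}, {5, 6}};

    RowWriter.instancesCreated = 0;
    long perRowTotal = readAllocatingPerRow(rows, 2);
    int perRowWriters = RowWriter.instancesCreated;

    RowWriter.instancesCreated = 0;
    long reusedTotal = readReusingWriter(rows, 2);
    int reusedWriters = RowWriter.instancesCreated;

    // Both loops produce the same result; only the number of writers differs.
    System.out.println(perRowTotal + " " + reusedTotal
        + " writers: " + perRowWriters + " vs " + reusedWriters);
  }
}
```

Both loops already call `reset()` before writing each row, which mirrors the situation described above: once the reset is in place, hoisting the constructor out of the loop is the whole change.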