aokolnychyi opened a new pull request, #6550: URL: https://github.com/apache/iceberg/pull/6550
This PR adds logic to automatically set Arrow properties for read performance. Unless these properties are set, our read path can be up to 2x slower than the built-in read path in Spark.

I verified this patch by removing all explicit settings and running the benchmarks with and without the properties set.

Benchmark results without setting Arrow properties:

```
Benchmark                                                              Mode  Cnt  Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k      ss    5  1.321 ± 0.029   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k        ss    5  1.064 ± 0.162   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k    ss    5  2.187 ± 0.031   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k      ss    5  1.304 ± 0.287   s/op
```

Benchmark results with setting Arrow properties automatically:

```
Benchmark                                                              Mode  Cnt  Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k      ss    5  0.927 ± 0.028   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k        ss    5  1.035 ± 0.070   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k    ss    5  1.306 ± 0.029   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k      ss    5  1.369 ± 0.114   s/op
```
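For readers who want to see what "setting Arrow properties automatically" could look like, here is a minimal sketch. It is not the code from this PR. The description above does not name the properties, so the two keys below (`arrow.enable_unsafe_memory_access` and `arrow.enable_null_check_for_get`, both Arrow Java system properties) are an assumption, and the class name `ArrowReadDefaults` and its methods are hypothetical. The sketch only applies a default when the user has not configured the property, so explicit settings still win.

```java
public final class ArrowReadDefaults {
  // Assumption: these are the Arrow Java flags relevant to read performance.
  // The PR description does not name them.
  public static final String UNSAFE_MEMORY_ACCESS = "arrow.enable_unsafe_memory_access";
  public static final String NULL_CHECK_FOR_GET = "arrow.enable_null_check_for_get";

  private ArrowReadDefaults() {}

  /** Sets the property only if it is not already configured; returns true if it was set. */
  public static boolean setIfAbsent(String key, String value) {
    if (System.getProperty(key) != null) {
      return false; // respect an explicit user setting
    }
    System.setProperty(key, value);
    return true;
  }

  /** Applies the assumed performance defaults: skip bounds checks and null checks on get. */
  public static void apply() {
    setIfAbsent(UNSAFE_MEMORY_ACCESS, "true");
    setIfAbsent(NULL_CHECK_FOR_GET, "false");
  }

  public static void main(String[] args) {
    apply();
    System.out.println(UNSAFE_MEMORY_ACCESS + "=" + System.getProperty(UNSAFE_MEMORY_ACCESS));
    System.out.println(NULL_CHECK_FOR_GET + "=" + System.getProperty(NULL_CHECK_FOR_GET));
  }
}
```

Arrow reads these flags in static initializers, so a helper like this would have to run before any Arrow vector classes are loaded. Calling it from a static block in the class that first touches Arrow is one way to arrange that.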
