[GitHub] [iceberg] aokolnychyi opened a new pull request, #6550: Spark 3.3: Automatically set Arrow properties for read performance

GitBox Mon, 09 Jan 2023 14:04:05 -0800


aokolnychyi opened a new pull request, #6550:
URL: https://github.com/apache/iceberg/pull/6550


   This PR adds logic to automatically set Arrow properties for read 
performance. Unless these properties are set, our read path can be up to 2x 
slower than the built-in read path in Spark.
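
For context, here is a minimal sketch of what setting such properties automatically could look like. The description above does not name the properties. The two keys below are Arrow Java system properties that are assumed here to be the relevant ones, and the class and method names are hypothetical, not taken from the patch. A value is only written when the user has not set one, so explicit settings still win.

```java
public class ArrowReadDefaults {
  // Arrow Java system properties (assumed to be the ones this change targets).
  static final String UNSAFE_MEMORY_ACCESS = "arrow.enable_unsafe_memory_access";
  static final String NULL_CHECK_FOR_GET = "arrow.enable_null_check_for_get";

  private ArrowReadDefaults() {}

  // Sets a system property only if the user has not already provided a value.
  static void setIfAbsent(String key, String value) {
    if (System.getProperty(key) == null) {
      System.setProperty(key, value);
    }
  }

  // Hypothetical entry point: apply read-friendly defaults before any Arrow
  // class is loaded, since Arrow is understood to read these flags once
  // during class initialization.
  public static void apply() {
    setIfAbsent(UNSAFE_MEMORY_ACCESS, "true");
    setIfAbsent(NULL_CHECK_FOR_GET, "false");
  }
}
```

Calling `ArrowReadDefaults.apply()` early in the read path would leave a user-supplied `-Darrow.enable_null_check_for_get=true` untouched while filling in defaults for anything unset.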
   
   I verified this patch by removing all explicit settings and running 
benchmarks with and without setting properties.
   
   Benchmark results without setting Arrow properties:
   ```
   Benchmark                                                              Mode  Cnt  Score   Error  Units
   VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k      ss    5  1.321 ± 0.029   s/op
   VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k        ss    5  1.064 ± 0.162   s/op
   VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k    ss    5  2.187 ± 0.031   s/op
   VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k      ss    5  1.304 ± 0.287   s/op
   ```
   
   Benchmark results with setting Arrow properties automatically:
   ```
   Benchmark                                                              Mode  Cnt  Score   Error  Units
   VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k      ss    5  0.927 ± 0.028   s/op
   VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k        ss    5  1.035 ± 0.070   s/op
   VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k    ss    5  1.306 ± 0.029   s/op
   VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k      ss    5  1.369 ± 0.114   s/op
   ```
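
To compare the two runs directly, the sketch below computes the ratio of Iceberg time to Spark time from the scores reported above. The class and method names are hypothetical. The ratios use the mean scores only and ignore the reported error margins.

```java
public class ReadRatio {
  private ReadRatio() {}

  // Ratio of Iceberg time to Spark time; values above 1.0 mean Iceberg is slower.
  static double ratio(double icebergSecondsPerOp, double sparkSecondsPerOp) {
    return icebergSecondsPerOp / sparkSecondsPerOp;
  }

  public static void main(String[] args) {
    // Scores (s/op) copied from the benchmark output in this description.
    System.out.printf("longs,   properties unset: %.2f%n", ratio(1.321, 1.064));
    System.out.printf("strings, properties unset: %.2f%n", ratio(2.187, 1.304));
    System.out.printf("longs,   properties set:   %.2f%n", ratio(0.927, 1.035));
    System.out.printf("strings, properties set:   %.2f%n", ratio(1.306, 1.369));
  }
}
```

On these numbers the Iceberg reader goes from roughly 1.24x (longs) and 1.68x (strings) of the Spark time to roughly 0.90x and 0.95x once the properties are set.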
   


