[GitHub] [incubator-iceberg] shardulm94 opened a new pull request #826: Fix Spark ORC Reader creating new RowWriter for every row

GitBox Tue, 03 Mar 2020 01:42:57 -0800

shardulm94 opened a new pull request #826: Fix Spark ORC Reader creating new 
RowWriter for every row
URL: https://github.com/apache/incubator-iceberg/pull/826
 
 
   The Spark ORC reader was creating a new RowWriter for every row, which led to 
sluggish performance. RowWriter is meant to be reused, and the reader code was 
already performing the necessary reset operations after every row, so the only 
change required was to reuse the RowWriter object.
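
   The difference between the two patterns can be sketched as follows. This is a minimal, self-contained illustration, not the reader code from this PR: `SketchRowWriter`, `readAllocatingPerRow`, and `readReusingWriter` are hypothetical names, and the writer is a stand-in that only counts how many instances get created.

```java
import java.util.Arrays;

final class RowWriterReuseSketch {

  /** Hypothetical stand-in for a reusable row writer; not the real Spark or Iceberg class. */
  static final class SketchRowWriter {
    static int instancesCreated = 0; // counts constructor calls, for the demo only

    private final long[] fields;

    SketchRowWriter(int numFields) {
      this.fields = new long[numFields]; // per-instance buffer allocation
      instancesCreated++;
    }

    /** Clears any state left over from the previous row. */
    void reset() {
      Arrays.fill(fields, 0L);
    }

    void write(int ordinal, long value) {
      fields[ordinal] = value;
    }

    long checksum() {
      long sum = 0;
      for (long field : fields) {
        sum += field;
      }
      return sum;
    }
  }

  /** Before: a new writer is allocated for every row. */
  static long readAllocatingPerRow(long[][] rows, int numFields) {
    long total = 0;
    for (long[] row : rows) {
      SketchRowWriter writer = new SketchRowWriter(numFields);
      writer.reset();
      for (int i = 0; i < numFields; i++) {
        writer.write(i, row[i]);
      }
      total += writer.checksum();
    }
    return total;
  }

  /** After: one writer is allocated up front and reset before each row. */
  static long readReusingWriter(long[][] rows, int numFields) {
    SketchRowWriter writer = new SketchRowWriter(numFields);
    long total = 0;
    for (long[] row : rows) {
      writer.reset();
      for (int i = 0; i < numFields; i++) {
        writer.write(i, row[i]);
      }
      total += writer.checksum();
    }
    return total;
  }

  public static void main(String[] args) {
    long[][] rows = {{1, 2}, {3, 4}, {5, 6}};

    SketchRowWriter.instancesCreated = 0;
    long before = readAllocatingPerRow(rows, 2);
    int writersBefore = SketchRowWriter.instancesCreated;

    SketchRowWriter.instancesCreated = 0;
    long after = readReusingWriter(rows, 2);
    int writersAfter = SketchRowWriter.instancesCreated;

    System.out.println("per-row allocation: result=" + before + ", writers=" + writersBefore);
    System.out.println("reused writer:      result=" + after + ", writers=" + writersAfter);
  }
}
```

   Both methods produce the same result; only the number of writer instances changes, from one per row to one per reader. Because the per-row `reset()` call was already in place, hoisting the allocation out of the loop does not change behavior.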
   
   Tested with local JMH benchmarks I have written for ORC. These are essentially 
copies of the corresponding JMH benchmarks for Parquet. I will send a separate PR 
for the JMH benchmarks since they require enabling ORC writers.
   
   Without Fix:
   ```
   Benchmark                                                                          Mode  Cnt    Score   Error  Units
   IcebergSourceFlatORCDataReadBenchmark.readFileSourceNonVectorized                    ss    5   10.805 ± 6.819   s/op
   IcebergSourceFlatORCDataReadBenchmark.readFileSourceVectorized                       ss    5    4.185 ± 0.713   s/op
   IcebergSourceFlatORCDataReadBenchmark.readIceberg                                    ss    5  128.416 ± 5.686   s/op
   IcebergSourceFlatORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized      ss    5    2.256 ± 0.222   s/op
   IcebergSourceFlatORCDataReadBenchmark.readWithProjectionFileSourceVectorized         ss    5    0.594 ± 0.132   s/op
   IcebergSourceFlatORCDataReadBenchmark.readWithProjectionIceberg                      ss    5  116.871 ± 7.665   s/op
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5   13.563 ± 1.074   s/op
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5   13.631 ± 1.244   s/op
   IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5  118.870 ± 1.925   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5   16.093 ± 1.089   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5   16.923 ± 4.898   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5  118.069 ± 2.229   s/op
   ```
   
   With Fix:
   ```
   Benchmark                                                                          Mode  Cnt    Score   Error  Units
   IcebergSourceFlatORCDataReadBenchmark.readFileSourceNonVectorized                    ss    5    9.220 ± 0.828   s/op
   IcebergSourceFlatORCDataReadBenchmark.readFileSourceVectorized                       ss    5    4.015 ± 0.502   s/op
   IcebergSourceFlatORCDataReadBenchmark.readIceberg                                    ss    5    8.576 ± 1.504   s/op
   IcebergSourceFlatORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized      ss    5    2.292 ± 0.391   s/op
   IcebergSourceFlatORCDataReadBenchmark.readWithProjectionFileSourceVectorized         ss    5    0.664 ± 0.138   s/op
   IcebergSourceFlatORCDataReadBenchmark.readWithProjectionIceberg                      ss    5    1.245 ± 0.848   s/op
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5   15.806 ± 1.035   s/op
   IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5   17.490 ± 0.926   s/op
   IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5    3.238 ± 0.821   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5   22.043 ± 9.899   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5   17.451 ± 9.970   s/op
   IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5    2.885 ± 0.764   s/op
   ```
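
   For a quick comparison, the speedup of the four Iceberg read benchmarks can be computed from the mean scores reported in the two tables. The sketch below is only arithmetic on those numbers; the class and method names are made up for illustration, and the ratios ignore the reported error bounds, which are sizable for some runs.

```java
final class OrcReadSpeedupSketch {

  /** Ratio of the mean time without the fix to the mean time with it. */
  static double speedup(double withoutFixSecondsPerOp, double withFixSecondsPerOp) {
    return withoutFixSecondsPerOp / withFixSecondsPerOp;
  }

  public static void main(String[] args) {
    // Mean scores (s/op) copied from the benchmark tables in this description.
    String[] names = {
      "Flat.readIceberg",
      "Flat.readWithProjectionIceberg",
      "Nested.readIceberg",
      "Nested.readWithProjectionIceberg"
    };
    double[] withoutFix = {128.416, 116.871, 118.870, 118.069};
    double[] withFix = {8.576, 1.245, 3.238, 2.885};

    for (int i = 0; i < names.length; i++) {
      System.out.printf("%-34s %.1fx%n", names[i], speedup(withoutFix[i], withFix[i]));
    }
  }
}
```

   On these means, the Iceberg read paths come out roughly 15x, 94x, 37x, and 41x faster with the fix, while the file source benchmarks, which do not go through this reader, stay within noise of their previous scores.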


