prodeezy opened a new pull request #452: V1 Vectorized reader
URL: https://github.com/apache/incubator-iceberg/pull/452
 
 
   **Changes:**
   - Added a new reader, V1VectorizedReader, that internally short-circuits 
     to the V1 code path [3], which does most of the setup and work needed for 
     vectorization. This is exactly what vanilla Spark's reader does underneath 
     the DSV1 implementation.
   - It builds an iterator that expects ColumnarBatches from the Objects 
     returned by the resolving iterator.
   - We reorganized and optimized the code that builds ReadTask instances, 
     which considerably improved task initiation and planning time.
   - Setting `iceberg.read.enableV1VectorizedReader` to true enables this 
     reader in IcebergSource.
   - The V1Vectorized reader is an independent class, with code copied into 
     some methods, because we didn't want to degrade performance through 
     inheritance/virtual method calls (we noticed degradation when we tried to 
     reuse code).
   - I've pushed this code to a separate branch [4] in case others want to 
     give it a try.
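
The batch iterator mentioned in the list above can be pictured with a minimal sketch. This is not the code from the PR: `Batch` is a hypothetical stand-in for Spark's `ColumnarBatch`, and `BatchIterator`, `totalRows`, and `BatchIteratorSketch` are illustrative names. It assumes only that the underlying iterator is typed as `Object` while its elements are really batches, so the wrapper casts each element and fails fast on anything else.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class BatchIteratorSketch {

  // Hypothetical stand-in for Spark's ColumnarBatch: only tracks a row count.
  static final class Batch {
    private final int numRows;

    Batch(int numRows) {
      this.numRows = numRows;
    }

    int numRows() {
      return numRows;
    }
  }

  // Wraps an iterator typed as Object whose elements are expected to be batches.
  static final class BatchIterator implements Iterator<Batch> {
    private final Iterator<Object> source;

    BatchIterator(Iterator<Object> source) {
      this.source = source;
    }

    @Override
    public boolean hasNext() {
      return source.hasNext();
    }

    @Override
    public Batch next() {
      Object next = source.next();
      // Fail fast if the underlying reader handed back something other than a batch.
      if (!(next instanceof Batch)) {
        throw new IllegalStateException("Expected a batch but got: " + next);
      }
      return (Batch) next;
    }
  }

  // Consumes all batches and returns the total number of rows seen.
  static long totalRows(Iterator<Object> source) {
    long total = 0;
    Iterator<Batch> batches = new BatchIterator(source);
    while (batches.hasNext()) {
      total += batches.next().numRows();
    }
    return total;
  }

  public static void main(String[] args) {
    List<Object> input = Arrays.asList(new Batch(10_000), new Batch(5_000));
    System.out.println(totalRows(input.iterator()));
  }
}
```

The cast happens once per batch rather than once per row, so its cost stays small next to the work of decoding the batch itself.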
   
   
   
   The Numbers:
   
   
   Flat Data: 10 files, 10M rows each
   
   Benchmark                                                              Mode  Cnt   Score   Error  Units
   IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized    ss    5  63.631 ± 1.300   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized       ss    5  28.322 ± 2.400   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readIceberg                    ss    5  65.862 ± 2.480   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readIcebergV1Vectorized10k     ss    5  28.199 ± 1.255   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readIcebergV1Vectorized20k     ss    5  29.822 ± 2.848   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readIcebergV1Vectorized5k      ss    5  27.953 ± 0.949   s/op
   
   
   
   
   
   
   
   Flat Data Projections: 10 files, 10M rows each
   
   Benchmark                                                                            Mode  Cnt   Score   Error  Units
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  11.307 ± 1.791   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5   3.480 ± 0.087   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIceberg                    ss    5  11.057 ± 0.236   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergV1Vectorized10k     ss    5   3.953 ± 1.592   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergV1Vectorized20k     ss    5   3.619 ± 1.305   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergV1Vectorized5k      ss    5   4.109 ± 1.734   s/op
   
   
   
   Filtered Data: 500 files, 10k rows each
   
   Benchmark                                                                          Mode  Cnt  Score   Error  Units
   IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterFileSourceNonVectorized    ss    5  2.139 ± 0.719   s/op
   IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterFileSourceVectorized       ss    5  2.213 ± 0.598   s/op
   IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergNonVectorized       ss    5  0.144 ± 0.029   s/op
   IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergV1Vectorized100k    ss    5  0.179 ± 0.019   s/op
   IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergV1Vectorized10k     ss    5  0.189 ± 0.046   s/op
   IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergV1Vectorized5k      ss    5  0.195 ± 0.137   s/op
   
   
   
   **Perf Notes:**
   - Iceberg V1 Vectorization's real gain (over the current Iceberg 
     implementation) is in flat data scans, where it performs almost exactly 
     the same as vanilla Spark vectorization.
   - Projections work equally well. Nested column projections still do not 
     perform as well, however, because we need to be able to push nested 
     column projections down to Iceberg.
   - We saw a slight overhead with Iceberg V1 Vectorization on smaller 
     workloads, but this goes away with larger data files.
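
To put the notes above in numbers, the reported scores can be turned into speedup ratios. The sketch below uses only the s/op scores from the three benchmark tables and compares the current Iceberg reader with the 10k-batch V1 vectorized variant. `SpeedupSketch` and `speedup` are illustrative names, and the calculation ignores the reported error margins, so treat the ratios as rough (about 2.3x for the flat scan, and below 1x for the small filtered workload).

```java
public class SpeedupSketch {

  // Ratio of baseline time to candidate time; values above 1 mean the candidate is faster.
  static double speedup(double baselineSecondsPerOp, double candidateSecondsPerOp) {
    return baselineSecondsPerOp / candidateSecondsPerOp;
  }

  public static void main(String[] args) {
    // Flat scan: readIceberg vs. readIcebergV1Vectorized10k
    System.out.printf("flat scan:  %.2fx%n", speedup(65.862, 28.199));

    // Projection: readWithProjectionIceberg vs. readWithProjectionIcebergV1Vectorized10k
    System.out.printf("projection: %.2fx%n", speedup(11.057, 3.953));

    // Filter: readWithFilterIcebergNonVectorized vs. readWithFilterIcebergV1Vectorized10k
    System.out.printf("filter:     %.2fx%n", speedup(0.144, 0.189));
  }
}
```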
   
