[GitHub] [iceberg] bryanck opened a new pull request, #5168: Arrow: Pad decimal bytes before passing to decimal vector

GitBox Thu, 30 Jun 2022 08:12:48 -0700


bryanck opened a new pull request, #5168:
URL: https://github.com/apache/iceberg/pull/5168


   The vectorized reader benchmarks show that the Iceberg Parquet vectorized 
reader is slower than the one in Spark when reading decimal types. Profiling 
the code revealed a bottleneck in an Arrow method that pads the byte 
buffer when setting a value in the DecimalVector, specifically [this 
operation](https://github.com/apache/arrow/blob/fb6f200278ecdf65394fc293de8d35edfcda8bde/java/vector/src/main/java/org/apache/arrow/vector/DecimalVector.java#L241).
   
   Runs of [this 
benchmark](https://gist.github.com/anonymous/1c5ceca58222430d7dfb85bc5cc0e6a1) 
showed that calling `Unsafe.setMemory()` can be slower than Java array 
operations. Results of a run are 
[here](https://gist.github.com/bryanck/9e293b8b10d7f29810ec8fdbb0582ae8).
   
   This PR adds a workaround that pads the byte buffer before calling 
`setBigEndian()`, so that `Unsafe.setMemory()` is never called.
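   
   As a rough illustration of the idea (not the code in this PR), the padding 
can be done on the Java heap with plain array operations. The class and method 
names below are hypothetical, and the sketch assumes the value is a big-endian 
two's-complement unscaled decimal that has to be sign-extended to the 16-byte 
width Arrow uses for 128-bit decimals. The padded array would then be handed to 
`setBigEndian()`, which should have nothing left to pad.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical helper, for illustration only; not taken from the PR.
final class DecimalPadding {
  // Assumed width of a 128-bit Arrow decimal, in bytes.
  static final int DECIMAL_WIDTH = 16;

  private DecimalPadding() {}

  // Sign-extends a big-endian two's-complement value to exactly `width` bytes
  // using only Java array operations.
  static byte[] padBigEndian(byte[] unscaled, int width) {
    if (unscaled.length > width) {
      throw new IllegalArgumentException("Value does not fit in " + width + " bytes");
    }
    if (unscaled.length == width) {
      return unscaled; // already full width, nothing to pad
    }
    byte[] padded = new byte[width]; // zero-filled, correct for non-negative values
    int offset = width - unscaled.length;
    if (unscaled.length > 0 && unscaled[0] < 0) {
      // Negative value: the leading pad bytes must be 0xFF.
      Arrays.fill(padded, 0, offset, (byte) 0xFF);
    }
    System.arraycopy(unscaled, 0, padded, offset, unscaled.length);
    return padded;
  }

  public static void main(String[] args) {
    byte[] positive = padBigEndian(new BigInteger("12345").toByteArray(), DECIMAL_WIDTH);
    byte[] negative = padBigEndian(new BigInteger("-12345").toByteArray(), DECIMAL_WIDTH);
    // The padded bytes decode back to the original values.
    System.out.println(new BigInteger(positive)); // prints 12345
    System.out.println(new BigInteger(negative)); // prints -12345
  }
}
```

   In a reader, the result of `padBigEndian(bytes, DECIMAL_WIDTH)` would be 
passed to the vector in place of the shorter array.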
   
   Here are the results of a run of the 
`VectorizedReadDictionaryEncodedFlatParquetDataBenchmark` benchmark without 
this change:
   ```
   Benchmark                                                                                  Mode  Cnt   Score   Error  Units
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5   2.016 ± 0.069   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5   2.083 ± 0.076   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  14.451 ± 0.273   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5   6.886 ± 0.163   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5   2.058 ± 0.108   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5   1.731 ± 0.117   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5   1.905 ± 0.016   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5   2.436 ± 0.178   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5   2.975 ± 0.053   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5   2.461 ± 0.951   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5   2.713 ± 0.075   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5   2.321 ± 0.953   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5   3.154 ± 0.062   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5   4.567 ± 1.864   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5   2.674 ± 0.085   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5   2.634 ± 0.089   s/op
   ```
   
   Here are the results of a run with this change:
   ```
   Benchmark                                                                                  Mode  Cnt  Score   Error  Units
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5  2.339 ± 1.092   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5  2.204 ± 0.085   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  8.501 ± 0.129   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5  7.130 ± 0.111   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5  2.677 ± 0.083   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5  2.251 ± 0.142   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5  2.616 ± 0.090   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5  2.438 ± 0.074   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5  2.620 ± 0.171   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5  2.242 ± 0.140   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5  2.679 ± 0.084   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5  2.504 ± 0.173   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5  3.804 ± 0.215   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5  4.864 ± 0.163   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5  2.544 ± 0.086   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5  2.524 ± 0.193   s/op
   ```
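   
   To put the decimal rows of the two runs side by side, the ratios can be 
worked out with a few lines of Java. The class name is hypothetical and the 
inputs are the `readDecimals...5k` scores reported in the two runs; the error 
margins are ignored, and since the other rows also shift between runs, the 
figures are best read as approximate.

```java
import java.util.Locale;

// Hypothetical helper that only restates the decimal scores from the two runs.
final class DecimalBenchmarkSummary {
  private DecimalBenchmarkSummary() {}

  static double ratio(double numerator, double denominator) {
    return numerator / denominator;
  }

  public static void main(String[] args) {
    double icebergBefore = 14.451; // s/op, Iceberg decimals without the change
    double sparkBefore = 6.886;    // s/op, Spark decimals in the same run
    double icebergAfter = 8.501;   // s/op, Iceberg decimals with the change
    double sparkAfter = 7.130;     // s/op, Spark decimals in the same run

    System.out.printf(Locale.ROOT, "Iceberg speedup: %.2fx%n",
        ratio(icebergBefore, icebergAfter));   // prints 1.70x
    System.out.printf(Locale.ROOT, "Iceberg/Spark before: %.2fx%n",
        ratio(icebergBefore, sparkBefore));    // prints 2.10x
    System.out.printf(Locale.ROOT, "Iceberg/Spark after: %.2fx%n",
        ratio(icebergAfter, sparkAfter));      // prints 1.19x
  }
}
```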


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

