prodeezy opened a new pull request #452: V1 Vectorized reader
URL: https://github.com/apache/incubator-iceberg/pull/452

**Changes:**
- Added a new reader, V1VectorizedReader, that internally short-circuits to the V1 codepath [3], which does most of the setup and work needed to perform vectorization. It's exactly what vanilla Spark's reader does underneath the DSV1 implementation.
- It builds an iterator which expects ColumnarBatches from the Objects returned by the resolving iterator.
- We reorganized and optimized the code that builds ReadTask instances, which considerably improved task initiation and planning time.
- Setting `iceberg.read.enableV1VectorizedReader` to true enables this reader in IcebergSource.
- The V1Vectorized reader is an independent class with code copied into some methods, as we didn't want to degrade performance through inheritance/virtual method calls (we noticed degradation when we tried to reuse code).
- I've pushed this code to a separate branch [4] in case others want to give it a try.

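As a rough illustration of the iterator and the enable flag described in the changes above, here is a minimal plain-Java sketch. It is not the code from this PR: `Batch` is a hypothetical stand-in for Spark's ColumnarBatch, `BatchIterator`, `countRows` and `v1VectorizedEnabled` are made-up names, and treating the flag as a string option that defaults to off is an assumption. Only the property name comes from the description.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class V1ReaderSketch {
  // Property name taken from the PR description.
  public static final String ENABLE_V1_VECTORIZED = "iceberg.read.enableV1VectorizedReader";

  /** Hypothetical stand-in for a columnar batch: a single column of longs. */
  public static final class Batch {
    final long[] values;

    public Batch(long[] values) {
      this.values = values;
    }

    public int numRows() {
      return values.length;
    }
  }

  /** Assumption: the flag is a string option and the reader stays off unless it is "true". */
  public static boolean v1VectorizedEnabled(Map<String, String> options) {
    return Boolean.parseBoolean(options.getOrDefault(ENABLE_V1_VECTORIZED, "false"));
  }

  /** Wraps an iterator of untyped objects and hands each one out as a batch. */
  public static final class BatchIterator implements Iterator<Batch> {
    private final Iterator<Object> underlying;

    public BatchIterator(Iterator<Object> underlying) {
      this.underlying = underlying;
    }

    @Override
    public boolean hasNext() {
      return underlying.hasNext();
    }

    @Override
    public Batch next() {
      Object obj = underlying.next();
      // Fail loudly if the upstream iterator produced something other than a batch.
      if (!(obj instanceof Batch)) {
        throw new IllegalStateException("Expected a batch but got: " + obj);
      }
      return (Batch) obj;
    }
  }

  /** Counts rows by summing batch sizes rather than visiting rows one at a time. */
  public static long countRows(Iterator<Object> source) {
    long total = 0;
    Iterator<Batch> batches = new BatchIterator(source);
    while (batches.hasNext()) {
      total += batches.next().numRows();
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, String> options = new HashMap<>();
    options.put(ENABLE_V1_VECTORIZED, "true");

    List<Object> upstream =
        Arrays.<Object>asList(new Batch(new long[] {1, 2, 3}), new Batch(new long[] {4, 5}));

    if (v1VectorizedEnabled(options)) {
      System.out.println("rows read in batches: " + countRows(upstream.iterator()));
    }
  }
}
```

The point of the sketch is only the shape of the contract: the consumer pulls whole batches from an untyped iterator and trusts that every element is a batch, so a mismatch surfaces as an immediate error rather than silent row-by-row fallback.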
**The Numbers:**

Flat Data (10 files, 10M rows each)

| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized | ss | 5 | 63.631 | ± 1.300 | s/op |
| IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized | ss | 5 | 28.322 | ± 2.400 | s/op |
| IcebergSourceFlatParquetDataReadBenchmark.readIceberg | ss | 5 | 65.862 | ± 2.480 | s/op |
| IcebergSourceFlatParquetDataReadBenchmark.readIcebergV1Vectorized10k | ss | 5 | 28.199 | ± 1.255 | s/op |
| IcebergSourceFlatParquetDataReadBenchmark.readIcebergV1Vectorized20k | ss | 5 | 29.822 | ± 2.848 | s/op |
| IcebergSourceFlatParquetDataReadBenchmark.readIcebergV1Vectorized5k | ss | 5 | 27.953 | ± 0.949 | s/op |

Flat Data Projections (10 files, 10M rows each)

| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceNonVectorized | ss | 5 | 11.307 | ± 1.791 | s/op |
| IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceVectorized | ss | 5 | 3.480 | ± 0.087 | s/op |
| IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIceberg | ss | 5 | 11.057 | ± 0.236 | s/op |
| IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergV1Vectorized10k | ss | 5 | 3.953 | ± 1.592 | s/op |
| IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergV1Vectorized20k | ss | 5 | 3.619 | ± 1.305 | s/op |
| IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergV1Vectorized5k | ss | 5 | 4.109 | ± 1.734 | s/op |

Filtered Data (500 files, 10k rows each)

| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterFileSourceNonVectorized | ss | 5 | 2.139 | ± 0.719 | s/op |
| IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterFileSourceVectorized | ss | 5 | 2.213 | ± 0.598 | s/op |
| IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergNonVectorized | ss | 5 | 0.144 | ± 0.029 | s/op |
| IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergV1Vectorized100k | ss | 5 | 0.179 | ± 0.019 | s/op |
| IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergV1Vectorized10k | ss | 5 | 0.189 | ± 0.046 | s/op |
| IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergV1Vectorized5k | ss | 5 | 0.195 | ± 0.137 | s/op |

**Perf Notes:**
- Iceberg V1 Vectorization's real gain (over the current Iceberg implementation) is in flat data scans. Notice how its timings are almost exactly the same as vanilla Spark vectorization.
- Projections work equally well, although nested column projections still do not perform as well, since we need to be able to push nested column projections down to Iceberg.
- We saw a slight overhead with Iceberg V1 Vectorization on smaller workloads, but this goes away with larger data files.
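
To put the scores above in relative terms, the small Java sketch below divides the current Iceberg reader's time by the V1 vectorized reader's time for one row from each benchmark group. The class and method names are illustrative, and picking the 10k variant as the comparison point is an arbitrary choice. The ratios use the mean scores only and ignore the reported error margins, which are large for some of the projection runs.

```java
import java.util.Locale;

class BenchmarkSpeedup {
  /** Ratio of baseline time to candidate time; a value above 1 means the candidate is faster. */
  public static double speedup(double baselineSecondsPerOp, double candidateSecondsPerOp) {
    if (candidateSecondsPerOp <= 0.0) {
      throw new IllegalArgumentException("candidate time must be positive");
    }
    return baselineSecondsPerOp / candidateSecondsPerOp;
  }

  public static void main(String[] args) {
    // Mean scores (s/op) copied from the benchmark results: Iceberg reader vs. V1Vectorized10k.
    System.out.printf(Locale.ROOT, "flat scan:  %.2fx%n", speedup(65.862, 28.199));
    System.out.printf(Locale.ROOT, "projection: %.2fx%n", speedup(11.057, 3.953));
    System.out.printf(Locale.ROOT, "filter:     %.2fx%n", speedup(0.144, 0.189));
  }
}
```

On these means the flat scan and projection cases come out at roughly 2.3x and 2.8x faster, while the filtered case is about 0.76x, which is the slight small-workload overhead mentioned in the notes.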