CodiumAI-Agent commented on PR #8937: URL: https://github.com/apache/incubator-gluten/pull/8937#issuecomment-2708191410
## PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

<table>
<tr><td> **🎫 Ticket compliance analysis 🔶**

**[8846](https://github.com/apache/incubator-gluten/issues/8846) - Partially compliant**

Compliant requirements:

- Integration of new delete file readers (EqualityDeleteFileReader and PositionalDeleteFileReader).
- Enhanced Iceberg reader that supports both delete-expression and bitmap deletion filtering.
- Extensive tests added/updated covering varied delete scenarios.

Non-compliant requirements:

- Support for deleting columns not in the select set is still pending.

Requires further human verification:

- Benchmark performance across different datasets.
- Verify correctness of row-filtering logic for edge-case RowGroup boundaries.

</td></tr>
<tr><td>⏱️ <strong>Estimated effort to review</strong>: 4 🔵🔵🔵🔵⚪</td></tr>
<tr><td>🧪 <strong>PR contains tests</strong></td></tr>
<tr><td>🔒 <strong>No security concerns identified</strong></td></tr>
<tr><td>⚡ <strong>Recommended focus areas for review</strong><br><br>
<details><summary><a href='https://github.com/apache/incubator-gluten/pull/8937/files#diff-b74de69bb5b8b743a6a74fafe0bbb47238b7a7c0cc10c00a3f9e30977a5922ffR84-R103'><strong>Complex Filtering Logic</strong></a>

The merging of equality and positional delete logic introduces non-trivial filtering; ensure all edge cases (empty deletes, complete deletes, and mixed scenarios) are thoroughly validated.
</summary>

```c++
IcebergReader::~IcebergReader() = default;

Chunk IcebergReader::doPull()
{
    if (!delete_expr && !delete_bitmap_array)
        return NormalFileReader::doPull();

    while (true)
    {
        Chunk chunk = NormalFileReader::doPull();
        if (chunk.getNumRows() == 0)
            return chunk;

        Block deleted_block;
        if (delete_expr)
            deleted_block = applyEqualityDelete(chunk);

        if (delete_bitmap_array)
        {
```

</details>
<details><summary><a href='https://github.com/apache/incubator-gluten/pull/8937/files#diff-b4b623e335b7b298f164645c008bbae839fa76dbb1460500c9df35ba244069abR800-R915'><strong>Test Coverage</strong></a>

New tests add extensive delete scenarios with random and continuous values. Review that all potential edge cases are covered and that test expectations align with the new deletion logic.
</summary>

```c++
/// This test creates one single data file and one delete file. The parameter
/// passed to assertSingleBaseFileSingleDeleteFile is the delete positions.
TEST_F(IcebergTest, singleBaseFileSinglePositionalDeleteFile) {
  assertSingleBaseFileSingleDeleteFile({{0, 1, 2, 3}});

  // Delete the first and last row in each batch (10000 rows per batch)
  assertSingleBaseFileSingleDeleteFile({{0, 9999, 10000, 19999}});

  // Delete several rows in the second batch (10000 rows per batch)
  assertSingleBaseFileSingleDeleteFile({{10000, 10002, 19999}});

  // Delete random rows
  assertSingleBaseFileSingleDeleteFile({makeRandomIncreasingValues(0, 20000)});

  // Delete 0 rows
  assertSingleBaseFileSingleDeleteFile({});

  // Delete all rows
  assertSingleBaseFileSingleDeleteFile(
      {makeContinuousIncreasingValues(0, 20000)});

  // Delete rows that don't exist
  assertSingleBaseFileSingleDeleteFile({{20000, 29999}});
}

/// This test creates 3 base data files; only the middle one has corresponding
/// delete positions. The parameter passed to
/// assertMultipleBaseFileSingleDeleteFile is the delete positions for the
/// middle base file.
TEST_F(IcebergTest, MultipleBaseFilesSinglePositionalDeleteFile) {
  assertMultipleBaseFileSingleDeleteFile({0, 1, 2, 3});
  assertMultipleBaseFileSingleDeleteFile({0, 9999, 10000, 19999});
  assertMultipleBaseFileSingleDeleteFile({10000, 10002, 19999});
  assertMultipleBaseFileSingleDeleteFile(
      makeRandomIncreasingValues(0, rowCount));
  assertMultipleBaseFileSingleDeleteFile({});
  assertMultipleBaseFileSingleDeleteFile(
      makeContinuousIncreasingValues(0, rowCount));
}

/// This test creates one base data file/split with multiple delete files. The
/// parameter passed to assertSingleBaseFileMultipleDeleteFiles is the vector of
/// delete files. Each leaf vector represents the delete positions in that
/// delete file.
TEST_F(IcebergTest, singleBaseFileMultiplePositionalDeleteFiles) {
  // Delete row 0, 1, 2, 3 from the first batch out of two.
  assertSingleBaseFileMultipleDeleteFiles({{1}, {2}, {3}, {4}});

  // Delete the first and last row in each batch (10000 rows per batch).
  assertSingleBaseFileMultipleDeleteFiles({{0}, {9999}, {10000}, {19999}});

  assertSingleBaseFileMultipleDeleteFiles({{500, 21000}});

  assertSingleBaseFileMultipleDeleteFiles(
      {makeRandomIncreasingValues(0, 10000),
       makeRandomIncreasingValues(10000, 20000),
       makeRandomIncreasingValues(5000, 15000)});

  assertSingleBaseFileMultipleDeleteFiles(
      {makeContinuousIncreasingValues(0, 10000),
       makeContinuousIncreasingValues(10000, 20000)});

  assertSingleBaseFileMultipleDeleteFiles(
      {makeContinuousIncreasingValues(0, 10000),
       makeContinuousIncreasingValues(10000, 20000),
       makeRandomIncreasingValues(5000, 15000)});

  assertSingleBaseFileMultipleDeleteFiles(
      {makeContinuousIncreasingValues(0, 20000),
       makeContinuousIncreasingValues(0, 20000)});

  assertSingleBaseFileMultipleDeleteFiles(
      {makeRandomIncreasingValues(0, 20000),
       {},
       makeRandomIncreasingValues(5000, 15000)});

  assertSingleBaseFileMultipleDeleteFiles({{}, {}});
}

/// This test creates 2 base data files, and 1 or 2 delete files, with unaligned
/// RowGroup boundaries.
TEST_F(IcebergTest, multipleBaseFileMultiplePositionalDeleteFiles) {
  std::map<std::string, std::vector<int64_t>> rowGroupSizesForFiles;
  std::unordered_map<
      std::string,
      std::multimap<std::string, std::vector<int64_t>>>
      deleteFilesForBaseDatafiles;

  // Create two data files, each with two RowGroups
  rowGroupSizesForFiles["data_file_1"] = {100, 85};
  rowGroupSizesForFiles["data_file_2"] = {99, 1};

  // Delete 3 rows from the first RowGroup in data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {{"data_file_1", {0, 1, 99}}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);

  // Delete 3 rows from the second RowGroup in data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {
      {"data_file_1", {100, 101, 184}}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);

  // Delete random rows from both RowGroups in data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {
      {"data_file_1", makeRandomIncreasingValues(0, 185)}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);

  // Delete all rows in data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {
      {"data_file_1", makeContinuousIncreasingValues(0, 185)}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);

  // Delete non-existent rows from data_file_1
  deleteFilesForBaseDatafiles["delete_file_1"] = {
      {"data_file_1", makeRandomIncreasingValues(186, 300)}};
  assertPositionalDeletes(rowGroupSizesForFiles, deleteFilesForBaseDatafiles);

  // Delete several rows from both RowGroups in both data files
```

</details>
</td></tr>
</table>

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
