Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/21388 )
Change subject: IMPALA-12867: Filter files to OPTIMIZE based on file size ...................................................................... Patch Set 10: (10 comments) Thanks NoƩmi! http://gerrit.cloudera.org:8080/#/c/21388/10/fe/src/test/java/org/apache/impala/analysis/ParserTest.java File fe/src/test/java/org/apache/impala/analysis/ParserTest.java: http://gerrit.cloudera.org:8080/#/c/21388/10/fe/src/test/java/org/apache/impala/analysis/ParserTest.java@1975 PS10, Line 1975: 0.1 It could be mentioned in the commit message that the value must be an integer. http://gerrit.cloudera.org:8080/#/c/21388/6/fe/src/test/java/org/apache/impala/util/IcebergFileFilterTest.java File fe/src/test/java/org/apache/impala/util/IcebergFileFilterTest.java: http://gerrit.cloudera.org:8080/#/c/21388/6/fe/src/test/java/org/apache/impala/util/IcebergFileFilterTest.java@70 PS6, Line 70: TIcebergOptimizationMode expectedMode, int expectedSize, > It is more of a PreconditionCheck that checks the correctness of the test c Now you added an ASSERT, didn't you want to add a PreconditionCheck instead? I agree that a PreconditionCheck is better. http://gerrit.cloudera.org:8080/#/c/21388/6/tests/query_test/test_iceberg.py File tests/query_test/test_iceberg.py: http://gerrit.cloudera.org:8080/#/c/21388/6/tests/query_test/test_iceberg.py@1773 PS6, Line 1773: def test_optimize(self, vector, unique_database): > This is just an overview of what should be checked. The actual checks cover I think the comment is misleading, but if explained in more detail it could be useful. But in the case of partition evolution, we only check that the set of partitions with new files is disjoint from the set of partitions with removed files, not that the number of new files equals the number of new partitions. Is it the case or is the number of new files unrelated in this case? http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py File tests/query_test/test_iceberg.py: http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py@1791 PS10, Line 1791: check_file_filtering Could prefix the helper function name with a '_': '_check_file_filtering'. http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py@1799 PS10, Line 1799: # file: 0:content, 1:path, 2:partition, 3:size Could use a 'namedtuple' instead of a tuple - field access would be more readable with a field name than an index. For 'namedtuple', see https://docs.python.org/3/library/collections.html#collections.namedtuple http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py@1803 PS10, Line 1803: files_per_partition[partition] = set() Could use 'defaultdict' to avoid looking up the key twice, see https://docs.python.org/3/library/collections.html#collections.defaultdict http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py@1813 PS10, Line 1813: if int(file[3]) < threshold_bytes: We now include delete files among small files. Probably not a problem, but would be clearer to omit them. http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py@1824 PS10, Line 1824: if Could be 'elif'. http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py@1842 PS10, Line 1842: new_files = files_after - unchanged_files Could also add a check that (files_after - unchanged_files) == (files_after - files_before). http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py@1857 PS10, Line 1857: # Check that all delete files were merged. This checks that all new files are data files (not delete files), not that all delete files have been merged. That could be checked by iterating over 'files_after'. -- To view, visit http://gerrit.cloudera.org:8080/21388 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfbb589513aacdb68a86c1aec4a0d39b12091820 Gerrit-Change-Number: 21388 Gerrit-PatchSet: 10 Gerrit-Owner: Noemi Pap-Takacs <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Sat, 03 Aug 2024 13:19:17 +0000 Gerrit-HasComments: Yes
