Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21388 )

Change subject: IMPALA-12867: Filter files to OPTIMIZE based on file size
......................................................................


Patch Set 10:

(10 comments)

Thanks NoƩmi!

http://gerrit.cloudera.org:8080/#/c/21388/10/fe/src/test/java/org/apache/impala/analysis/ParserTest.java
File fe/src/test/java/org/apache/impala/analysis/ParserTest.java:

http://gerrit.cloudera.org:8080/#/c/21388/10/fe/src/test/java/org/apache/impala/analysis/ParserTest.java@1975
PS10, Line 1975: 0.1
It could be mentioned in the commit message that the value must be an integer.


http://gerrit.cloudera.org:8080/#/c/21388/6/fe/src/test/java/org/apache/impala/util/IcebergFileFilterTest.java
File fe/src/test/java/org/apache/impala/util/IcebergFileFilterTest.java:

http://gerrit.cloudera.org:8080/#/c/21388/6/fe/src/test/java/org/apache/impala/util/IcebergFileFilterTest.java@70
PS6, Line 70:       TIcebergOptimizationMode expectedMode, int expectedSize,
> It is more of a PreconditionCheck that checks the correctness of the test c
Now you added an ASSERT, didn't you want to add a PreconditionCheck instead? I 
agree that a PreconditionCheck is better.


http://gerrit.cloudera.org:8080/#/c/21388/6/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/21388/6/tests/query_test/test_iceberg.py@1773
PS6, Line 1773:   def test_optimize(self, vector, unique_database):
> This is just an overview of what should be checked. The actual checks cover
I think the comment is misleading, but if explained in more detail it could be 
useful. But in the case of partition evolution, we only check that the set of 
partitions with new files is disjoint from the set of partitions with removed 
files, not that the number of new files equals the number of new partitions. Is 
it the case or is the number of new files unrelated in this case?


http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py@1791
PS10, Line 1791: check_file_filtering
Could prefix the helper function name with a '_': '_check_file_filtering'.


http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py@1799
PS10, Line 1799:       # file: 0:content, 1:path, 2:partition, 3:size
Could use a 'namedtuple' instead of a tuple - field access would be more 
readable with a field name than an index. For 'namedtuple', see 
https://docs.python.org/3/library/collections.html#collections.namedtuple


http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py@1803
PS10, Line 1803:         files_per_partition[partition] = set()
Could use 'defaultdict' to avoid looking up the key twice, see 
https://docs.python.org/3/library/collections.html#collections.defaultdict


http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py@1813
PS10, Line 1813:             if int(file[3]) < threshold_bytes:
We now include delete files among small files. Probably not a problem, but 
would be clearer to omit them.


http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py@1824
PS10, Line 1824: if
Could be 'elif'.


http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py@1842
PS10, Line 1842:     new_files = files_after - unchanged_files
Could also add a check that (files_after - unchanged_files) == (files_after - 
files_before).


http://gerrit.cloudera.org:8080/#/c/21388/10/tests/query_test/test_iceberg.py@1857
PS10, Line 1857:       # Check that all delete files were merged.
This checks that all new files are data files (not delete files), not that all 
delete files have been merged.
That could be checked by iterating over 'files_after'.



--
To view, visit http://gerrit.cloudera.org:8080/21388
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfbb589513aacdb68a86c1aec4a0d39b12091820
Gerrit-Change-Number: 21388
Gerrit-PatchSet: 10
Gerrit-Owner: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Sat, 03 Aug 2024 13:19:17 +0000
Gerrit-HasComments: Yes

Reply via email to