Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16392 )
Change subject: MPALA-10075: Reuse unchanged partition instances ...................................................................... Patch Set 5: (10 comments) http://gerrit.cloudera.org:8080/#/c/16392/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16392/5//COMMIT_MSG@7 PS5, Line 7: MPALA typo: IMPALA http://gerrit.cloudera.org:8080/#/c/16392/5//COMMIT_MSG@11 PS5, Line 11: redudant typo: redundant http://gerrit.cloudera.org:8080/#/c/16392/5//COMMIT_MSG@13 PS5, Line 13: changes typo: changed http://gerrit.cloudera.org:8080/#/c/16392/5/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java File fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java: http://gerrit.cloudera.org:8080/#/c/16392/5/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java@293 PS5, Line 293: if (oldFd == null || oldFd.getFileLength() != fd.getFileLength() : || oldFd.getModificationTime() != fd.getModificationTime() Are you sure that we shouldn't check other properties too? Or the modification time will change for all other changes? An example that comes to mind is HDFS rebalancing - my guess is that if the locations of some blocks change for a file, the modification time won't change, so we won't propagate the change, which can lead to suboptimal plans. One more thing: I would prefer to move this logic to FileDescriptor, e.g. in a function like equalsTo(), or isUnchanged(). http://gerrit.cloudera.org:8080/#/c/16392/5/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/16392/5/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@1541 PS5, Line 1541: filesChanged_ I think that relying on filesChanged_ is counter intuitive in equalsTo. Changing the name to something like "equalsToOriginal" would reflect the semantics better, as we assume here that the partition is compared to the same partition before the update. Checking the files here again is also a solution - it would waste some CPU but the semantics would be clear. http://gerrit.cloudera.org:8080/#/c/16392/5/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@1541 PS5, Line 1541: oldInstance == oldInstance_ Is it possible for these two to be different? http://gerrit.cloudera.org:8080/#/c/16392/5/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@1545 PS5, Line 1545: oldInstance_ We are using oldInstance_ here and in the last line while using oldInstance in all other places. This is correct as we have checked that the two are equal, but I would prefer more consistency. http://gerrit.cloudera.org:8080/#/c/16392/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/16392/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@933 PS5, Line 933: // TODO: trace Do you plan to change the logging below? http://gerrit.cloudera.org:8080/#/c/16392/5/tests/metadata/test_reuse_partitions.py File tests/metadata/test_reuse_partitions.py: http://gerrit.cloudera.org:8080/#/c/16392/5/tests/metadata/test_reuse_partitions.py@49 PS5, Line 49: test_reuse_partitions An idea to extend this test is to run the same queries for an insert_only transactional table. Some operations are not supported there (ALTER TABLE), but I think that other operations should lead to the same result. What may affect this is writeId - if it is changed while the partition actually doesn't change, it could lead to less reuse than in the the non-transactional case. It would be good to verify that this doesn't happen. http://gerrit.cloudera.org:8080/#/c/16392/5/tests/metadata/test_reuse_partitions.py@55 PS5, Line 55: "create table %s.%s (id int) partitioned by (p int) stored as textfile" : % (unique_database, tbl_name)) nit: +2 indentation -- To view, visit http://gerrit.cloudera.org:8080/16392 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2dd645c260d271291021e52fdac4b74924df1170 Gerrit-Change-Number: 16392 Gerrit-PatchSet: 5 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Vihang Karajgaonkar <[email protected]> Gerrit-Comment-Date: Tue, 06 Oct 2020 10:44:08 +0000 Gerrit-HasComments: Yes
