Noemi Pap-Takacs has posted comments on this change. ( http://gerrit.cloudera.org:8080/22873 )
Change subject: IMPALA-14014: Fix COMPUTE STATS with TABLESAMPLE clause ...................................................................... Patch Set 6: (2 comments) Thanks for this patch! I see that it follows the logic of the existing implementation for HDFS tables and fits it nicely. I just wondered if we can be a bit smarter with our resources and open a discussion about it. But if it seems too complicated, then I am happy to give +2 to this change as it is now. http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java: http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@830 PS6, Line 830: Map<Long, List<FileDescriptor>> sample = feFsTable.getFilesSample( : samplePerc, minSampleBytes, sampleSeed); Do I understand it correctly that we calculate the file samples twice? First here and then in the ScanNode? Is it necessary to load everything here for the checks? Couldn't we just check the parameters in the analyzer? Or alternatively, would it be too complicated to propagate the selected sample files in the analysisResult through the Planner so that we won't need to recalculate the same again? http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java File fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java: http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java@364 PS6, Line 364: // We don't sample delete files (for correctness), let's add all of them to : // the merged result. AFAIK a delete file can reference multiple data files. Is it possible that we end up with an incorrect cardinality estimate in the IcebergDeleteNode and above in the plan tree because of this if the selectivity of data files (with deletes) is very high? If we estimate that all rows from delete files will be subtracted from the data files, that could underestimate the actual number of rows we get after the anti-join, so we "oversample" the table, which consumes more resources than planned. -- To view, visit http://gerrit.cloudera.org:8080/22873 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie59d5fc1374ab69209a74f2488bcb9a7d510b782 Gerrit-Change-Number: 22873 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Fri, 30 May 2025 19:06:42 +0000 Gerrit-HasComments: Yes
