Noemi Pap-Takacs has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22873 )

Change subject: IMPALA-14014: Fix COMPUTE STATS with TABLESAMPLE clause
......................................................................


Patch Set 6:

(2 comments)

Thanks for this patch!
I see that it follows the logic of the existing implementation for HDFS tables
and fits in nicely. I just wondered whether we could be a bit smarter with our
resources, and I would like to open a discussion about that. But if it turns out
to be too complicated, I am happy to give +2 to this change as it is now.

http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@830
PS6, Line 830: Map<Long, List<FileDescriptor>> sample = feFsTable.getFilesSample(
             :         samplePerc, minSampleBytes, sampleSeed);
Do I understand correctly that we calculate the file sample twice: first here and
then again in the ScanNode? Is it necessary to load everything here just for the
checks? Couldn't we validate the parameters in the analyzer instead? Or,
alternatively, would it be too complicated to propagate the selected sample files
in the analysisResult through the Planner, so that we don't need to recalculate
the same sample again?
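
To illustrate what I mean, here is a rough standalone sketch; the names and types
are made up for the example and don't reflect the actual Impala classes or API:

  // Hypothetical sketch with simplified types; not the real Impala code.
  import java.util.List;
  import java.util.Map;

  class SampleHandoffSketch {
    // Option A: cheap parameter validation in the analyzer, without listing any
    // files. (The exact bounds are illustrative only.)
    static void validateSampleParams(double samplePerc, long minSampleBytes) {
      if (samplePerc < 0.0 || samplePerc > 100.0) {
        throw new IllegalArgumentException(
            "Sample percentage out of range: " + samplePerc);
      }
      if (minSampleBytes < 0) {
        throw new IllegalArgumentException(
            "Negative min sample bytes: " + minSampleBytes);
      }
    }

    // Option B: compute the sample once and carry it through to the planner so
    // the ScanNode can reuse it instead of resampling. Keyed by partition id;
    // the values are file paths here only to keep the sketch self-contained.
    private Map<Long, List<String>> sampledFilesByPartition_;

    void setSampledFiles(Map<Long, List<String>> sample) {
      sampledFilesByPartition_ = sample;
    }

    Map<Long, List<String>> getSampledFiles() { return sampledFilesByPartition_; }
  }

Option A would be enough if the sample is only loaded here for the checks; Option B
would avoid the second getFilesSample() call entirely.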


http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
File fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java:

http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java@364
PS6, Line 364: // We don't sample delete files (for correctness), let's add all of them to
             :     // the merged result.
AFAIK a delete file can reference multiple data files. Because of this, is it
possible that we end up with an incorrect cardinality estimate in the
IcebergDeleteNode and above it in the plan tree when the selectivity of the data
files (the ones with deletes) is very high? If we estimate that all rows from the
delete files will be subtracted from the data files, we could underestimate the
actual number of rows we get after the anti-join, so we "oversample" the table,
which consumes more resources than planned.
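
A made-up numeric example of the effect I have in mind (all numbers are invented,
just to show the direction of the error):

  // Hypothetical back-of-the-envelope example: a delete file references rows
  // across several data files, but only some of those data files end up in the
  // sample, while we keep every delete file.
  public class DeleteSampleEstimateSketch {
    public static void main(String[] args) {
      long sampledDataRows = 1_000_000;  // rows in the sampled data files
      long deleteFileRows = 200_000;     // rows in the delete files (all kept)
      // Suppose only half of the delete positions actually hit the sampled data files:
      long deletesHittingSample = 100_000;

      long estimatedAfterAntiJoin = sampledDataRows - deleteFileRows;     // 800,000
      long actualAfterAntiJoin = sampledDataRows - deletesHittingSample;  // 900,000

      System.out.println("Estimated rows after anti-join: " + estimatedAfterAntiJoin);
      System.out.println("Actual rows after anti-join:    " + actualAfterAntiJoin);
      // The estimate undershoots the actual row count, which is where the
      // "oversampling" and the extra resource usage could come from.
    }
  }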



--
To view, visit http://gerrit.cloudera.org:8080/22873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie59d5fc1374ab69209a74f2488bcb9a7d510b782
Gerrit-Change-Number: 22873
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 30 May 2025 19:06:42 +0000
Gerrit-HasComments: Yes
