Hello Andrew Sherman, Tamas Mate, [email protected], Gergely Fürnstáhl,
Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/19354
to look at the new patch set (#4).
Change subject: IMPALA-11787, IMPALA-11516: Cardinality estimate for UNION in
Iceberg position-delete plans can double the actual table cardinality
......................................................................
IMPALA-11787, IMPALA-11516: Cardinality estimate for UNION in Iceberg
position-delete plans can double the actual table cardinality
The plan for Iceberg tables with position-delete files includes a UNION
operator that takes the following inputs:
LHS: Scan of the data files that don't have corresponding delete files
RHS: ANTI JOIN that filters the data files that do have corresponding
delete files based on the content of the delete files.
The planner's cardinality estimate for each of these two inputs to the
UNION can be as large as the full row count of the table (assuming no
other predicates in the scan). The planner simply sums these estimates
in the UNION, which can result in a cardinality estimate for the UNION
that's twice the size of the table.
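To illustrate the overestimate, here is a minimal Java sketch. The class
and method names are hypothetical and are not taken from the Impala
planner, and the row count is an assumed example value. It only
reproduces the arithmetic of summing two inputs that can each be
estimated at the full table row count.

```java
// Hypothetical illustration only; this is not Impala planner code.
class UnionCardinalitySketch {
  // Naive UNION estimate: the sum of the estimates of its two inputs.
  static long unionEstimate(long lhsEstimate, long rhsEstimate) {
    return lhsEstimate + rhsEstimate;
  }

  public static void main(String[] args) {
    long tableRows = 1000L;   // assumed example table row count
    long lhs = tableRows;     // scan of data files without delete files
    long rhs = tableRows;     // ANTI JOIN over data files with delete files
    // Each input may be estimated at the full table size, so the sum doubles it.
    System.out.println(unionEstimate(lhs, rhs));  // prints 2000
  }
}
```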
In this patch IcebergScanNode overrides computeCardinalities() of the
HdfsScanNode. The method is implemented similarly with a few
modifications:
* we know the exact record counts of the data files
* for table sampling we know the file descriptors, hence the record
counts as well
* IDENTITY-based partition conjuncts have already filtered out the
files, so we don't need their selectivity
So we calculate the SCAN NODE's cardinalities much more precisely.
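The record-count-based estimate described above can be sketched roughly
as follows. This is a simplified illustration, not the actual
IcebergScanNode.computeCardinalities() implementation: the DataFile
class, its fields, and the scanCardinality() helper are hypothetical
stand-ins, and the file names and counts are made up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Impala planner code.
class IcebergScanCardinalitySketch {
  // Stand-in for a file descriptor that carries the Iceberg record count.
  static final class DataFile {
    final String path;
    final long recordCount;

    DataFile(String path, long recordCount) {
      this.path = path;
      this.recordCount = recordCount;
    }
  }

  // Sums the known record counts of the files the scan will actually read.
  // The same helper covers table sampling: pass only the sampled files.
  static long scanCardinality(List<DataFile> filesToScan) {
    long total = 0L;
    for (DataFile file : filesToScan) {
      total += file.recordCount;
    }
    return total;
  }

  public static void main(String[] args) {
    List<DataFile> allFiles = Arrays.asList(
        new DataFile("a.parquet", 100L),
        new DataFile("b.parquet", 200L),
        new DataFile("c.parquet", 50L));
    // Assume sampling selected only the first two files.
    List<DataFile> sampledFiles = allFiles.subList(0, 2);

    System.out.println(scanCardinality(allFiles));      // prints 350
    System.out.println(scanCardinality(sampledFiles));  // prints 300
  }
}
```

Because the files were already pruned by the IDENTITY-based partition
conjuncts, the sketch applies no additional selectivity for them.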
This patch also sets the column stats for the virtual columns of the
scan node of the left-hand side of the ANTI JOIN. However, because of
IMPALA-11797, the ANTI JOIN's cardinality still always equals the
LHS cardinality. IMPALA-11619 can also resolve this.
Testing:
* planner tests updated
Change-Id: Ie2927c58c4adfd0ba1e135b63454ac9b07991cbf
---
M common/fbs/IcebergObjects.fbs
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/main/java/org/apache/impala/util/IcebergUtil.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
9 files changed, 419 insertions(+), 91 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/19354/4
--
To view, visit http://gerrit.cloudera.org:8080/19354
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie2927c58c4adfd0ba1e135b63454ac9b07991cbf
Gerrit-Change-Number: 19354
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Andrew Sherman <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Gergely Fürnstáhl <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Tamas Mate <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>