David Rorke created IMPALA-13721:
------------------------------------
Summary: CPU costing for scan materialization is using wrong value
for input cardinality
Key: IMPALA-13721
URL: https://issues.apache.org/jira/browse/IMPALA-13721
Project: IMPALA
Issue Type: Bug
Components: Frontend
Reporter: David Rorke
Assignee: David Rorke
When COMPUTE_PROCESSING_COST is enabled, the materialization cost for the HDFS
scan node is computed based on an estimate of bytes materialized calculated as:
{noformat}
estBytes = (long) Math.ceil(avgRowDataSize * (double) inputCardinality)
{noformat}
where inputCardinality is currently the filtered input cardinality (after
accounting for runtime filters) returned by getFilteredInputCardinality().
This is correct when all runtime filters are "partition" filters that skip
reading entire files and row groups. But if some or all of the runtime filters
are "row-level" filters applied after the rows are materialized, then
getFilteredInputCardinality() reflects the cardinality after row-level
filtering, and using it underestimates the materialization cost.
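To give a sense of the size of the underestimate, here is a minimal sketch that
applies the formula above to two cardinalities. All numbers and the class name
are made up for illustration; only the estBytes formula comes from the
description.

```java
// Illustration only: the row size and cardinalities are assumed values.
final class MaterializationUnderestimate {
    // Same formula as in the description above.
    static long estBytes(double avgRowDataSize, long inputCardinality) {
        return (long) Math.ceil(avgRowDataSize * (double) inputCardinality);
    }

    public static void main(String[] args) {
        double avgRowDataSize = 100.0;        // assumed average bytes per row
        long rowsMaterialized = 1_000_000L;   // assumed rows left after partition filters
        long rowsAfterRowFilters = 100_000L;  // assumed rows left after row-level filters too

        // What the current costing uses: cardinality after all runtime filters.
        System.out.println("estimate with filtered cardinality: "
            + estBytes(avgRowDataSize, rowsAfterRowFilters));
        // What the scan would actually materialize under these assumptions.
        System.out.println("bytes actually materialized:        "
            + estBytes(avgRowDataSize, rowsMaterialized));
    }
}
```

With these assumed values the estimate is off by the selectivity of the
row-level filters (a factor of 10 here).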
We should instead use an input cardinality that reflects the estimated
cardinality after applying only those runtime filters that eliminate data
before materialization, ignoring the impact of runtime filters that are
applied after materialization.
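A rough sketch of the proposed behavior follows. FilterSketch,
cardinalityBeforeMaterialization and the selectivity values are hypothetical
and are not Impala's actual classes or planner code. Multiplying
selectivities assumes the filters are independent, which is a simplification.

```java
// Hypothetical sketch of the proposed costing input; not Impala's real API.
final class PreMaterializationCardinality {
    // Stand-in for a runtime filter with an assumed selectivity estimate.
    static final class FilterSketch {
        final double selectivity;            // assumed fraction of rows kept, in (0, 1]
        final boolean beforeMaterialization; // true if it skips files or row groups

        FilterSketch(double selectivity, boolean beforeMaterialization) {
            this.selectivity = selectivity;
            this.beforeMaterialization = beforeMaterialization;
        }
    }

    // Applies only the filters that eliminate data before materialization.
    static long cardinalityBeforeMaterialization(long unfilteredCardinality,
                                                 FilterSketch... filters) {
        double cardinality = (double) unfilteredCardinality;
        for (FilterSketch f : filters) {
            if (f.beforeMaterialization) cardinality *= f.selectivity;
        }
        return (long) Math.ceil(cardinality);
    }

    // Same formula as in the description above.
    static long estBytes(double avgRowDataSize, long inputCardinality) {
        return (long) Math.ceil(avgRowDataSize * (double) inputCardinality);
    }

    public static void main(String[] args) {
        FilterSketch partitionFilter = new FilterSketch(0.5, true);  // assumed
        FilterSketch rowLevelFilter = new FilterSketch(0.1, false);  // assumed

        // The row-level filter is ignored for materialization costing.
        long inputCardinality = cardinalityBeforeMaterialization(
            1_000_000L, partitionFilter, rowLevelFilter);
        System.out.println("input cardinality: " + inputCardinality);
        System.out.println("estBytes: " + estBytes(100.0, inputCardinality));
    }
}
```

The row-level filter would still reduce the scan's output cardinality; it is
only excluded from the materialization estimate.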