[
https://issues.apache.org/jira/browse/IMPALA-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Riza Suminto reassigned IMPALA-13721:
-------------------------------------
Assignee: Riza Suminto (was: David Rorke)
> CPU costing for scan materialization is using wrong value for input
> cardinality
> -------------------------------------------------------------------------------
>
> Key: IMPALA-13721
> URL: https://issues.apache.org/jira/browse/IMPALA-13721
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: David Rorke
> Assignee: Riza Suminto
> Priority: Major
>
> When COMPUTE_PROCESSING_COST is enabled, the materialization cost for the
> HDFS scan node is computed based on an estimate of bytes materialized
> calculated as:
> {noformat}
> estBytes = (long) Math.ceil(avgRowDataSize * (double) inputCardinality)
> {noformat}
> where inputCardinality is currently the filtered input cardinality (after
> accounting for runtime filters) returned by getFilteredInputCardinality().
> This is the correct approach when all runtime filters are "partition" filters
> that skip reading entire files and row groups. But if some or all of the
> runtime filters are "row level" filters that are applied after the rows are
> materialized, then getFilteredInputCardinality() reflects the cardinality
> after this row level filtering and so using it will underestimate the
> materialization cost.
> We should use an input cardinality here that reflects the estimated
> cardinality after applying all runtime filters that eliminate data prior to
> materialization (and ignoring the impact of runtime filters that are applied
> after materialization).
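
The proposed fix can be illustrated with a minimal Java sketch. This is not the Impala planner code: the class ScanMaterializationCostSketch, the RuntimeFilterEstimate type, and its fields are hypothetical names introduced only for illustration, and multiplying per-filter selectivities as if they were independent is a simplifying assumption. Only the estBytes formula is taken from the issue text. The sketch computes the input cardinality from the filters that eliminate data before materialization and ignores row-level filters.

```java
import java.util.List;

// Illustrative sketch only; not the actual Impala frontend implementation.
final class ScanMaterializationCostSketch {

  /** Hypothetical summary of one runtime filter's estimated effect on a scan. */
  static final class RuntimeFilterEstimate {
    final double selectivity;                 // estimated fraction of rows that survive, in (0, 1]
    final boolean appliedBeforeMaterialization; // true for filters that skip files or row groups

    RuntimeFilterEstimate(double selectivity, boolean appliedBeforeMaterialization) {
      this.selectivity = selectivity;
      this.appliedBeforeMaterialization = appliedBeforeMaterialization;
    }
  }

  /**
   * Cardinality of rows that actually get materialized: only filters applied
   * before materialization reduce it. Row-level filters are ignored here.
   * Assumption: filter selectivities are independent and can be multiplied.
   */
  static long preMaterializationCardinality(
      long unfilteredCardinality, List<RuntimeFilterEstimate> filters) {
    double cardinality = (double) unfilteredCardinality;
    for (RuntimeFilterEstimate f : filters) {
      if (f.appliedBeforeMaterialization) cardinality *= f.selectivity;
    }
    return (long) Math.ceil(cardinality);
  }

  /** Cardinality after all runtime filters, mirroring the current (too low) input. */
  static long fullyFilteredCardinality(
      long unfilteredCardinality, List<RuntimeFilterEstimate> filters) {
    double cardinality = (double) unfilteredCardinality;
    for (RuntimeFilterEstimate f : filters) cardinality *= f.selectivity;
    return (long) Math.ceil(cardinality);
  }

  /** The byte estimate from the issue description. */
  static long estimateMaterializedBytes(double avgRowDataSize, long inputCardinality) {
    return (long) Math.ceil(avgRowDataSize * (double) inputCardinality);
  }
}
```

With 1,000,000 unfiltered rows, an average row size of 100 bytes, one partition filter with selectivity 0.5, and one row-level filter with selectivity 0.1, the fully filtered cardinality gives an estimate of 5,000,000 bytes, while the pre-materialization cardinality gives 50,000,000 bytes. The tenfold gap is the underestimate described above.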