[
https://issues.apache.org/jira/browse/IMPALA-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944424#comment-17944424
]
Riza Suminto commented on IMPALA-13721:
---------------------------------------
getFilteredInputCardinality() (filteredInputCardinality_) is truly the input
estimate after evaluating partition filter ONLY. If there is indeed partition
filter in-play, then we should see "est-scan-range" somewhere in query plan
tree. If there is none, then ScanNode will still assume a full table scan
(inputCardinality == total num rows in that table).
If there is discrepancy between output cardinality estimave vs real output
cardinality of ScanNode, it might be due to row-level runtime filter is
ineffective and disabled early. In that case, we should tune
RUNTIME_FILTER_CARDINALITY_REDUCTION_SCALE option to be more conservative. It
is currently default to 1.0 (assume all filters are effective). Maybe 0.5 makes
more sense.
> CPU costing for scan materialization is using wrong value for input
> cardinality
> -------------------------------------------------------------------------------
>
> Key: IMPALA-13721
> URL: https://issues.apache.org/jira/browse/IMPALA-13721
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: David Rorke
> Assignee: Riza Suminto
> Priority: Major
>
> When COMPUTE_PROCESSING_COST is enabled, the materialization cost for the
> HDFS scan node is computed based on an estimate of bytes materialized
> calculated as:
> {noformat}
> estBytes = (long) Math.ceil(avgRowDataSize * (double) inputCardinality)
> {noformat}
> where inputCardinality is currently the filtered input cardinality (after
> accounting for runtime filters) returned by getFilteredInputCardinality().
> This is the correct approach when all runtime filters are "partition" filters
> that skip reading entire files and row groups. But if some or all of the
> runtime filters are "row level" filters that are applied after the rows are
> materialized, then getFilteredInputCardinality() reflects the cardinality
> after this row level filtering and so using it will underestimate the
> materialization cost.
> We should use an input cardinality here that reflects the estimated
> cardinality after applying all runtime filters that eliminate data prior to
> materialization (and ignoring the impact of runtime filters that are applied
> after materialization).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]