[jira] [Commented] (IMPALA-13721) CPU costing for scan materialization is using wrong value for input cardinality

Riza Suminto (Jira) Mon, 14 Apr 2025 10:34:19 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944424#comment-17944424
 ]


Riza Suminto commented on IMPALA-13721:
---------------------------------------

getFilteredInputCardinality() (filteredInputCardinality_) really is the input 
estimate after evaluating partition filters ONLY. If a partition filter is 
indeed in play, we should see "est-scan-range" somewhere in the query plan 
tree. If there is none, the ScanNode will still assume a full table scan 
(inputCardinality == total number of rows in that table).
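
To make the effect on the cost estimate concrete, here is a minimal sketch that 
applies the estBytes formula quoted in the issue description to both cases. The 
class name, the method name, and all numbers are hypothetical and chosen only 
for illustration; this is not the actual ScanNode code.

```java
public class ScanMaterializationSketch {
  // Same formula as the one quoted in the issue description.
  static long estimateMaterializedBytes(double avgRowDataSize, long inputCardinality) {
    return (long) Math.ceil(avgRowDataSize * (double) inputCardinality);
  }

  public static void main(String[] args) {
    double avgRowDataSize = 64.0;         // hypothetical average row size in bytes
    long totalRows = 1_000_000L;          // hypothetical: no partition filter, full table scan
    long afterPartitionFilter = 250_000L; // hypothetical: partition filter prunes 75% of rows

    System.out.println("full scan:        "
        + estimateMaterializedBytes(avgRowDataSize, totalRows));
    System.out.println("partition filter: "
        + estimateMaterializedBytes(avgRowDataSize, afterPartitionFilter));
  }
}
```

With these made-up inputs, the estimate only shrinks when a partition filter 
reduces the input cardinality; otherwise it stays at the full-table value.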

If there is a discrepancy between the estimated output cardinality and the real 
output cardinality of the ScanNode, it might be because a row-level runtime 
filter is ineffective and gets disabled early. In that case, we should tune the 
RUNTIME_FILTER_CARDINALITY_REDUCTION_SCALE option to be more conservative. It 
currently defaults to 1.0 (assume all filters are effective). Maybe 0.5 makes 
more sense.
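
As a rough illustration of what a more conservative scale would do, the sketch 
below assumes the option linearly scales the cardinality reduction attributed to 
runtime filters. That interpolation is an assumption made for this example and 
may not match the planner's actual formula; the class name, method name, and 
numbers are hypothetical.

```java
public class FilterScaleSketch {
  // ASSUMPTION: the scale linearly interpolates between "no reduction" (0.0)
  // and "full estimated reduction" (1.0). The real planner logic may differ.
  static long scaledCardinality(long inputCardinality, long fullyFilteredCardinality,
      double scale) {
    if (scale < 0.0 || scale > 1.0) {
      throw new IllegalArgumentException("scale must be in [0.0, 1.0]");
    }
    long reduction = inputCardinality - fullyFilteredCardinality;
    return inputCardinality - Math.round(reduction * scale);
  }

  public static void main(String[] args) {
    long inputCardinality = 1_000_000L; // hypothetical rows entering the scan
    long fullyFiltered = 100_000L;      // hypothetical estimate if every filter is effective

    for (double scale : new double[] {1.0, 0.5, 0.0}) {
      System.out.println("scale=" + scale + " -> "
          + scaledCardinality(inputCardinality, fullyFiltered, scale));
    }
  }
}
```

Under this assumed model, 1.0 trusts the filters completely, while 0.5 keeps 
the estimate halfway between the filtered and unfiltered cardinality.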

> CPU costing for scan materialization is using wrong value for input 
> cardinality
> -------------------------------------------------------------------------------
>
>                 Key: IMPALA-13721
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13721
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: David Rorke
>            Assignee: Riza Suminto
>            Priority: Major
>
> When COMPUTE_PROCESSING_COST is enabled, the materialization cost for the 
> HDFS scan node is computed based on an estimate of bytes materialized 
> calculated as:
> {noformat}
> estBytes = (long) Math.ceil(avgRowDataSize * (double) inputCardinality)
> {noformat}
> where inputCardinality is currently the filtered input cardinality (after 
> accounting for runtime filters) returned by getFilteredInputCardinality().
> This is the correct approach when all runtime filters are "partition" filters 
> that skip reading entire files and row groups. But if some or all of the 
> runtime filters are "row level" filters that are applied after the rows are 
> materialized, then getFilteredInputCardinality() reflects the cardinality 
> after this row level filtering and so using it will underestimate the 
> materialization cost.
> We should use an input cardinality here that reflects the estimated 
> cardinality after applying all runtime filters that eliminate data prior to 
> materialization (and ignoring the impact of runtime filters that are applied 
> after materialization).



