[jira] [Created] (IMPALA-13721) CPU costing for scan materialization is using wrong value for input cardinality

David Rorke (Jira) Fri, 31 Jan 2025 16:56:25 -0800

David Rorke created IMPALA-13721:
------------------------------------

             Summary: CPU costing for scan materialization is using wrong value 
for input cardinality
                 Key: IMPALA-13721
                 URL: https://issues.apache.org/jira/browse/IMPALA-13721
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
            Reporter: David Rorke
            Assignee: David Rorke



When COMPUTE_PROCESSING_COST is enabled, the materialization cost for the HDFS 
scan node is computed based on an estimate of bytes materialized calculated as:
{noformat}
estBytes = (long) Math.ceil(avgRowDataSize * (double) inputCardinality)
{noformat}
where inputCardinality is currently the filtered input cardinality (after 
accounting for runtime filters) returned by getFilteredInputCardinality().

This is the correct approach when all runtime filters are "partition" filters 
that skip reading entire files and row groups. But if some or all of the 
runtime filters are "row level" filters that are applied after the rows are 
materialized, then getFilteredInputCardinality() reflects the cardinality after 
this row level filtering and so using it will underestimate the materialization 
cost.

We should use an input cardinality here that reflects the estimated cardinality 
after applying all runtime filters that eliminate data prior to materialization 
(and ignoring the impact of runtime filters that are applied after 
materialization).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (IMPALA-13721) CPU costing for scan materialization is using wrong value for input cardinality

Reply via email to