[
https://issues.apache.org/jira/browse/IMPALA-12395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Abhishek Rawat updated IMPALA-12395:
------------------------------------
Priority: Critical (was: Major)
> Planner overestimates scan cardinality for queries using count star
> optimization
> --------------------------------------------------------------------------------
>
> Key: IMPALA-12395
> URL: https://issues.apache.org/jira/browse/IMPALA-12395
> Project: IMPALA
> Issue Type: Bug
> Components: fe
> Reporter: David Rorke
> Assignee: Riza Suminto
> Priority: Critical
>
> The scan cardinality estimate for count(*) queries doesn't account for the
> fact that the count(*) optimization reads only file metadata, not the actual
> column data.
> Scan node for a count(*) query on Parquet store_sales (2.71K actual rows
> versus an estimate of 8.64B):
>
> {noformat}
> Operator    #Hosts  #Inst  Avg Time  Max Time  #Rows  Est. #Rows  Peak Mem   Est. Peak Mem  Detail
> -----------------------------------------------------------------------------------------------------------------------------------------
> 00:SCAN S3  6       72     8s131ms   8s496ms   2.71K  8.64B       128.00 KB  88.00 MB       tpcds_3000_string_parquet_managed.store_sales
> {noformat}
>
> This is a problem for all file/table formats that implement count(*)
> optimizations (Parquet, and probably also ORC and Iceberg).
> It is more serious than it was in the past: since IMPALA-12091, executor
> group assignment relies on scan cardinality estimates, so count(*) queries
> are likely to be assigned to a larger executor group than needed.
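
The mismatch described in the report can be illustrated with a minimal sketch. This is Python for illustration only, not Impala's frontend code. It assumes the optimized scan emits roughly one row per metadata unit (a file or row group), which is what the 2.71K actual rows in the profile suggests but the report does not state outright. The function and parameter names are hypothetical, and the inputs are the rounded figures from the profile.

```python
def estimate_scan_cardinality(table_num_rows, num_metadata_units, count_star_optimized):
    """Hypothetical estimator, not Impala's planner logic.

    table_num_rows: row count from table statistics.
    num_metadata_units: assumed number of files or row groups the scan touches.
    count_star_optimized: whether the count(*) optimization applies.
    """
    if count_star_optimized:
        # Assumption: the scan returns about one row per metadata unit,
        # so the estimate should not exceed that count.
        return min(table_num_rows, num_metadata_units)
    return table_num_rows


# Rounded figures taken from the profile in the report (8.64B and 2.71K).
current = estimate_scan_cardinality(8_640_000_000, 2_710, count_star_optimized=False)
adjusted = estimate_scan_cardinality(8_640_000_000, 2_710, count_star_optimized=True)
print(current, adjusted)
```

An estimate closer to the second value would presumably keep such queries from being routed to a larger executor group than they need.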