David Rorke created IMPALA-12395:
------------------------------------
Summary: Planner overestimates scan cardinality for queries using count star optimization
Key: IMPALA-12395
URL: https://issues.apache.org/jira/browse/IMPALA-12395
Project: IMPALA
Issue Type: Bug
Components: fe
Reporter: David Rorke
The scan cardinality estimate for count(*) queries doesn't account for the fact
that the count(*) optimization only scans metadata and not the actual columns.
Scan node from the query summary for a count(*) query on Parquet store_sales, where the scan returned 2.71K rows against an estimate of 8.64B:
{noformat}
Operator    #Hosts  #Inst  Avg Time  Max Time  #Rows  Est. #Rows  Peak Mem   Est. Peak Mem  Detail
-----------------------------------------------------------------------------------------------------------------------------------------
00:SCAN S3  6       72     8s131ms   8s496ms   2.71K  8.64B       128.00 KB  88.00 MB       tpcds_3000_string_parquet_managed.store_sales
{noformat}
This affects every file/table format that implements a count(*)
optimization (Parquet, and probably also ORC and Iceberg).
This problem is more serious than it was in the past because, with IMPALA-12091,
we now rely on scan cardinality estimates for executor group assignments, so
count(*) queries are likely to be assigned to a larger executor group than
needed.
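
To illustrate the size of the gap, here is a minimal sketch of the adjustment this issue asks for. It is not Impala planner code: the function name and parameters are hypothetical, and it assumes the optimized scan emits roughly one pre-aggregated count row per metadata unit (file or row group; which unit applies is an assumption). The row counts are the rounded values from the profile above.

```python
# Hypothetical sketch only; none of these names are Impala planner APIs.

def estimate_scan_cardinality(table_num_rows, num_metadata_units,
                              count_star_optimized):
    """Estimate how many rows a scan node produces.

    table_num_rows: row count from table statistics.
    num_metadata_units: number of files or row groups that each yield one
        pre-aggregated count row (assumed unit, not confirmed by the source).
    count_star_optimized: True if the scan reads only metadata.
    """
    if count_star_optimized:
        # The scan cannot produce more rows than the table holds.
        return min(table_num_rows, num_metadata_units)
    return table_num_rows


# Rounded figures from the profile: Est. #Rows 8.64B, actual #Rows 2.71K.
est_rows = 8_640_000_000
actual_rows = 2_710

current = estimate_scan_cardinality(est_rows, actual_rows,
                                    count_star_optimized=False)
adjusted = estimate_scan_cardinality(est_rows, actual_rows,
                                     count_star_optimized=True)

# Ratio between today's estimate and the adjusted one.
overestimate_factor = current / adjusted
```

With these figures the current estimate is more than three million times the number of rows the scan actually returned, which is the gap that would feed into executor group assignment.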