David Rorke created IMPALA-12395:
------------------------------------

             Summary: Planner overestimates scan cardinality for queries using count star optimization
                 Key: IMPALA-12395
                 URL: https://issues.apache.org/jira/browse/IMPALA-12395
             Project: IMPALA
          Issue Type: Bug
          Components: fe
            Reporter: David Rorke


The scan cardinality estimate for count(*) queries doesn't account for the fact 
that the count(*) optimization reads only file metadata and not the actual 
column data, so the scan returns far fewer rows than the table contains.



Scan for a count(*) query on Parquet store_sales:

 
{noformat}
Operator    #Hosts  #Inst  Avg Time  Max Time  #Rows  Est. #Rows  Peak Mem   Est. Peak Mem  Detail
-----------------------------------------------------------------------------------------------------------------------------------------
00:SCAN S3  6       72     8s131ms   8s496ms   2.71K  8.64B       128.00 KB  88.00 MB       tpcds_3000_string_parquet_managed.store_sales
{noformat}
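
To put the gap in the summary above in numbers, here is a minimal Python sketch. The parse_count helper and the suffix table are illustrative assumptions, not Impala code; they only convert the abbreviated values from the summary (2.71K and 8.64B) into plain numbers.

```python
# Illustrative only: quantify the gap between estimated and actual scan rows
# using the two values from the summary above.

# Assumed meaning of the suffixes in the summary output (K = 1e3, M = 1e6, B = 1e9).
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_count(text: str) -> float:
    """Convert an abbreviated count such as '2.71K' into a number."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


actual_rows = parse_count("2.71K")     # rows the scan actually returned
estimated_rows = parse_count("8.64B")  # rows the planner expected

overestimate_factor = estimated_rows / actual_rows
print(f"estimate is about {overestimate_factor:.2e}x the actual row count")
```

With these inputs the estimate is roughly three million times the number of rows the scan actually produced.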
 

This is a problem with all file/table formats that implement count(*) 
optimizations (Parquet and also probably ORC and Iceberg).

This problem is more serious than it was in the past because, with IMPALA-12091, 
we now rely on scan cardinality estimates for executor group assignment, so 
count(*) queries are likely to be assigned to a larger executor group than 
needed.
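
The effect on executor group assignment can be sketched with a toy model. Everything here is hypothetical: the group names, the per-group row limits, and the pick_group rule are assumptions made for illustration and do not reflect how IMPALA-12091 or Impala's admission control actually chooses a group. The only point is that a size-based choice driven by the inflated estimate lands on a larger group than the same choice driven by the observed row count.

```python
# Toy model, not Impala's algorithm: pick the smallest executor group whose
# (hypothetical) row limit can accommodate the scan cardinality.

# Hypothetical groups, ordered from smallest to largest: (name, max scan rows).
GROUPS = [
    ("small", 1_000_000),
    ("medium", 1_000_000_000),
    ("large", float("inf")),
]


def pick_group(scan_rows: float) -> str:
    """Return the first group whose limit is at least scan_rows."""
    for name, max_rows in GROUPS:
        if scan_rows <= max_rows:
            return name
    return GROUPS[-1][0]  # guard; not reached while the last limit is infinite


estimated_rows = 8.64e9  # Est. #Rows from the summary
actual_rows = 2.71e3     # #Rows from the summary

print("group chosen from the estimate:", pick_group(estimated_rows))
print("group the actual rows would need:", pick_group(actual_rows))
```

Under these made-up limits the estimate selects the largest group, while the observed row count would fit in the smallest one.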



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
