[ https://issues.apache.org/jira/browse/IMPALA-12395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759756#comment-17759756 ]
ASF subversion and git services commented on IMPALA-12395:
----------------------------------------------------------
Commit 0c8fc997ef7df09b675180a7baa1482852d60b11 in impala's branch
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=0c8fc997e ]
IMPALA-12395: Override scan cardinality for optimized count star
The cardinality estimate in HdfsScanNode.java for count queries does not
account for the fact that the count optimization only scans metadata and
not the actual columns. An optimized count star scan returns only one row
per Parquet row group.
This patch overrides the scan cardinality with the total number of files,
which is the closest available estimate of the number of row groups. A
similar override already exists in IcebergScanNode.java.
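The override described above can be sketched roughly as follows. This is an
illustrative Python model only, not the actual Java code in HdfsScanNode.java;
the names ScanStats and estimate_scan_cardinality and the sample numbers are
hypothetical and chosen just for this example.

```python
from dataclasses import dataclass


@dataclass
class ScanStats:
    """Hypothetical container for the inputs a planner might use."""
    num_files: int              # total files the scan will touch
    num_rows: int               # row count from table/partition stats
    count_star_optimized: bool  # True when only metadata is read


def estimate_scan_cardinality(stats: ScanStats) -> int:
    """Return the estimated number of rows produced by the scan node.

    Assumption: an optimized count star scan emits one row per row group,
    and the file count is used as a stand-in for the row group count.
    """
    if stats.count_star_optimized:
        return stats.num_files
    return stats.num_rows


if __name__ == "__main__":
    # Made-up figures: 2,000 files holding 8.64 billion rows.
    optimized = ScanStats(num_files=2_000, num_rows=8_640_000_000,
                          count_star_optimized=True)
    regular = ScanStats(num_files=2_000, num_rows=8_640_000_000,
                        count_star_optimized=False)
    print(estimate_scan_cardinality(optimized))
    print(estimate_scan_cardinality(regular))
```

With the override, the estimate for the optimized scan follows the file count
rather than the table's row count, which keeps it in the same order of
magnitude as the rows the scan actually returns.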
Testing:
- Add count query test cases to test_query_cpu_count_divisor_default
- Pass core tests
Change-Id: Id5ce967657208057d50bd80adadac29ebb51cbc5
Reviewed-on: http://gerrit.cloudera.org:8080/20406
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Planner overestimates scan cardinality for queries using count star optimization
> --------------------------------------------------------------------------------
>
> Key: IMPALA-12395
> URL: https://issues.apache.org/jira/browse/IMPALA-12395
> Project: IMPALA
> Issue Type: Bug
> Components: fe
> Reporter: David Rorke
> Assignee: Riza Suminto
> Priority: Critical
>
> The scan cardinality estimate for count(*) queries doesn't account for the
> fact that the count(*) optimization only scans metadata and not the actual
> columns.
> Scan for a count(*) query on Parquet store_sales:
>
> {noformat}
> Operator    #Hosts  #Inst  Avg Time  Max Time  #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
> -----------------------------------------------------------------------------------------------------------------------------------------
> 00:SCAN S3       6     72   8s131ms   8s496ms  2.71K       8.64B  128.00 KB       88.00 MB  tpcds_3000_string_parquet_managed.store_sales
> {noformat}
>
> This is a problem with all file/table formats that implement count(*)
> optimizations (Parquet and also probably ORC and Iceberg).
> This problem is more serious than it was in the past because with
> IMPALA-12091 we now rely on scan cardinality estimates for executor group
> assignments so count(*) queries are likely to get assigned to a larger
> executor group than needed.