[
https://issues.apache.org/jira/browse/IMPALA-3475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Armstrong resolved IMPALA-3475.
-----------------------------------
Resolution: Later
We have added various optimisations since this JIRA was filed, and count(*) is
now fast in practice on Parquet, thanks to the optimisation that reads the row
count from the file footer and to the data cache. Extending partition key scans
to cover count(*) doesn't seem worth the risk of incorrect results.
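
To illustrate the footer-based idea mentioned above, here is a minimal Python sketch of a per-partition count(*) computed purely from per-file row counts. It is not Impala's implementation: the function name count_star_by_partition, the (partition key, row count) tuples, and the sample values are all hypothetical, and the row counts are assumed to have already been read from each file's footer.

```python
from collections import defaultdict


def count_star_by_partition(file_metadata):
    """Sum footer row counts per partition key.

    file_metadata is an iterable of (partition_key, num_rows) pairs, one per
    data file. Both the shape and the name are assumptions for illustration.
    """
    counts = defaultdict(int)
    for partition_key, num_rows in file_metadata:
        counts[partition_key] += num_rows
    return dict(counts)


# Made-up sample: two files in one partition, one file in another.
files = [(2450816, 100), (2450816, 50), (2450817, 75)]
result = count_star_by_partition(files)
print(result)
```

The result is only as correct as the row counts it is given, which is one way to read the concern about incorrect results: any stale or wrong count flows straight into the query answer.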
> Extend partition key scans to support count(*)
> ----------------------------------------------
>
> Key: IMPALA-3475
> URL: https://issues.apache.org/jira/browse/IMPALA-3475
> Project: IMPALA
> Issue Type: New Feature
> Components: Frontend
> Affects Versions: Impala 2.5.0
> Reporter: Mostafa Mokhtar
> Priority: Minor
>
> Queries like the one below should be answered entirely from metadata when
> store_sales is partitioned on ss_sold_date_sk:
> {code}
> select ss_sold_date_sk , count(*) from store_sales group by ss_sold_date_sk;
> {code}
> {code}
> +----------------------------------------------------------+
> | Explain String                                           |
> +----------------------------------------------------------+
> | Estimated Per-Host Requirements: Memory=20.00MB VCores=2 |
> |                                                          |
> | 04:EXCHANGE [UNPARTITIONED]                              |
> | |                                                        |
> | 03:AGGREGATE [FINALIZE]                                  |
> | |  output: count:merge(*)                                |
> | |  group by: ss_sold_date_sk                             |
> | |                                                        |
> | 02:EXCHANGE [HASH(ss_sold_date_sk)]                      |
> | |                                                        |
> | 01:AGGREGATE [STREAMING]                                 |
> | |  output: count(*)                                      |
> | |  group by: ss_sold_date_sk                             |
> | |                                                        |
> | 00:SCAN HDFS [tpcds_1000_parquet.store_sales]            |
> |    partitions=1824/1824 files=1824 size=189.24GB         |
> +----------------------------------------------------------+
> {code}