[ 
https://issues.apache.org/jira/browse/IMPALA-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17679217#comment-17679217
 ] 

Zoltán Borók-Nagy commented on IMPALA-11701:
--------------------------------------------

The problem is that we are pushing down predicates to the scan node even though 
they won't filter out any more rows (since the predicates are on 
IDENTITY-partition columns, so after partition pruning all rows are needed).

We should either drop the conjuncts and do the count(*) optimization in the 
scanners, or answer this query from Iceberg metadata, similarly to what we do 
with plain SELECT count(*) statements that don't have any predicates: 
IMPALA-11279.
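
To illustrate the metadata-based option, here is a minimal sketch in Python. 
It is not Impala or Iceberg code: DataFile, count_from_metadata and the sample 
files are hypothetical stand-ins for manifest entries. It assumes every 
predicate is on an IDENTITY-partition column and that the table has no delete 
files, so the per-file record counts can simply be summed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one data-file entry in an Iceberg manifest."""
    path: str
    partition: Dict[str, str]   # identity-partition values of the file
    record_count: int           # row count recorded in the metadata


def count_from_metadata(files: Iterable[DataFile],
                        matches: Callable[[Dict[str, str]], bool]) -> int:
    """Sum the record counts of files whose partition values satisfy the predicate."""
    return sum(f.record_count for f in files if matches(f.partition))


# Made-up sample metadata; no data file is opened.
files = [
    DataFile("f1.parquet", {"datekey": "20221001", "event": "xxx"}, 100),
    DataFile("f2.parquet", {"datekey": "20221001", "event": "yyy"}, 40),
    DataFile("f3.parquet", {"datekey": "20221002", "event": "xxx"}, 7),
    DataFile("f4.parquet", {"datekey": "20221001", "event": "xxx"}, 25),
]

total = count_from_metadata(
    files, lambda p: p["datekey"] == "20221001" and p["event"] == "xxx")
print(total)  # 125
```

The predicate is evaluated once per file on its partition values, never per 
row, which is why this only works for predicates that partition pruning fully 
answers.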

The difficulty is that the predicates can be arbitrarily complex, and we need 
to be able to decide whether any conjunct should be evaluated by the executors.
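
One conservative way to make that decision could look like the sketch below. 
It is only an illustration in Python, not the planner's actual logic: Conjunct 
and split_conjuncts are hypothetical names, and the column user_id is made up. 
It assumes a conjunct is fully answered by partition pruning only when it is 
deterministic and references nothing but IDENTITY-partition columns, and it 
ignores partition evolution.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Conjunct:
    """Hypothetical, simplified view of one conjunct of the WHERE clause."""
    sql: str
    columns: FrozenSet[str]      # columns referenced by the expression
    deterministic: bool = True   # False for e.g. rand()


def split_conjuncts(conjuncts: Iterable[Conjunct],
                    identity_partition_cols: FrozenSet[str]
                    ) -> Tuple[List[Conjunct], List[Conjunct]]:
    """Return (answered_by_pruning, needs_executor_evaluation)."""
    pruned, residual = [], []
    for c in conjuncts:
        # Conservative: anything we are not sure about stays with the executors.
        if c.deterministic and c.columns and c.columns <= identity_partition_cols:
            pruned.append(c)
        else:
            residual.append(c)
    return pruned, residual


partition_cols = frozenset({"datekey", "event"})
conjuncts = [
    Conjunct("datekey = '20221001'", frozenset({"datekey"})),
    Conjunct("event = 'xxx'", frozenset({"event"})),
    Conjunct("user_id > 10", frozenset({"user_id"})),           # made-up column
    Conjunct("rand() < 0.5", frozenset(), deterministic=False),
]

pruned, residual = split_conjuncts(conjuncts, partition_cols)
print([c.sql for c in pruned])
print([c.sql for c in residual])
```

If the residual list is empty, no conjunct needs to reach the scanners and 
either optimization above applies. Otherwise the query has to be executed as 
usual.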

> Slow query problem about querying iceberg table by impala
> ---------------------------------------------------------
>
>                 Key: IMPALA-11701
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11701
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Qizhu Chan
>            Priority: Critical
>              Labels: impala-iceberg
>         Attachments: image-2022-11-03-17-37-14-712.png, 
> profile_cf446a1ab3a5e852_1b1005de00000000.txt
>
>
> I use Impala to query an Iceberg table, but the query performance is poor: 
> compared with querying a Hive-format table holding the same data, the query 
> takes dozens of times longer.
> The SQL statement is a very simple aggregate query, like:
> select count(*)  from `db_name`.tbl_name where datekey='20221001' and 
> event='xxx'
> ('datekey' and 'event' are the partition fields)
> I expected that Impala would fetch Iceberg's metadata stats and 
> return the result very quickly, but it doesn't.
> The Iceberg table uses a catalog of the hadoop type, and Impala accesses it 
> through an external table created in Hive. By the way, snapshot expiration 
> and data compaction run on the Iceberg table daily, so there 
> should be no small-file problem.
> I found this warning using the explain statement:
> {code:java}
> | WARNING: The following tables are missing relevant table and/or column statistics. |
> | iceberg.gamebox_event_iceberg
> {code}
> Query: SHOW TABLE STATS `iceberg`.gamebox_event_iceberg
> +-------+--------+--------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
> | #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                        |
> +-------+--------+--------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
> | 0     | 590509 | 1.91TB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs:///hive/warehouse/iceberg/gamebox_event_iceberg            |
> +-------+--------+--------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
> It seems that Impala is not syncing the Iceberg table's table and column 
> statistics. I'm not sure whether this has anything to do with the slow query.
> As shown in the screenshot, I think most of the query time is spent on 
> planning and on the execution backends, but I don't know the reason for 
> either.
> The complete profile for this query is attached.
> How can I speed up the query? Can someone please help?
>  !image-2022-11-03-17-37-14-712.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

