[
https://issues.apache.org/jira/browse/IMPALA-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715661#comment-17715661
]
Gabor Kaszab commented on IMPALA-11701:
---------------------------------------
Part1 has been merged. It skips pushing down predicates to Impala Scan nodes
when Iceberg has already pre-filtered the files using these predicates and
applying them again would not filter out any more rows. It is an "all or
nothing" approach: either all predicates are skipped from being pushed down or
none of them are.
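To make the "all or nothing" rule concrete, here is a minimal sketch in Python
(not Impala's actual planner code). It assumes the check is made on the
per-file residual that Iceberg reports after file planning. PlannedFile,
ALWAYS_TRUE and predicates_to_push_down are hypothetical names used only for
illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical marker for "nothing left to evaluate on this file's rows".
ALWAYS_TRUE = "true"


@dataclass(frozen=True)
class PlannedFile:
    path: str
    residual: str  # part of the filter that still has to be checked per row


def predicates_to_push_down(predicates: List[str],
                            planned_files: List[PlannedFile]) -> List[str]:
    """Return the predicates the scan nodes should still evaluate.

    All or nothing: if every planned file has an always-true residual, the
    predicates cannot remove any more rows, so none are pushed down.
    Otherwise all of them are pushed down, unchanged.
    """
    if all(f.residual == ALWAYS_TRUE for f in planned_files):
        return []
    return list(predicates)
```

With the query from this ticket, both predicates are on partition fields, so
in this model every file that survives planning would carry an always-true
residual and the function would return an empty list.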
Further potential improvements:
Part2:
[IMPALA-11802|https://issues.apache.org/jira/browse/IMPALA-11802] introduced a
query rewrite for count(*) queries. The idea is that after Part1 we can check
whether we eliminated the predicates from being pushed down to the scanners and
whether the query is a count(*); if so, we can apply the optimization
introduced in IMPALA-11802. One difficulty, though, is that the count(*)
optimization happens in the analysis phase by rewriting the query, while the
predicate pushdown optimization happens in the planner after calling
planFiles() on an Iceberg table. So if we want to apply the same optimization
after deciding on predicate pushdown, we have to re-implement the query
rewrite code in the planner.
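A rough sketch of what such a planner-side check could look like, in Python
rather than Impala's own code. It assumes the count(*) optimization amounts to
answering from per-file row counts stored in table metadata, which this
comment does not spell out, and it ignores delete files. DataFile,
record_count and try_count_star_from_metadata are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataFile:
    path: str
    record_count: int  # row count taken from table metadata (assumed available)


def try_count_star_from_metadata(is_plain_count_star: bool,
                                 pushed_down_predicates: List[str],
                                 files: List[DataFile]) -> Optional[int]:
    """Return count(*) from metadata, or None if the files must be scanned.

    The shortcut is only valid when the query is a plain count(*) and the
    pushdown step decided that no predicate is left for the scanners.
    """
    if not is_plain_count_star or pushed_down_predicates:
        return None
    return sum(f.record_count for f in files)
```

The point of the sketch is the ordering problem described above: the inputs
(pushed_down_predicates, files) only exist after file planning, so the decision
cannot be made where the analysis-time rewrite currently lives.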
Part3:
Be able to push down a subset of the predicates to Impala Scan nodes. For this
we need to map Iceberg predicates (returned from residual()) to Impala
predicates. This might not be trivial, as Iceberg sometimes doesn't return
exactly the same predicates it received through planFiles(). E.g. the object
ID might be different, making the mapping more difficult.
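One possible way around the object ID problem is to match predicates by
structure instead of by identity. The Python sketch below is only an
illustration of that idea under simplifying assumptions: every predicate is a
plain (column, operator, literal) comparison, and ImpalaPred and
select_residual_predicates are hypothetical names, not existing Impala or
Iceberg APIs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]  # (column, operator, literal)


@dataclass(frozen=True)
class ImpalaPred:
    column: str
    op: str
    literal: str


def select_residual_predicates(impala_preds: List[ImpalaPred],
                               residual_keys: List[Key]) -> List[ImpalaPred]:
    """Pick the Impala predicates that correspond to the Iceberg residual.

    Matching uses the structural key, so it does not matter whether Iceberg
    returned the same objects it was given. If any residual predicate cannot
    be mapped, fall back to pushing down everything, which is always safe.
    """
    by_key: Dict[Key, ImpalaPred] = {
        (p.column, p.op, p.literal): p for p in impala_preds
    }
    matched = []
    for key in residual_keys:
        if key not in by_key:
            return list(impala_preds)
        matched.append(by_key[key])
    return matched
```

The fallback keeps the behavior correct even when the mapping fails: pushing
down too many predicates only costs time, while pushing down too few would
return wrong rows.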
I propose creating an epic Jira to hold these 3 steps and splitting this
ticket up, as it seems too broad to do everything in one go.
> Slow query problem about querying iceberg table by impala
> ---------------------------------------------------------
>
> Key: IMPALA-11701
> URL: https://issues.apache.org/jira/browse/IMPALA-11701
> Project: IMPALA
> Issue Type: Bug
> Reporter: Qizhu Chan
> Assignee: Gabor Kaszab
> Priority: Critical
> Labels: impala-iceberg
> Attachments: image-2022-11-03-17-37-14-712.png,
> profile_cf446a1ab3a5e852_1b1005de00000000.txt
>
>
> I use Impala to query an Iceberg table, but the query performance is poor:
> compared with querying a Hive-format table holding the same data, the query
> takes dozens of times longer.
> The SQL statement is a very simple aggregate query, like:
> select count(*) from `db_name`.tbl_name where datekey='20221001' and
> event='xxx'
> ('datekey' and 'event' are the partition fields)
> I expected that Impala could fetch Iceberg's metadata stats and return
> results very quickly, but it doesn't.
> The Iceberg table uses a Hadoop-type catalog, and Impala accesses it through
> an external table created in Hive. Snapshot expiration and data compaction
> run on the Iceberg table daily, so there should be no small-file problem.
> I found this warning using the explain statement:
> I found this warning using the explain statement:
> {code:java}
> | WARNING: The following tables are missing relevant table and/or column statistics. |
> | iceberg.gamebox_event_iceberg
> {code}
> Query: SHOW TABLE STATS `iceberg`.gamebox_event_iceberg
> +-------+--------+--------+--------------+-------------------+---------+-------------------+------------------------------------------------------+
> | #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                             |
> +-------+--------+--------+--------------+-------------------+---------+-------------------+------------------------------------------------------+
> | 0     | 590509 | 1.91TB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs:///hive/warehouse/iceberg/gamebox_event_iceberg |
> +-------+--------+--------+--------------+-------------------+---------+-------------------+------------------------------------------------------+
> It seems that Impala is not syncing Iceberg's table and column statistics.
> I'm not sure whether this has anything to do with the slow queries.
> As shown in the screenshot, I think the query time is mainly spent on
> planning and on execution backends, but I don't know why these two take so
> long.
> The complete profile for this query is attached.
> How do I speed up the query? Can someone please help with my question?
> !image-2022-11-03-17-37-14-712.png!