[Impala-ASF-CR] IMPALA-11279: Optimize plain count(*) queries for Iceberg tables

Csaba Ringhofer (Code Review) Thu, 02 Jun 2022 10:40:22 -0700

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18574 )


Change subject: IMPALA-11279: Optimize plain count(*) queries for Iceberg tables
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18574/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18574/3//COMMIT_MSG@17
PS3, Line 17:  - SelectStmt does not have WHERE clause
> Partitioning might use partition transforms, or maybe the table went throug
Ok, I agree with not dealing with this in the current review.


http://gerrit.cloudera.org:8080/#/c/18574/3//COMMIT_MSG@22
PS3, Line 22:  - SelectList contains only 'count(*)'
> Good suggestions, thanks for your codereview, I would like to create a foll
tl;dr: the current solution is ok for me if you want to push this optimization 
quickly, but I would prefer a different approach

While I agree with focusing on the simple select count(*) in this review, I 
think that it would be hard to extend the current implementation to be more 
generic. Specifically I think that it would be much better to create a plan in 
SingleNodePlanner that returns a constant instead of adding a special case in 
ClientRequestState::ExecQueryOrDmlRequest().

Some hints on how this could be done:
- we already have a special case where we return a const plan based on metadata 
- the Node that can be used in Impala to return a const is Union. See this 
block:
https://github.com/apache/impala/blob/c41e6941cafad453819b57e78b2083a3e64496e0/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1621
What happens there is that the function is expected to return a HdfsScanNode, 
but if the result can be computed from metadata, we get the result and create a 
UnionNode that returns this result.
- ScanNodes for Iceberg tables are created in this block:
https://github.com/apache/impala/blob/c41e6941cafad453819b57e78b2083a3e64496e0/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1873

There is also a function that can help to decide whether count star 
optimization is applicable (so there is no JOIN, WHERE, etc ... in the query):
https://github.com/apache/impala/blob/c41e6941cafad453819b57e78b2083a3e64496e0/fe/src/main/java/org/apache/impala/planner/ScanNode.java#L156

Where this differs from what is needed in this patch is that it also allows 
queries that use partitioning columns (e.g. in WHERE), which is expressed by || 
|| desc_.hasClusteringColsOnly()

The tricky part in the function is !desc_.hasMaterializedSlots() , which means 
the no column is used from the table in the select list or WHERE, and this is 
only possible in case of constants and count( * ).



--
To view, visit http://gerrit.cloudera.org:8080/18574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8e9c48bbba7ab2320fa80915e7001ce54f1ef6d9
Gerrit-Change-Number: 18574
Gerrit-PatchSet: 3
Gerrit-Owner: Anonymous Coward <lipeng...@sensorsdata.cn>
Gerrit-Reviewer: Anonymous Coward <lipeng...@sensorsdata.cn>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Gergely Fürnstáhl <gfurnst...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Jian Zhang <zjsar...@gmail.com>
Gerrit-Reviewer: Tamas Mate <tma...@apache.org>
Gerrit-Reviewer: Xianqing He <hexianqing...@126.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 02 Jun 2022 17:36:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11279: Optimize plain count(*) queries for Iceberg tables

Reply via email to