[ 
https://issues.apache.org/jira/browse/IMPALA-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18046716#comment-18046716
 ] 

ASF subversion and git services commented on IMPALA-13756:
----------------------------------------------------------

Commit 85d77b908b12ae3d3f48ed5d49f38fb3832edc4e in impala's branch 
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=85d77b908 ]

IMPALA-13756: Fix Iceberg V2 count(*) optimization for complex queries

We optimize plain count(*) queries on Iceberg tables the following way:

       AGGREGATE
       COUNT(*)
           |
       UNION ALL
      /        \
     /          \
    /            \
   SCAN all  ANTI JOIN
   datafiles  /      \
   without   /        \
   deletes  SCAN      SCAN
            datafiles deletes

            ||
          rewrite
            ||
            \/

  ArithmethicExpr: LHS + RHS
      /             \
     /               \
    /                 \
   record_count  AGGREGATE
   of all        COUNT(*)
   datafiles         |
   without       ANTI JOIN
   deletes      /         \
               /           \
               SCAN        SCAN
               datafiles   deletes

This optimization consists of two parts:
 1 Rewriting count(*) expression to count(*) + "record_count" (of data
   files without deletes)
 2 In IcebergScanPlanner we only need to consruct the right side of
   the original UNION ALL operator, i.e.:

            ANTI JOIN
           /         \
          /           \
         SCAN        SCAN
         datafiles   deletes

SelectStmt decides whether we can do the count(*) optimization, and if
so, does the following:

 1: SelectStmt sets 'TotalRecordsNumV2' in the analyzer, then during the
    expression rewrite phase the CountStarToConstRule rewrites the
    count(*) to count(*) + record_count
 2: SelectStmt sets "OptimizeCountStarForIcebergV2" in the query context
    then IcebergScanPlanner creates plan accordingly.

This mechanism works for simple queries, but can turn on count(*)
optimization in IcebergScanPlanner for all Iceberg V2 tables in complex
queries. Even if only one subquery enables count(*) optimization during
analysis.

With this patch the followings change:
 1: We introduce IcebergV2CountStarAccumulator which we use instead of
    the ArithmethicExpr. So after rewrite we still know if count(*)
    optimization should be enabled for the planner.
 2: Instead of using the query context, we pass the information to the
    IcebergScanPlanner via the TableRef object.

Testing
 * e2e tests

Change-Id: I1940031298eb634aa82c3d32bbbf16bce8eaf874
Reviewed-on: http://gerrit.cloudera.org:8080/23705
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Zoltan Borok-Nagy <[email protected]>


> Impala query returning wrong results
> ------------------------------------
>
>                 Key: IMPALA-13756
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13756
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 5.0.0
>            Reporter: vamshi kolanu
>            Assignee: Zoltán Borók-Nagy
>            Priority: Blocker
>              Labels: correctness
>         Attachments: repro.sql
>
>
> How to reproduce this issue.
> {code:java}
> create  table merge_exclude_columns
> STORED BY ICEBERG
>   as
> select CAST(1 AS INT) as id, 'hello' as msg, 'blue' as color
> union all
> select CAST(2 AS INT) as id, 'goodbye' as msg, 'green' as color
> union all
> select CAST(3 AS INT) as id, 'anyway' as msg, 'purple' as color;{code}
>  
> {code:java}
> create table expected_merge_exclude_columns as select * from 
> merge_exclude_columns; {code}
> This query is supposed to find the row count difference and records 
> difference between both the tables. Since both the tables are same, we expect 
> row_count_difference and num_mismatched to be 0 but currently num_mismatched 
> is being computed as 3.
> {code:java}
> with diff_count as (
>     SELECT
>         1 as id,
>         COUNT(*) as num_missing FROM (
>             (SELECT color, id, msg FROM expected_merge_exclude_columns EXCEPT
>              SELECT color, id, msg FROM merge_exclude_columns)
>              UNION ALL
>             (SELECT color, id, msg FROM merge_exclude_columns EXCEPT
>              SELECT color, id, msg FROM expected_merge_exclude_columns)
>         ) as a
> ), table_a as (
>     SELECT COUNT(*) as num_rows FROM expected_merge_exclude_columns
> ), table_b as (
>     SELECT COUNT(*) as num_rows FROM merge_exclude_columns
> ), row_count_diff as (
>     select
>         1 as id,
>         table_a.num_rows - table_b.num_rows as difference
>     from table_a, table_b
> )
> select
>     row_count_diff.difference as row_count_difference,
>     diff_count.num_missing as num_mismatched
> from row_count_diff
> join diff_count using (id);{code}
> Any help to resolve this issue is appreciated. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to