[jira] [Commented] (IMPALA-12383) Aggregation with num_nodes=1 and limit returns too many rows

ASF subversion and git services (Jira) Tue, 12 Sep 2023 00:22:04 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-12383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764043#comment-17764043
 ]


ASF subversion and git services commented on IMPALA-12383:
----------------------------------------------------------

Commit 704ff7788d015dcbe66a319fb017d0a3f8a76399 in impala's branch 
refs/heads/master from Michael Smith
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=704ff7788 ]

IMPALA-12383: Fix SingleNodePlanner aggregation limits

IMPALA-2581 added enforcement of the limit when adding entries to the
grouping aggregation: it stops adding new entries once the number of
entries in the grouping aggregation is >= the limit. The reasoning was
that if the grouping aggregation never contains more entries than the
limit, it cannot output more rows than the limit.

However, this limit was not enforced exactly when adding. The
aggregation adds a whole batch before checking the limit, so it can go
past the limit. In practice the exchange in a distributed aggregation
enforces the limit, so the problem only shows up when num_nodes=1. As a
result, the following query incorrectly returns 16 rows, not 10:

  set num_nodes=1;
  select distinct l_orderkey from tpch.lineitem limit 10;
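
The overshoot described above can be modelled with a minimal sketch. This
is a toy model, not Impala's backend code: the function name, the use of
a dict as the hash table, and the batch size of 8 are all assumptions
chosen for illustration (the batch size only makes the toy numbers easy
to follow and is not Impala's real batch size).

```python
from typing import Dict, Iterable, List


def grouping_agg_batch_check(batches: Iterable[List[int]], limit: int) -> List[int]:
    """Hypothetical model of a grouping aggregation that checks the limit
    only between batches, never between individual rows."""
    groups: Dict[int, None] = {}  # dict keeps insertion order of distinct keys
    for batch in batches:
        # The limit is tested once per batch, before the batch is added.
        if len(groups) >= limit:
            break
        # The whole batch is then added without re-checking the limit.
        for key in batch:
            groups.setdefault(key, None)
    # No limit is applied on output, so every stored group is returned.
    return list(groups)


# Three batches of distinct keys; batch size 8 is an illustrative assumption.
batches = [list(range(1, 9)), list(range(9, 17)), [17, 18]]
rows = grouping_agg_batch_check(batches, limit=10)
print(len(rows))
```

After the first batch the table holds 8 groups, which is below the limit
of 10, so the second batch is added in full and the table ends up holding
more groups than the limit allows.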

One option is to be exact when adding items to the grouping aggregation,
which would require testing the limit on each row (we don't know which
rows are duplicates). This is awkward. The original change (stopping the
children early once the limit is reached) also did not really need to
remove the limit on the output of the aggregation. Instead, we restore
the limit on the output of the grouping agg (which is already known to
work).
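
As a rough sketch of the approach taken, assuming a toy model rather than
the actual patch: the per-batch check stays in place so the child can
still be stopped early, and a separate cap is applied when emitting
groups. The function name, the `pulled` counter, and the batch layout are
hypothetical and exist only for illustration.

```python
from itertools import islice
from typing import Dict, Iterable, List


def grouping_agg_with_output_limit(
    batches: Iterable[List[int]], limit: int, pulled: List[int]
) -> List[int]:
    """Hypothetical model: coarse per-batch check on input, exact limit on output."""
    groups: Dict[int, None] = {}
    child = iter(batches)
    # Stop pulling from the child once enough groups exist (may overshoot).
    while len(groups) < limit:
        batch = next(child, None)
        if batch is None:
            break  # child exhausted
        pulled.append(len(batch))  # record how many batches were consumed
        groups.update(dict.fromkeys(batch))
    # The output limit makes the result exact regardless of the overshoot.
    return list(islice(groups, limit))


pulled: List[int] = []
batches = [list(range(1, 9)), list(range(9, 17)), [17, 18]]
rows = grouping_agg_with_output_limit(batches, limit=10, pulled=pulled)
print(len(rows), len(pulled))
```

In this model the third batch is never requested from the child, so the
early-stop benefit is kept, while the number of rows returned never
exceeds the limit.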

Testing:
- Added a test case that asserts the number of rows returned by an
  aggregation node (rather than an exchange or top-n).
- Restored the definition of ALL_CLUSTER_SIZES and made it simpler to
  enable for individual test suites. Filed IMPALA-12394 to generally
  re-enable testing with ALL_CLUSTER_SIZES. Enabled ALL_CLUSTER_SIZES
  for the aggregation tests.

Change-Id: Ic5eec1190e8e182152aa954897b79cc3f219c816
Reviewed-on: http://gerrit.cloudera.org:8080/20379
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Joe McDonnell <[email protected]>


> Aggregation with num_nodes=1 and limit returns too many rows
> ------------------------------------------------------------
>
>                 Key: IMPALA-12383
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12383
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend, Frontend
>    Affects Versions: Impala 4.1.0
>            Reporter: Michael Smith
>            Assignee: Michael Smith
>            Priority: Major
>             Fix For: Impala 4.3.0
>
>
> With {{set num_nodes=1}} to select SingleNodePlanner, aggregations return too 
> many rows:
> {code}
> > select distinct l_orderkey from tpch.lineitem limit 10;
> ...
> Fetched 16 row(s) in 0.12s
> > select ss_cdemo_sk from tpcds.store_sales group by ss_cdemo_sk limit 3;
> ...
> Fetched 7 row(s) in 0.14s
> {code}
> This looks like it's caused by changes in IMPALA-2581, which attempts to push 
> down limits to pre-aggregation. In SingleNodePlanner, there is no 
> pre-aggregation, which the patch appears to have failed to account for.


