[ 
https://issues.apache.org/jira/browse/IMPALA-12960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834080#comment-17834080
 ] 

ASF subversion and git services commented on IMPALA-12960:
----------------------------------------------------------

Commit 4be5fd8896dcd445a6379bdcda4bdcf318f24511 in impala's branch 
refs/heads/master from Yida Wu
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=4be5fd889 ]

IMPALA-12960: Fix Incorrect RowsPassedThrough Metric in Streaming Aggregation

This patch fixes a bug in the RowsPassedThrough metric in the
query profile when Streaming Aggregation is used. The bug comes from
the logic of AddBatchStreaming(): the output batch does not
necessarily contain 0 rows on entry, yet the function uses the
output batch's num_rows() directly as the number of rows returned
and passed through by this specific aggregator. The discrepancy can
significantly skew the returned and passed-through row counts, as
well as the reduction rates calculated during hash table expansion
in Streaming Aggregation. The differences are especially large
when the rollup function is used.

The solution is to calculate the actual number of rows added
to the output batch within each round of the AddBatchStreaming()
function.
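
The counting change described above can be illustrated with a minimal
sketch. This is not the actual Impala code: OutputBatch, Aggregator, and
the two counter fields are hypothetical stand-ins, and only the names
AddBatchStreaming() and num_rows() come from the commit message. The
sketch assumes the fix amounts to recording the batch's row count on
entry and counting only the difference.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an output row batch; only the row count matters here.
struct OutputBatch {
  std::vector<int> rows;
  int num_rows() const { return static_cast<int>(rows.size()); }
};

// Hypothetical stand-in for one aggregator in a streaming aggregation.
struct Aggregator {
  int64_t rows_passed_through_buggy = 0;  // counts the way the bug report describes
  int64_t rows_passed_through_fixed = 0;  // counts only rows added by this call

  // Appends `n` pass-through rows to the (possibly shared) output batch
  // and updates both counters so the two approaches can be compared.
  void AddBatchStreaming(OutputBatch* out, int n) {
    const int rows_before = out->num_rows();  // may be non-zero if the batch is shared
    for (int i = 0; i < n; ++i) out->rows.push_back(i);

    // Buggy: assumes the batch was empty when this call started.
    rows_passed_through_buggy += out->num_rows();
    // Fixed: count only the rows this call actually added.
    rows_passed_through_fixed += out->num_rows() - rows_before;
  }
};
```

With two aggregators sharing one batch, where the first adds 3 rows and
the second adds 2, the delta-based counters read 3 and 2, while the
second aggregator's naive counter reads 5 because it also counts the
first aggregator's rows.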

Tests:
Passed exhaustive tests.
Added a corresponding case in tpch-passthrough-aggregations.test.

Change-Id: I59205a4b06824ee1607a25e906db1f96dc4eda9f
Reviewed-on: http://gerrit.cloudera.org:8080/21235
Reviewed-by: Wenzhe Zhou <[email protected]>
Reviewed-by: Riza Suminto <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Incorrect RowsPassedThrough Metric in Streaming Aggregation
> -----------------------------------------------------------
>
>                 Key: IMPALA-12960
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12960
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Yida Wu
>            Assignee: Yida Wu
>            Priority: Major
>
> The grouping aggregation logic uses the output batch's row count as the 
> number of rows returned by the current aggregator. However, in streaming 
> aggregation the output batch is shared with other aggregators and does not 
> necessarily start at 0 rows. This can produce incorrect counts and, in 
> turn, an inaccurate RowsPassedThrough metric.
> https://github.com/apache/impala/blob/f55077007bf68e6cbeaa15cf270c333af847a1f1/be/src/exec/grouping-aggregator.cc#L519



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
