[jira] [Commented] (IMPALA-13465) Trace TupleId further to reduce Agg cardinality

ASF subversion and git services (Jira) Sat, 18 Jan 2025 11:45:13 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17914373#comment-17914373
 ]


ASF subversion and git services commented on IMPALA-13465:
----------------------------------------------------------

Commit c298c542621cb58ffe0772bf29ebdf7316cb77d1 in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=c298c5426 ]

IMPALA-13644: Generalize and move getPerInstanceNdvForCpuCosting

getPerInstanceNdvForCpuCosting is a method to estimate the number of
distinct values of exprs per fragment instance while accounting for the
likelihood of duplicate keys across fragment instances. It borrows the
probabilistic model described in IMPALA-2945. This method is used only
by AggregationNode.

getPerInstanceNdvForCpuCosting runs the probabilistic formula
individually for each grouping expression and then multiplies the
results together. That matches how we estimated group NDV in the past,
where we simply multiplied the NDVs of the grouping expressions.

Recently, we added tuple-based analysis to lower the cardinality
estimate for all kinds of aggregation nodes (IMPALA-13045, IMPALA-13465,
IMPALA-13086). All of the bounding happens in
AggregationNode.computeStats(), where we call the estimateNumGroups()
function that returns the globalNdv estimate for a specific aggregation
class.

To take advantage of that more precise globalNdv, this patch replaces
getPerInstanceNdvForCpuCosting() with estimatePreaggCardinality(), which
applies the probabilistic formula to this single globalNdv number
instead of the old approach, which often returned an overestimate from
the NDV multiplication method. Its use is still limited to calculating
ProcessingCost. Using it for preagg output cardinality will be done by
IMPALA-2945.
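
The difference between the two approaches can be sketched as below. This
is only an illustration, not Impala's code: the message does not spell
out the formula, so the sketch assumes the common uniform-draw model
(expected distinct values after drawing N rows from NDV equally likely
values), and the class name, method names, and sample numbers are all
hypothetical.

```java
// Hypothetical sketch; none of these names exist in Impala.
final class PreaggNdvSketch {
  /**
   * Assumed model: expected number of distinct values seen after drawing
   * `rows` rows uniformly at random from `ndv` equally likely values,
   *   ndv * (1 - (1 - 1/ndv)^rows)
   */
  static double expectedDistinct(double ndv, double rows) {
    if (ndv <= 0 || rows <= 0) return 0.0;
    double n = Math.max(1.0, ndv);
    // log1p/expm1 keep precision when 1/n is very small.
    return n * -Math.expm1(rows * Math.log1p(-1.0 / n));
  }

  /** Old style: apply the model per grouping expression, then multiply. */
  static double perExprProduct(double[] exprNdvs, double inputRows,
      int numInstances) {
    double rowsPerInstance = inputRows / numInstances;
    double product = 1.0;
    for (double ndv : exprNdvs) {
      product *= expectedDistinct(ndv, rowsPerInstance);
    }
    return product;
  }

  /** New style: apply the model once to a single, already bounded globalNdv. */
  static double fromGlobalNdv(double globalNdv, double inputRows,
      int numInstances) {
    return expectedDistinct(globalNdv, inputRows / numInstances);
  }
}
```

With two grouping expressions of NDV 100 each and 1000 input rows per
instance, the per-expression product approaches 10000, more than the
number of rows the instance even sees, while a tuple-bounded globalNdv of
500 (an invented value) keeps the per-instance estimate below 500.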

estimatePreaggCardinality is skipped if the data partition of the input
is a subset of the grouping expressions.
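
One plausible reading of this condition: when the input is already
partitioned on a subset of the grouping expressions, all rows of a given
group land on the same fragment instance, so there are no duplicate keys
across instances for the formula to account for. A minimal sketch of such
a check follows. Strings stand in for Impala's expression objects, the
class and method names are hypothetical, and treating an empty partition
list as "not covered" is an assumption of this sketch.

```java
import java.util.HashSet;
import java.util.List;

// Hypothetical sketch; not Impala's implementation.
final class PreaggSkipSketch {
  /**
   * Returns true if every partition expression is also a grouping
   * expression. An empty partition list (for example an unpartitioned
   * input) is treated as not covered; that is an assumption made here.
   */
  static boolean partitionCoveredByGrouping(List<String> partitionExprs,
      List<String> groupingExprs) {
    if (partitionExprs.isEmpty()) return false;
    return new HashSet<>(groupingExprs).containsAll(partitionExprs);
  }
}
```

For example, an input partitioned on (a) feeding GROUP BY a, b is
covered, while an input partitioned on (a, c) is not.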

Testing:
- Run and pass PlannerTest with COMPUTE_PROCESSING_COST=True.
  ProcessingCost changes, but all cardinality numbers stay the same.
- Add CardinalityTest#testEstimatePreaggCardinality.
- Update test_executor_groups.py. Enable v2 profile as well for easier
  runtime profile debugging.

Change-Id: Iddf75833981558fe0188ea7475b8d996d66983c1
Reviewed-on: http://gerrit.cloudera.org:8080/22320
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Trace TupleId further to reduce Agg cardinality
> -----------------------------------------------
>
>                 Key: IMPALA-13465
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13465
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Riza Suminto
>            Assignee: Riza Suminto
>            Priority: Major
>             Fix For: Impala 4.5.0
>
>
> IMPALA-13405 does tuple analysis to lower AggregationNode cardinality. It
> began by focusing on simple column SlotRefs, but we can improve this further
> by tracing the origin TupleId across views and intermediate aggregation tuples.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

