[ https://issues.apache.org/jira/browse/HIVE-22448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969880#comment-16969880 ]
Gopal Vijayaraghavan edited comment on HIVE-22448 at 11/8/19 6:52 AM:
----------------------------------------------------------------------
bq. I am trying to understand why this improvement is always a good idea
Unless we change other parts of the shuffle implementation, this is a good idea
right now (specifically, if we implemented a rack-level combiner, we could do
better than a grouping set + hash aggregate).
bq. along with grouping set (which increases the data size)
We're moving around where the data-size increase is actually happening.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java#L374
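To make that trade-off concrete, the grouping-set based expansion of the query from the issue description is logically something like the sketch below (a SQL-level approximation, not the exact operator tree Hive generates); every input row is emitted once per grouping set, which is where the data-size increase shows up:
{code}
-- Sketch: two count(distinct) aggregates over the same group-by keys expanded
-- into one grouping-set scan plus an outer aggregate. Rows are duplicated once
-- per grouping set, so the shuffled data roughly doubles here.
select z, x,
       count(y) as distinct_y,   -- only the (z, x, y) grouping set has non-null y
       count(a) as distinct_a    -- only the (z, x, a) grouping set has non-null a
from (
  select z, x, y, a
  from mytable1
  group by z, x, y, a grouping sets ((z, x, y), (z, x, a))
) t
group by z, x;
{code}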
bq. adding an extra reducer
That is true, but the distribution function is different between these two end
points, which is why this implementation yields a performance improvement from
the change in distribution when the group-by key has a low nDV (like a date)
and the count-distinct key has a high nDV (like a user_id).
That is a distributed-SQL-engine-specific improvement: a shuffle on a high-nDV
column always makes better use of the available CPU cores than a shuffle on a
low-nDV column (in the query example, we spread the z & x keys across all
reducers in the first pass, which prevents a skew from forming there; if, say,
z=[1] and x=[1,2], then only 2 reducers out of the 1009 estimated do any work
without the rewrite).
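One way to see the difference on the example query is to compare the shuffle (Reduce Output Operator) keys in the two plans; table and column names are the ones from the issue description, and 1009 is just the reducer estimate quoted above:
{code}
-- Without the rewrite: the single shuffle partitions on (z, x), so with z=[1]
-- and x=[1,2] at most two reducers receive any rows.
-- With the rewrite: the first shuffle key also carries the distinct columns
-- (y / a via the grouping set), spreading rows across all reducers before the
-- much smaller second shuffle on (z, x).
explain
select z, x, count(distinct y), count(distinct a)
from mytable1
group by z, x;
{code}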
The no-key group-by is the extreme case of that, which we fixed before:
effectively a single reducer received every key in the count distinct.
That was originally fixed up in the physical optimizer, but I prefer the
Calcite approach as it is a better place for this.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java#L64
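For reference, the no-group-by case handled there is logically equivalent to a two-stage rewrite along these lines (the actual fix rewrites the operator tree rather than the SQL text):
{code}
-- count(distinct y) with no GROUP BY: shuffle first on the high-nDV key y,
-- so the final single-reducer stage only counts already-deduplicated values
-- instead of receiving every input row.
select count(y) as distinct_y
from (select y from mytable1 group by y) t;
{code}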
bq. If so, isn't it better to make this optimization statistics based?
Everything is better that way, but the statistics aren't absolute and are
estimate driven; the point is that this optimization gives us an improvement on
the per-row cost, which is not accounted for in CBO right now (as in, the
constant multiplier against the row doesn't change).
As I mentioned before, this change indirectly corrects for a potential skew in
the group-by key, which can only be estimated if we have histograms.
> CBO: Expand the multiple count distinct with a group-by key
> -----------------------------------------------------------
>
> Key: HIVE-22448
> URL: https://issues.apache.org/jira/browse/HIVE-22448
> Project: Hive
> Issue Type: Bug
> Components: CBO
> Reporter: Gopal Vijayaraghavan
> Assignee: Jesus Camacho Rodriguez
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-22448.01.patch, HIVE-22448.02.patch,
> HIVE-22448.02.patch, HIVE-22448.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {code}
> create temporary table mytable1 (x integer, y integer, z integer, a integer);
> explain cbo
> select z, x, count(distinct y), count(distinct a)
> from mytable1
> group by z, x;
> explain cbo
> select count(distinct y), count(distinct a)
> from mytable1
> {code}
> The first is not vectorized, the second one is because of the grouping-set
> based rewrite for count distinct.