[ 
https://issues.apache.org/jira/browse/HIVE-22448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969880#comment-16969880
 ] 

Gopal Vijayaraghavan edited comment on HIVE-22448 at 11/8/19 6:52 AM:
----------------------------------------------------------------------

bq.  I am trying to understand why this improvement is always a good idea

Unless we change other parts of the shuffle implementation, this is a good idea 
right now (specifically, only if we implemented a rack-level combiner could we 
do better than a grouping set + hash aggregate).

bq. along with grouping set (which increases the data size)

We're moving around where the data-size increase is actually happening.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java#L374
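
To make the shape of that concrete, here is a rough sketch of what the 
grouping-set based expansion looks like for the first query in the description 
(illustrative only - the actual Calcite rewrite and the exact grouping() / 
grouping__id handling differ across Hive versions):

{code}
-- original
select z, x, count(distinct y), count(distinct a)
from mytable1
group by z, x;

-- conceptual two-stage expansion: the inner aggregate emits one row per
-- grouping set (roughly one copy of the deduplicated data per distinct
-- column), which is where the data-size increase moves to
select z, x,
       count(case when g_a = 1 then y end) as cnt_distinct_y,
       count(case when g_y = 1 then a end) as cnt_distinct_a
from (
  select z, x, y, a, grouping(y) as g_y, grouping(a) as g_a
  from mytable1
  group by z, x, y, a
  grouping sets ((z, x, y), (z, x, a))
) t
group by z, x;
{code}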

bq. adding an extra reducer

That is true, but the distribution function is different between these two end 
points, which is why we end up getting a performance improvement from this 
implementation when the group-by key has a low nDV (like a date) and the count 
distinct key has a high nDV (like a user_id).

That is a distributed-SQL-engine specific improvement: a shuffle on a high-nDV 
column always makes better use of the available cpu cores than a shuffle on a 
low-nDV column (in the query example, we spread the z & x keys across all 
reducers in the first pass, which prevents a skew from forming there - if 
z=[1] and x=[1,2], then only 2 reducers would do any work out of the 1009 
estimated).
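
One rough way to see that skew directly on the example table (a hypothetical 
sanity check, not part of the patch) is to compare how many distinct shuffle 
keys each stage can ever produce:

{code}
-- the single-stage plan shuffles on (z, x): with z=[1] and x=[1,2] there are
-- at most 2 distinct partition keys, so at most 2 of the 1009 reducers get
-- any rows. the first pass of the rewritten plan shuffles on (z, x, y) /
-- (z, x, a), so the key count scales with the high-nDV columns instead.
select count(distinct z, x)    as single_stage_shuffle_keys,
       count(distinct z, x, y) as first_pass_shuffle_keys_y,
       count(distinct z, x, a) as first_pass_shuffle_keys_a
from mytable1;
{code}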

The no-key group-by is the extreme case of that, which we fixed before: there, 
effectively only 1 reducer received every single key in the count distinct.

That was originally fixed up in the physical optimizer, but I prefer the 
Calcite approach as it is a better place for this.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java#L64
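
For comparison, the effect of that earlier fix on the second query in the 
description is roughly the following (again a sketch, not the literal output 
of CountDistinctRewriteProc):

{code}
-- no group-by key: without the rewrite, a single reducer would receive every
-- (y, a) value; with it, the first stage dedups y and a independently via
-- grouping sets and the final reducer mostly combines partial counts
select count(case when g_a = 1 then y end),
       count(case when g_y = 1 then a end)
from (
  select y, a, grouping(y) as g_y, grouping(a) as g_a
  from mytable1
  group by y, a
  grouping sets ((y), (a))
) t;
{code}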

bq. If so, isn't it better to make this optimization statistics based?

Everything is better that way - but the statistics aren't absolute and are 
estimate driven, and this optimization right now gives us an improvement in 
the per-row cost which is not accounted for in CBO right now (as in, the 
constant multiplier against the row doesn't change).

As I mentioned before, this change indirectly corrects for a potential skew in 
the group-by key which can only be estimated if we have histograms.


> CBO: Expand the multiple count distinct with a group-by key
> -----------------------------------------------------------
>
>                 Key: HIVE-22448
>                 URL: https://issues.apache.org/jira/browse/HIVE-22448
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>            Reporter: Gopal Vijayaraghavan
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-22448.01.patch, HIVE-22448.02.patch, 
> HIVE-22448.02.patch, HIVE-22448.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code}
> create temporary table mytable1 (x integer, y integer, z integer, a integer);
> explain cbo
> select z, x, count(distinct y), count(distinct a)
> from mytable1
> group by z, x;
> explain cbo
> select count(distinct y), count(distinct a)
> from mytable1
> {code}
> The first is not vectorized, the second one is because of the grouping-set 
> based rewrite for count distinct.


