[ 
https://issues.apache.org/jira/browse/HIVE-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204700#comment-15204700
 ] 

Ashutosh Chauhan commented on HIVE-13254:
-----------------------------------------

[~prasanth_j] Are you suggesting that Map1 -> Reducer2 edge should have been 
broadcast instead of shuffle ? I am not sure we support broadcast edge between 
Map side GBY & Reduce side GBY?


> GBY cardinality estimation is wrong partition columns is involved
> -----------------------------------------------------------------
>
>                 Key: HIVE-13254
>                 URL: https://issues.apache.org/jira/browse/HIVE-13254
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 1.3.0, 2.0.0, 2.1.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Jesus Camacho Rodriguez
>         Attachments: q3_ef_transpose_aggr.svg
>
>
> When running the following query on TPCDS-1000 scale, setting 
> hive.transpose.aggr.join=true is expected to generate optimal plan but it was 
> not generating. 
> {code:title=Query}
> SELECT `date_dim`.`d_day_name` AS `d_day_name`, 
>        `item`.`i_category`     AS `i_category` 
> FROM   `store_sales` `store_sales` 
>        INNER JOIN `item` `item` 
>                ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` ) 
>        INNER JOIN `date_dim` `date_dim` 
>                ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk` 
> ) 
> GROUP  BY `d_day_name`, 
>           `i_category`;
> {code}
> The reason for that is stats annotation rule for GROUP BY is not considering 
> partition column into account. For the above query, the generated plan is 
> attached. As we can see from the plan, GBY is pushed to fact table 
> (store_sales) but that output of GBY shuffled to perform join instead of 
> MapJoin conversion. This is because of wrong estimation of cardinality/data 
> size of GBY on store_sales (Map 1). 
> What's happening internally is, GBY computes estimated cardinality which in 
> this case is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M. 
> This estimate is wrong as ss_sold_date_sk is partition column and estimator 
> assumes its non-partition column. In this case, not every tasks reads data 
> from all partitions. We need to take estimated task parallelism into account. 
> For example: If task parallelism is determined to be 100 the estimate from 
> GBY should be ~6M which should convert this vertex into map join vertex. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to