[
https://issues.apache.org/jira/browse/HIVE-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jesus Camacho Rodriguez updated HIVE-13254:
-------------------------------------------
Target Version/s: (was: 2.1.0)
> GBY cardinality estimation is wrong partition columns is involved
> -----------------------------------------------------------------
>
> Key: HIVE-13254
> URL: https://issues.apache.org/jira/browse/HIVE-13254
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 1.3.0, 2.0.0, 2.1.0
> Reporter: Prasanth Jayachandran
> Assignee: Jesus Camacho Rodriguez
> Attachments: q3.svg, q3_ef_transpose_aggr.svg
>
>
> When running the following query on TPCDS-1000 scale, setting
> hive.transpose.aggr.join=true is expected to generate optimal plan but it was
> not generating.
> {code:title=Query}
> SELECT `date_dim`.`d_day_name` AS `d_day_name`,
> `item`.`i_category` AS `i_category`
> FROM `store_sales` `store_sales`
> INNER JOIN `item` `item`
> ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` )
> INNER JOIN `date_dim` `date_dim`
> ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk`
> )
> GROUP BY `d_day_name`,
> `i_category`;
> {code}
> The reason for that is stats annotation rule for GROUP BY is not considering
> partition column into account. For the above query, the generated plan is
> attached. As we can see from the plan, GBY is pushed to fact table
> (store_sales) but that output of GBY shuffled to perform join instead of
> MapJoin conversion. This is because of wrong estimation of cardinality/data
> size of GBY on store_sales (Map 1).
> What's happening internally is, GBY computes estimated cardinality which in
> this case is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M.
> This estimate is wrong as ss_sold_date_sk is partition column and estimator
> assumes its non-partition column. In this case, not every tasks reads data
> from all partitions. We need to take estimated task parallelism into account.
> For example: If task parallelism is determined to be 100 the estimate from
> GBY should be ~6M which should convert this vertex into map join vertex.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)