[
https://issues.apache.org/jira/browse/HIVE-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201587#comment-15201587
]
Jesus Camacho Rodriguez commented on HIVE-13254:
------------------------------------------------
[~prasanth_j], I have been looking into this, but I do not understand the
analysis fully yet...
Do you think that the problem in the stats annotation (way of calculating full
GBY cardinality LGTM) or in the logic that creates the tasks for the operators
(and that needs to consider if a certain input is already partitioned)? Or in
another way: do we consider partitions/parallelism to compute stats for other
operators, or do we just compute the full cardinality (i.e. all records that an
operator emits)?
In turn, shouldn't partition column be taken into account when we decide if the
intermediate reduction of the GBY can be computed using hash aggregation?
Currently, it doesn't.
> GBY cardinality estimation is wrong partition columns is involved
> -----------------------------------------------------------------
>
> Key: HIVE-13254
> URL: https://issues.apache.org/jira/browse/HIVE-13254
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 1.3.0, 2.0.0, 2.1.0
> Reporter: Prasanth Jayachandran
> Assignee: Jesus Camacho Rodriguez
> Attachments: q3_ef_transpose_aggr.svg
>
>
> When running the following query on TPCDS-1000 scale, setting
> hive.transpose.aggr.join=true is expected to generate optimal plan but it was
> not generating.
> {code:title=Query}
> SELECT `date_dim`.`d_day_name` AS `d_day_name`,
> `item`.`i_category` AS `i_category`
> FROM `store_sales` `store_sales`
> INNER JOIN `item` `item`
> ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` )
> INNER JOIN `date_dim` `date_dim`
> ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk`
> )
> GROUP BY `d_day_name`,
> `i_category`;
> {code}
> The reason for that is stats annotation rule for GROUP BY is not considering
> partition column into account. For the above query, the generated plan is
> attached. As we can see from the plan, GBY is pushed to fact table
> (store_sales) but that output of GBY shuffled to perform join instead of
> MapJoin conversion. This is because of wrong estimation of cardinality/data
> size of GBY on store_sales (Map 1).
> What's happening internally is, GBY computes estimated cardinality which in
> this case is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M.
> This estimate is wrong as ss_sold_date_sk is partition column and estimator
> assumes its non-partition column. In this case, not every tasks reads data
> from all partitions. We need to take estimated task parallelism into account.
> For example: If task parallelism is determined to be 100 the estimate from
> GBY should be ~6M which should convert this vertex into map join vertex.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)