[ https://issues.apache.org/jira/browse/HIVE-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prasanth Jayachandran updated HIVE-13254: ----------------------------------------- Attachment: q3.svg > GBY cardinality estimation is wrong partition columns is involved > ----------------------------------------------------------------- > > Key: HIVE-13254 > URL: https://issues.apache.org/jira/browse/HIVE-13254 > Project: Hive > Issue Type: Bug > Components: Hive > Affects Versions: 1.3.0, 2.0.0, 2.1.0 > Reporter: Prasanth Jayachandran > Assignee: Jesus Camacho Rodriguez > Attachments: q3.svg, q3_ef_transpose_aggr.svg > > > When running the following query on TPCDS-1000 scale, setting > hive.transpose.aggr.join=true is expected to generate optimal plan but it was > not generating. > {code:title=Query} > SELECT `date_dim`.`d_day_name` AS `d_day_name`, > `item`.`i_category` AS `i_category` > FROM `store_sales` `store_sales` > INNER JOIN `item` `item` > ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` ) > INNER JOIN `date_dim` `date_dim` > ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk` > ) > GROUP BY `d_day_name`, > `i_category`; > {code} > The reason for that is stats annotation rule for GROUP BY is not considering > partition column into account. For the above query, the generated plan is > attached. As we can see from the plan, GBY is pushed to fact table > (store_sales) but that output of GBY shuffled to perform join instead of > MapJoin conversion. This is because of wrong estimation of cardinality/data > size of GBY on store_sales (Map 1). > What's happening internally is, GBY computes estimated cardinality which in > this case is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M. > This estimate is wrong as ss_sold_date_sk is partition column and estimator > assumes its non-partition column. In this case, not every tasks reads data > from all partitions. We need to take estimated task parallelism into account. > For example: If task parallelism is determined to be 100 the estimate from > GBY should be ~6M which should convert this vertex into map join vertex. -- This message was sent by Atlassian JIRA (v6.3.4#6332)