[
https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122897#comment-16122897
]
liyunzhang_intel commented on HIVE-17287:
-----------------------------------------
[~gopalv],[~lirui]: the output of the join is skewed because I convert every
join to a map join. In the following query, the fact table is store_sales and
the dimension tables are date_dim, store and item. The total size of date_dim,
store and item is smaller than
{{hive.auto.convert.join.noconditionaltask.size}}, so Hive starts 11 map works
to read store_sales and perform the map join. A map work may produce no records
at all if the store_sales rows it reads have no matching rows in the dimension
tables.
{code}
select i_category
,i_class
,i_brand
,i_product_name
,d_year
,d_qoy
,d_moy
,s_store_id
,store_sales.ss_sold_date_sk
,store_sales.ss_item_sk
,store_sales.ss_store_sk
from store_sales
,date_dim
,store
,item
where store_sales.ss_sold_date_sk=date_dim.d_date_sk
and store_sales.ss_item_sk=item.i_item_sk
and store_sales.ss_store_sk = store.s_store_sk
and d_month_seq between 1193 and 1193+11;
{code}
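To check whether some map works really receive no matching rows, one could count, per store_sales partition, how many rows survive the date_dim filter. The query below is only a sketch that reuses the tables and predicate above; the aliases and the matched_rows column name are illustrative, and it has not been run against this dataset.
{code}
-- Sketch: rows of store_sales that still match date_dim after the
-- d_month_seq filter, counted per partition key.
select ss.ss_sold_date_sk
,count(*) as matched_rows
from store_sales ss
join date_dim d
on ss.ss_sold_date_sk = d.d_date_sk
where d.d_month_seq between 1193 and 1193+11
group by ss.ss_sold_date_sk
order by matched_rows desc
limit 20;
{code}
Partitions that do not appear in the result contribute nothing to the join output, so a map work that reads only such partitions would emit no records.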
It is expected that the output of a map join is uneven, but is there any way
to make it even? If a group by follows the map join, the data assigned to the
group by tasks will be uneven as well.
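One experiment that might be worth trying, offered only as an untested sketch: turn off automatic map join conversion for this session, so that rows are shuffled by join key instead of being joined inside whichever map work happened to read them. Both properties below are standard Hive settings, but whether they help in this case is an assumption; a shuffle join has its own cost and can hit the same skew on hot keys.
{code}
-- Untested sketch: disable automatic map join conversion for this session.
set hive.auto.convert.join=false;
-- Already tried in the issue description; keeps the two-stage group by.
set hive.groupby.skewindata=true;
{code}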
> HoS can not deal with skewed data group by
> ------------------------------------------
>
> Key: HIVE-17287
> URL: https://issues.apache.org/jira/browse/HIVE-17287
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
>
> In
> [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql],
> the fact table {{store_sales}} is joined with the small tables {{date_dim}},
> {{item}} and {{store}}. After the join, a group by is applied to the
> intermediate data.
> The {{store_sales}} data in the 3TB TPC-DS dataset is skewed: there are 1824
> partitions, the biggest is 25.7G and the others are about 715M each.
> {code}
> hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales
> ....
> 715.0 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639
> 713.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640
> 714.1 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641
> 712.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642
> 25.7 G  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__
> {code}
> The skewed {{store_sales}} table causes the job to fail. Is there any way to
> solve the group by problem for a skewed table? I tried enabling
> {{hive.groupby.skewindata}} to first distribute the data more evenly and then
> do the group by, but the job still hangs.