liyunzhang_intel commented on HIVE-17287:

[~gopalv],[~lirui]: the result why the output of join is skewed is because I 
convert all join to map join. In following query, fact table is store_sales and 
 dimension tables are date_dim,store and item. The total size of date_dim, 
store and item is smaller than the 
{{hive.auto.convert.join.noconditionaltask.size}}.  Hive starts 11 map works to 
read store_sales and do map join. There is possibility that there is no records 
in one map work because no match data in other dimension tables with 
 select i_category
            from store_sales
       where  store_sales.ss_sold_date_sk=date_dim.d_date_sk
          and store_sales.ss_item_sk=item.i_item_sk
          and store_sales.ss_store_sk = store.s_store_sk
          and d_month_seq between 1193 and 1193+11;

It is reasonable that the result of map join is not even but is there any way 
to make it even? because it will cause the data assigned to the group by tasks 
is not even if group by operation follows the map join. 

> HoS can not deal with skewed data group by
> ------------------------------------------
>                 Key: HIVE-17287
>                 URL: https://issues.apache.org/jira/browse/HIVE-17287
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
> In 
> [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql],
>  fact table {{store_sales}} joins with small tables {{date_dim}}, 
> {{item}},{{store}}. After join, groupby the intermediate data.
> Here the data of {{store_sales}} on 3TB tpcds is skewed:  there are 1824 
> partitions. The biggest partition is 25.7G and others are 715M.
> {code}
> hadoop fs -du -h 
> /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales
> ....
> 715.0 M  
> /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639
> 713.9 M  
> /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640
> 714.1 M  
> /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641
> 712.9 M  
> /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642
> 25.7 G   
> /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__
> {code}
> The skewed table {{store_sales}} caused the failed job. Is there any way to 
> solve the groupby problem of skewed table?  I tried to enable 
> {{hive.groupby.skewindata}} to first divide the data more evenly then start 
> do group by. But the job still hangs. 

This message was sent by Atlassian JIRA

Reply via email to