[
https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125326#comment-16125326
]
liyunzhang_intel commented on HIVE-17287:
-----------------------------------------
[~lirui]: In the current case, I have not set {{hive.spark.use.groupby.shuffle}},
so I assume it is true, since that is the default value. After
disabling {{spark.shuffle.reduceLocality.enabled}}, I reran query67 and it
passed. One strange thing I found is that [not all stages finished, but
the Spark job is reported as
completed|https://issues.apache.org/jira/secure/attachment/12881692/not_stages_completed_but_job_completed.PNG].
I don't know whether this is a Spark bug. Meanwhile, in the Spark history
server, the shuffle read metrics are still skewed.
> HoS can not deal with skewed data group by
> ------------------------------------------
>
> Key: HIVE-17287
> URL: https://issues.apache.org/jira/browse/HIVE-17287
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Attachments: not_stages_completed_but_job_completed.PNG,
> query67-fail-at-groupby.png, query67-groupby_shuffle_metric.png
>
>
> In
> [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql],
> the fact table {{store_sales}} is joined with the small tables {{date_dim}},
> {{item}}, and {{store}}. After the join, the intermediate data is grouped.
> The {{store_sales}} data in the 3 TB TPC-DS dataset is skewed: of its 1824
> partitions, the largest is 25.7 G and the others are about 715 M each.
> {code}
> hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales
> ....
> 715.0 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639
> 713.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640
> 714.1 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641
> 712.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642
> 25.7 G  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__
> {code}
> The skewed table {{store_sales}} caused the job to fail. Is there any way to
> solve the group-by problem for a skewed table? I tried enabling
> {{hive.groupby.skewindata}} to first distribute the data more evenly and then
> do the group by, but the job still hangs.
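
For context, a minimal sketch of the two-stage idea that {{hive.groupby.skewindata}} is meant to apply: rows are first spread across reducers at random and partially aggregated, then the partial results are merged by group-by key. This is plain Python for illustration only, not Hive or Spark code; the function name, the reducer count, and the sample data are all hypothetical.

```python
import random
from collections import Counter


def two_stage_count(rows, num_reducers=4, seed=0):
    """Count rows per key in two stages (illustrative sketch only)."""
    rng = random.Random(seed)

    # Stage 1: assign each row to a random reducer, ignoring its key,
    # so that a single hot key cannot overload one reducer.
    partials = [Counter() for _ in range(num_reducers)]
    for key in rows:
        partials[rng.randrange(num_reducers)][key] += 1

    # Stage 2: merge the (much smaller) partial counts by key.
    final = Counter()
    for partial in partials:
        final.update(partial)

    # Also return how many rows each stage-1 reducer handled.
    loads = [sum(partial.values()) for partial in partials]
    return final, loads


# Hypothetical skewed input: one hot key dominates the data.
rows = ["hot"] * 9000 + ["a", "b", "c"] * 100
final, loads = two_stage_count(rows)
print(final["hot"], loads)
```

In this sketch no stage-1 reducer receives all rows of the hot key, which is the balancing effect the setting aims for. It does not explain why the job in this issue still hangs with the setting enabled.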