[ https://issues.apache.org/jira/browse/HIVE-16980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071155#comment-16071155 ]
Rui Li commented on HIVE-16980:
-------------------------------

Hi [~kellyzly], in the attached query plan, there is only 1 reducer for Reducer 3:
{noformat}
Reducer 3 <- Map 8 (PARTITION-LEVEL SORT, 1), Reducer 2 (PARTITION-LEVEL SORT, 1)
{noformat}
Do you know why we use only 1 reducer to do the join in Reducer 3? Can you try forcing Hive to use more reducers in this stage? An easy way to do that is to manually set {{mapreduce.job.reduces}}.

> The partition of join is not divided evently in HOS
> ---------------------------------------------------
>
>                 Key: HIVE-16980
>                 URL: https://issues.apache.org/jira/browse/HIVE-16980
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>         Attachments: HIVE-16980_screenshot.png, query17_explain.log
>
> In HoS, the join is implemented as union + repartition sort. We use HashPartitioner to partition the result of the union.
> SortByShuffler.java
> {code}
> public JavaPairRDD<HiveKey, BytesWritable> shuffle(
>     JavaPairRDD<HiveKey, BytesWritable> input, int numPartitions) {
>   JavaPairRDD<HiveKey, BytesWritable> rdd;
>   if (totalOrder) {
>     if (numPartitions > 0) {
>       if (numPartitions > 1 && input.getStorageLevel() == StorageLevel.NONE()) {
>         input.persist(StorageLevel.DISK_ONLY());
>         sparkPlan.addCachedRDDId(input.id());
>       }
>       rdd = input.sortByKey(true, numPartitions);
>     } else {
>       rdd = input.sortByKey(true);
>     }
>   } else {
>     Partitioner partitioner = new HashPartitioner(numPartitions);
>     rdd = input.repartitionAndSortWithinPartitions(partitioner);
>   }
>   return rdd;
> }
> {code}
> In the Spark history server, I saw that there are 28 tasks in the repartition-sort stage; 21 of them finished in less than 1s while the remaining 7 took a long time to execute. Is there any way to make the data evenly assigned to every partition?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
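For context, the skew described above follows directly from hash partitioning: a minimal sketch, assuming the standard semantics of Spark's {{org.apache.spark.HashPartitioner}} (partition = non-negative {{hashCode % numPartitions}}). The class and key names below are hypothetical, for illustration only. It shows why every row sharing one hot join key lands in the same partition no matter how many partitions are configured — raising the reducer count spreads distinct keys, but cannot split a single skewed key.

```java
import java.util.HashMap;
import java.util.Map;

public class HashPartitionDemo {
    // Same modulo rule Spark's HashPartitioner uses, so negative
    // hashCode values still map into the range [0, numPartitions).
    static int partitionFor(Object key, int numPartitions) {
        int r = key.hashCode() % numPartitions;
        return r < 0 ? r + numPartitions : r;
    }

    public static void main(String[] args) {
        int numPartitions = 28;
        Map<Integer, Integer> rowsPerPartition = new HashMap<>();
        // Skewed input: most rows share one join key (hypothetical name),
        // so they all hash to the same partition.
        String hotKey = "customer_42";
        for (int i = 0; i < 1000; i++) {
            int p = partitionFor(hotKey, numPartitions);
            rowsPerPartition.merge(p, 1, Integer::sum);
        }
        // All 1000 rows end up in a single partition.
        System.out.println("partitions used for hot key: " + rowsPerPartition.size());
    }
}
```

Under this assumption, increasing {{mapreduce.job.reduces}} helps only when the long-running tasks hold many distinct keys; a single dominant key would still produce one straggler task.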