[ https://issues.apache.org/jira/browse/HIVE-16980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071883#comment-16071883 ]

liyunzhang_intel commented on HIVE-16980:
-----------------------------------------

I have filed HIVE-17010 to track the overflow problem, but give me more time to 
ensure that records are not divided unevenly after several joins in HOS. In my 
view, the HiveKey is generated by ReduceSinkOperator#computeHashCode. Is there 
any possibility that the key becomes skewed after several joins? For example, in 
select * from C, (select * from A join B where key="1"), the key in the result 
is always 1, which may be skewed. [~lirui], do you recommend setting 
hive.optimize.skewjoin to true to convert the join to a skew join at runtime?
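
To illustrate the concern, a minimal, hypothetical sketch (not Hive code; the 
28-partition count is taken from the description below): Spark's HashPartitioner 
routes every record with the same key to one partition, so a constant join key 
leaves all but one reducer nearly idle.
{code}
import org.apache.spark.HashPartitioner;

// Hypothetical demo, not part of Hive: a constant key (like key="1" above)
// always maps to the same partition index.
public class ConstantKeySkewDemo {
  public static void main(String[] args) {
    // 28 partitions, matching the task count reported in this issue.
    HashPartitioner partitioner = new HashPartitioner(28);

    // Every record with key "1" lands in the same partition...
    System.out.println("\"1\" -> partition " + partitioner.getPartition("1"));

    // ...while distinct keys spread across partitions:
    for (String key : new String[] {"a", "b", "c", "d"}) {
      System.out.println("\"" + key + "\" -> partition " + partitioner.getPartition(key));
    }
  }
}
{code}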

> The partitions of a join are not divided evenly in HOS
> ------------------------------------------------------
>
>                 Key: HIVE-16980
>                 URL: https://issues.apache.org/jira/browse/HIVE-16980
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>         Attachments: HIVE-16980_screenshot.png, query17_explain.log
>
>
> In HoS, the join implementation is union + repartition sort. We use 
> HashPartitioner to partition the result of the union. 
> SortByShuffler.java
> {code}
>   public JavaPairRDD<HiveKey, BytesWritable> shuffle(
>       JavaPairRDD<HiveKey, BytesWritable> input, int numPartitions) {
>     JavaPairRDD<HiveKey, BytesWritable> rdd;
>     if (totalOrder) {
>       if (numPartitions > 0) {
>         // Persist the input so sortByKey's sampling pass (which computes
>         // the range-partition bounds) does not recompute it.
>         if (numPartitions > 1 && input.getStorageLevel() == StorageLevel.NONE()) {
>           input.persist(StorageLevel.DISK_ONLY());
>           sparkPlan.addCachedRDDId(input.id());
>         }
>         rdd = input.sortByKey(true, numPartitions);
>       } else {
>         rdd = input.sortByKey(true);
>       }
>     } else {
>       // Join path: hash-partition records by HiveKey, then sort within each
>       // partition. All records with equal keys land in the same partition.
>       Partitioner partitioner = new HashPartitioner(numPartitions);
>       rdd = input.repartitionAndSortWithinPartitions(partitioner);
>     }
>     return rdd;
>   }
> {code}
> In the Spark history server, I saw that there are 28 tasks in the repartition 
> sort stage; 21 tasks finished in less than 1s while the remaining 7 tasks 
> took a long time to execute. Is there any way to make the data evenly 
> assigned to every partition?
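
To quantify the uneven split described above, one could count the records per 
partition of the RDD returned by shuffle(). A sketch under that assumption 
(printPartitionSizes is a hypothetical helper, not part of SortByShuffler):
{code}
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.hive.ql.io.HiveKey;
import org.apache.hadoop.io.BytesWritable;
import org.apache.spark.api.java.JavaPairRDD;

import scala.Tuple2;

public class PartitionSizeCheck {
  // Prints how many records each partition received; a heavily skewed
  // distribution here would explain the 7 long-running tasks.
  static void printPartitionSizes(JavaPairRDD<HiveKey, BytesWritable> rdd) {
    List<Tuple2<Integer, Integer>> sizes = rdd
        .mapPartitionsWithIndex((index, records) -> {
          int count = 0;
          while (records.hasNext()) {
            records.next();
            count++;
          }
          return Collections.singletonList(new Tuple2<>(index, count)).iterator();
        }, true)
        .collect();
    for (Tuple2<Integer, Integer> size : sizes) {
      System.out.println("partition " + size._1() + ": " + size._2() + " records");
    }
  }
}
{code}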



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
