[
https://issues.apache.org/jira/browse/HIVE-16980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065908#comment-16065908
]
Xuefu Zhang commented on HIVE-16980:
------------------------------------
This is interesting. Have you checked your data for skew? In theory, the hash
partitioner does a pretty good job of distributing the keys evenly. Unless the
keys are skewed, each partition is expected to process about the same number of
rows.
It's possible to provide a custom partitioner here, but I'm not entirely sure
if that's worthwhile.
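
As a rough illustration of the skew point, the sketch below mimics hash
partitioning in plain Java, without any Spark dependency: a key goes to the
partition given by its hashCode modulo the partition count, made non-negative.
The class name, method names, row count and the "hot" key are all made up for
the example; only the 28 partitions come from the report. It is not the code
path Hive uses, just a way to see how one dominant key overloads one partition.

```java
// Sketch only: plain Java, no Spark dependency. All names here are hypothetical.
final class PartitionSkewSketch {

    // Same idea as a hash partitioner: non-negative (hashCode mod numPartitions).
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Counts how many rows land in each partition for the given join keys.
    static long[] rowsPerPartition(int[] keys, int numPartitions) {
        long[] counts = new long[numPartitions];
        for (int key : keys) {
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }

    // Largest partition divided by the mean partition size; 1.0 means perfectly even.
    static double skewRatio(long[] counts) {
        long max = 0;
        long total = 0;
        for (long c : counts) {
            max = Math.max(max, c);
            total += c;
        }
        return total == 0 ? 1.0 : (double) max * counts.length / total;
    }

    public static void main(String[] args) {
        int numPartitions = 28; // task count mentioned in the issue
        int rows = 280_000;     // assumed row count, for illustration only

        // Distinct keys 0..rows-1: rows spread evenly over the partitions.
        int[] uniform = new int[rows];
        for (int i = 0; i < rows; i++) {
            uniform[i] = i;
        }

        // Skewed input: 9 out of 10 rows share one assumed hot key (42).
        int[] skewed = new int[rows];
        for (int i = 0; i < rows; i++) {
            skewed[i] = (i % 10 == 0) ? i : 42;
        }

        System.out.println("uniform skew ratio: "
            + skewRatio(rowsPerPartition(uniform, numPartitions)));
        System.out.println("skewed skew ratio:  "
            + skewRatio(rowsPerPartition(skewed, numPartitions)));
    }
}
```

If the real join keys look like the second case, no choice of partition count
fixes it with a hash partitioner, because all rows with the same key must go to
the same partition.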
> The partition of join is not divided evenly in HOS
> --------------------------------------------------
>
> Key: HIVE-16980
> URL: https://issues.apache.org/jira/browse/HIVE-16980
> Project: Hive
> Issue Type: Bug
> Reporter: liyunzhang_intel
>
> In HoS, the join is implemented as a union followed by a repartition sort. We
> use HashPartitioner to partition the result of the union.
> SortByShuffler.java
> {code}
>   public JavaPairRDD<HiveKey, BytesWritable> shuffle(
>       JavaPairRDD<HiveKey, BytesWritable> input, int numPartitions) {
>     JavaPairRDD<HiveKey, BytesWritable> rdd;
>     if (totalOrder) {
>       if (numPartitions > 0) {
>         if (numPartitions > 1 && input.getStorageLevel() == StorageLevel.NONE()) {
>           input.persist(StorageLevel.DISK_ONLY());
>           sparkPlan.addCachedRDDId(input.id());
>         }
>         rdd = input.sortByKey(true, numPartitions);
>       } else {
>         rdd = input.sortByKey(true);
>       }
>     } else {
>       Partitioner partitioner = new HashPartitioner(numPartitions);
>       rdd = input.repartitionAndSortWithinPartitions(partitioner);
>     }
>     return rdd;
>   }
> {code}
> In the Spark history server, I saw 28 tasks in the repartition sort phase: 21
> of them finished in less than 1s, while the remaining 7 took a long time to
> execute. Is there any way to assign the data evenly to every partition?