[ https://issues.apache.org/jira/browse/HIVE-16980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071155#comment-16071155 ]
Rui Li commented on HIVE-16980:
-------------------------------

Hi [~kellyzly], in the attached query plan, there is only 1 reducer for Reducer 3:
{noformat}
Reducer 3 <- Map 8 (PARTITION-LEVEL SORT, 1), Reducer 2 (PARTITION-LEVEL SORT, 1)
{noformat}
Do you know why we use only 1 reducer to do the join in Reducer 3? Can you try forcing Hive to use more reducers in this stage? An easy way to do that is to manually set {{mapreduce.job.reduces}}.

> The partition of join is not divided evently in HOS
> ---------------------------------------------------
>
>                 Key: HIVE-16980
>                 URL: https://issues.apache.org/jira/browse/HIVE-16980
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>         Attachments: HIVE-16980_screenshot.png, query17_explain.log
>
> In HoS, the join is implemented as union + repartition sort. We use HashPartitioner to partition the result of the union.
> SortByShuffler.java
> {code}
> public JavaPairRDD<HiveKey, BytesWritable> shuffle(
>     JavaPairRDD<HiveKey, BytesWritable> input, int numPartitions) {
>   JavaPairRDD<HiveKey, BytesWritable> rdd;
>   if (totalOrder) {
>     if (numPartitions > 0) {
>       if (numPartitions > 1 && input.getStorageLevel() == StorageLevel.NONE()) {
>         input.persist(StorageLevel.DISK_ONLY());
>         sparkPlan.addCachedRDDId(input.id());
>       }
>       rdd = input.sortByKey(true, numPartitions);
>     } else {
>       rdd = input.sortByKey(true);
>     }
>   } else {
>     Partitioner partitioner = new HashPartitioner(numPartitions);
>     rdd = input.repartitionAndSortWithinPartitions(partitioner);
>   }
>   return rdd;
> }
> {code}
> In the Spark history server, I saw that there are 28 tasks in the repartition-sort stage; 21 of them finished in less than 1s while the remaining 7 took a long time to execute. Is there any way to make the data evenly assigned to every partition?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
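For context, the skew described above follows directly from hash partitioning: a minimal sketch, assuming the standard semantics of Spark's {{org.apache.spark.HashPartitioner}} (partition = non-negative {{hashCode % numPartitions}}). The class and key names below are hypothetical, for illustration only. It shows why every row sharing one hot join key lands in the same partition no matter how many partitions are configured — raising the reducer count spreads distinct keys, but cannot split a single skewed key.

```java
import java.util.HashMap;
import java.util.Map;

public class HashPartitionDemo {
    // Same modulo rule Spark's HashPartitioner uses, so negative
    // hashCode values still map into the range [0, numPartitions).
    static int partitionFor(Object key, int numPartitions) {
        int r = key.hashCode() % numPartitions;
        return r < 0 ? r + numPartitions : r;
    }

    public static void main(String[] args) {
        int numPartitions = 28;
        Map<Integer, Integer> rowsPerPartition = new HashMap<>();
        // Skewed input: most rows share one join key (hypothetical name),
        // so they all hash to the same partition.
        String hotKey = "customer_42";
        for (int i = 0; i < 1000; i++) {
            int p = partitionFor(hotKey, numPartitions);
            rowsPerPartition.merge(p, 1, Integer::sum);
        }
        // All 1000 rows end up in a single partition.
        System.out.println("partitions used for hot key: " + rowsPerPartition.size());
    }
}
```

Under this assumption, increasing {{mapreduce.job.reduces}} helps only when the long-running tasks hold many distinct keys; a single dominant key would still produce one straggler task.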