[ https://issues.apache.org/jira/browse/HIVE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xuefu Zhang updated HIVE-7526:
------------------------------
       Resolution: Fixed
    Fix Version/s: spark-branch
           Status: Resolved  (was: Patch Available)

I manually tested patch #5, and the basic query works. I left the implementation related to sorting (commented in the code) to Rui, as he is working on sorting and is waiting for this.

Patch is committed to trunk. Thanks to Chao for the patch.

> Research to use groupby transformation to replace Hive existing partitionByKey and SparkCollector combination
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-7526
>                 URL: https://issues.apache.org/jira/browse/HIVE-7526
>             Project: Hive
>          Issue Type: Task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Chao
>             Fix For: spark-branch
>
>         Attachments: HIVE-7526.2.patch, HIVE-7526.3.patch, HIVE-7526.4-spark.patch, HIVE-7526.5-spark.patch, HIVE-7526.patch
>
>
> Currently SparkClient shuffles data by calling partitionByKey(). This
> transformation outputs <key, value> tuples. However, Hive's ExecMapper
> expects <key, iterator<value>> tuples, and Spark's groupByKey() seems to
> output these directly. Thus, by using groupByKey(), we may be able to avoid
> Hive's own key clustering mechanism (in HiveReduceFunction). This task is
> to try that approach.
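
The difference the issue description draws between flat <key, value> tuples and grouped <key, iterator<value>> tuples can be sketched in plain Java. This is only an illustration built on java.util collections: it does not call Spark or Hive, and the class and method names (GroupingSketch, clusterSorted, groupByKey, pair) are hypothetical and are not taken from the patch. The first method stands in for a manual clustering step over key-sorted pairs, which is an assumption about what such a mechanism roughly has to do; the second yields one entry per key with all of its values, which is the shape the description says the consumer expects.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupingSketch {

    /** Builds one flat <key, value> tuple (hypothetical helper). */
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    /**
     * Manual key clustering: walks tuples that are ALREADY sorted by key and
     * opens a new group each time the key changes. Correct only for sorted input.
     */
    static Map<String, List<Integer>> clusterSorted(List<Map.Entry<String, Integer>> sortedPairs) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        String currentKey = null;
        List<Integer> currentValues = null;
        for (Map.Entry<String, Integer> p : sortedPairs) {
            if (!p.getKey().equals(currentKey)) {
                currentKey = p.getKey();
                currentValues = new ArrayList<>();
                groups.put(currentKey, currentValues);
            }
            currentValues.add(p.getValue());
        }
        return groups;
    }

    /**
     * Direct grouping: collects every value under its key, with no ordering
     * requirement on the input. Keys appear in first-seen order.
     */
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> sorted =
                Arrays.asList(pair("a", 1), pair("a", 2), pair("b", 3));
        // Both approaches agree when the input is sorted by key.
        System.out.println(clusterSorted(sorted));
        System.out.println(groupByKey(sorted));
    }
}
```

With key-sorted input the two methods return the same groups; the point of the sketch is that when the shuffle itself hands over grouped tuples, the consumer no longer needs the key-change bookkeeping of the first method.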