[ https://issues.apache.org/jira/browse/HIVE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135849#comment-14135849 ]
Xuefu Zhang commented on HIVE-8118:
-----------------------------------

[~chengxiang li] and I had an offline discussion. There was a little confusion about the problem, but we are now on the same page.

To summarize, the problem arises when a map work or reduce work is connected to multiple reduce works. Currently, a map work or reduce work is wired to only one collector, which collects all data regardless of the branch. That data set is fed to all subsequent child reduce works. I also noted that Tez provides a <name, outputcollector> map to its record handlers. However, we may not be able to do that, due to the limitations of Spark's RDD transformation APIs.

> SparkMapRecordHandler and SparkReduceRecordHandler should be initialized
> with multiple result collectors [Spark Branch]
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8118
>                 URL: https://issues.apache.org/jira/browse/HIVE-8118
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Venki Korukanti
>              Labels: Spark-M1
>
> In the current implementation, both SparkMapRecordHandler and
> SparkReduceRecordHandler take only one result collector, which means that
> the corresponding map or reduce task can have only one child. It's very
> common in multi-insert queries for a map/reduce task to have more than one
> child. A query like the following has two map tasks as parents:
> {code}
> select name, sum(value) from dec group by name union all select name, value
> from dec order by name
> {code}
> It's possible that in the future an optimization may be implemented so that a map
> work is followed by two reduce works and then connected to a union work.
> Thus, we should treat this as the general case. Tez currently provides a
> collector for each child operator in the map-side or reduce-side operator
> tree. We can take Tez as a reference.
> Likely this is a big change and subtasks are possible.
> With this, we can have a simpler and cleaner multi-insert implementation. This
> is also the problem observed in HIVE-7731.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
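
To make the distinction in the comment concrete, here is a minimal sketch contrasting a single shared collector with a Tez-style <name, collector> map. It is illustrative only: `Collector`, `ListCollector`, `CollectorSketch`, and the child names `reducer1`/`reducer2` are hypothetical stand-ins, not Hive, Tez, or Spark classes, and rows are modeled as `{childName, key, value}` string triples purely for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CollectorSketch {
    /** Hypothetical stand-in for a result collector (not Hive's actual interface). */
    interface Collector {
        void collect(String key, String value);
    }

    /** Stores rows in a list so we can inspect what a child work would receive. */
    static final class ListCollector implements Collector {
        final List<String> rows = new ArrayList<>();

        @Override
        public void collect(String key, String value) {
            rows.add(key + "=" + value);
        }
    }

    /** Single-collector wiring: every row lands in one collector, whatever its branch. */
    static List<String> singleCollector(String[][] taggedRows) {
        ListCollector only = new ListCollector();
        for (String[] row : taggedRows) {
            only.collect(row[1], row[2]); // the branch tag in row[0] is ignored
        }
        return only.rows;
    }

    /** Per-child wiring: one collector per child work, looked up by name. */
    static Map<String, List<String>> perChildCollectors(String[][] taggedRows,
                                                        String... children) {
        Map<String, ListCollector> collectors = new LinkedHashMap<>();
        for (String child : children) {
            collectors.put(child, new ListCollector());
        }
        for (String[] row : taggedRows) {
            // Assumes every tag names a registered child; a real handler would validate this.
            collectors.get(row[0]).collect(row[1], row[2]);
        }
        Map<String, List<String>> out = new LinkedHashMap<>();
        collectors.forEach((name, c) -> out.put(name, c.rows));
        return out;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"reducer1", "a", "1"},
            {"reducer2", "a", "1"},
            {"reducer1", "b", "2"},
        };
        System.out.println(singleCollector(rows));
        System.out.println(perChildCollectors(rows, "reducer1", "reducer2"));
    }
}
```

With a single collector, both children would be fed the full mixed data set; with the per-child map, each child sees only its own branch. Whether the second wiring can be expressed on top of Spark's RDD transformation APIs is exactly the open question raised in the comment.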