[
https://issues.apache.org/jira/browse/HIVE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135849#comment-14135849
]
Xuefu Zhang commented on HIVE-8118:
-----------------------------------
[~chengxiang li] and I had an offline discussion. There was a little confusion
about the problem, but we are now on the same page. To summarize, the problem
arises when a map work or reduce work is connected to multiple reduce works.
Currently, a map work or reduce work is wired to only one collector, which
collects all data regardless of the branch. That data set is then fed to all
subsequent child reduce works.
I also noted that Tez provides a <name, outputcollector> map to its record
handlers. However, we may not be able to do the same because of the
limitations of Spark's RDD transformation APIs.
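To make the <name, outputcollector> idea concrete, here is a minimal Java
sketch of a handler that routes each row to the collector of the child work it
belongs to, instead of sending everything to a single shared collector. It is
only an illustration under stated assumptions: Collector, BufferingCollector,
MultiCollectorHandler, and the child names "Reducer 2" and "Reducer 3" are all
hypothetical and are not the Hive, Tez, or Hadoop classes. It also does not
address the Spark RDD limitation mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiCollectorSketch {

    /** Hypothetical stand-in for a result collector; not the Hadoop or Hive interface. */
    interface Collector {
        void collect(String key, String value);
    }

    /** Buffers rows as "key=value" strings so the routing can be inspected. */
    static final class BufferingCollector implements Collector {
        final List<String> rows = new ArrayList<>();

        @Override
        public void collect(String key, String value) {
            rows.add(key + "=" + value);
        }
    }

    /** Hypothetical handler wired with one collector per named child work. */
    static final class MultiCollectorHandler {
        private final Map<String, Collector> collectors;

        MultiCollectorHandler(Map<String, Collector> collectors) {
            this.collectors = collectors;
        }

        /** Emits a row only to the collector of the branch that produced it. */
        void emit(String childName, String key, String value) {
            Collector collector = collectors.get(childName);
            if (collector == null) {
                throw new IllegalArgumentException("No collector for child: " + childName);
            }
            collector.collect(key, value);
        }
    }

    /** Routes a few rows to two hypothetical children and returns their buffers. */
    static Map<String, List<String>> demo() {
        BufferingCollector groupByBranch = new BufferingCollector();
        BufferingCollector orderByBranch = new BufferingCollector();

        Map<String, Collector> collectors = new LinkedHashMap<>();
        collectors.put("Reducer 2", groupByBranch);
        collectors.put("Reducer 3", orderByBranch);

        MultiCollectorHandler handler = new MultiCollectorHandler(collectors);
        handler.emit("Reducer 2", "a", "1");
        handler.emit("Reducer 3", "a", "1");
        handler.emit("Reducer 3", "b", "2");

        Map<String, List<String>> result = new LinkedHashMap<>();
        result.put("Reducer 2", groupByBranch.rows);
        result.put("Reducer 3", orderByBranch.rows);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a single collector, both children would see all three rows. With one
collector per child, each child receives only the rows emitted for its branch.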
> SparkMapRecordHandler and SparkReduceRecordHandler should be initialized
> with multiple result collectors [Spark Branch]
> -------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-8118
> URL: https://issues.apache.org/jira/browse/HIVE-8118
> Project: Hive
> Issue Type: Bug
> Components: Spark
> Reporter: Xuefu Zhang
> Assignee: Venki Korukanti
> Labels: Spark-M1
>
> In the current implementation, both SparkMapRecordHandler and
> SparkReduceRecordHandler take only one result collector, which means that
> the corresponding map or reduce task can have only one child. It's very
> common in multi-insert queries for a map/reduce task to have more than one
> child. A query like the following has two map tasks as parents:
> {code}
> select name, sum(value) from dec group by name union all select name, value
> from dec order by name
> {code}
> It's possible that in the future an optimization may be implemented so that
> a map work is followed by two reduce works and then connected to a union
> work. Thus, we should treat this as the general case. Tez currently provides
> a collector for each child operator in the map-side or reduce-side operator
> tree. We can take Tez as a reference.
> This is likely a big change, and subtasks are possible.
> With this, we can have a simpler and cleaner multi-insert implementation.
> This is also the problem observed in HIVE-7731.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)