[jira] [Commented] (HIVE-8118) SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized with multiple result collectors [Spark Branch]

2014-09-16 Thread Chengxiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135151#comment-14135151
 ] 

Chengxiang Li commented on HIVE-8118:
-

Actually, we could generate a Spark graph with one map RDD followed by multiple 
reduce RDDs; this should not be related to SparkMapRecordHandler and 
SparkReduceRecorderHandler, since we could wrap each reduce-side child operator 
in a separate HiveReduceFunction at the SparkCompiler level. 
For a map RDD that is followed by two reduce RDDs and then connected to a 
union RDD, Spark would compute the map RDD twice unless it is cached. If the 
two reduces share the same shuffle dependency (which means they have the same 
map output partitions), the job could theoretically be optimized to compute the 
map RDD only once, but I think that would be a Spark framework-level 
optimization. If the two reduce RDDs don't share the same shuffle dependency, 
the map RDD would be computed twice anyway. 
For the multi-insert case, if we wrap all FileSinkOperators into one RDD, the 
parent of the FileSinkOperators would forward rows to each FileSinkOperator, so 
the data source for the inserts would be generated only once. 
So I think we do not really need multiple result collectors for 
SparkMapRecorderHandler and SparkReduceRecordHandler.
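The forwarding idea in the paragraph above can be sketched in plain Python (the class and method names here are illustrative only, not Hive's actual operator API): a single parent operator pushes each row to every child sink, so the shared data source is computed only once.

```python
# Illustrative sketch, not Hive's real classes: a parent operator forwards
# every input row to each of its child "file sink" operators, so the shared
# data source of a multi-insert is scanned only once.
class FileSinkOperator:
    def __init__(self, name):
        self.name = name
        self.rows = []

    def process(self, row):
        # A real FileSinkOperator would write to its target table/file.
        self.rows.append(row)

class ForwardingParent:
    def __init__(self, children):
        self.children = children

    def process(self, row):
        # Forward the same row to every child branch.
        for child in self.children:
            child.process(row)

sink_a = FileSinkOperator("insert_into_t1")
sink_b = FileSinkOperator("insert_into_t2")
parent = ForwardingParent([sink_a, sink_b])

source = [("alice", 1), ("bob", 2)]  # the shared source, computed once
for row in source:
    parent.process(row)
```

Both sinks end up with the full data set, even though the source was iterated only once.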

 SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized 
 with multiple result collectors [Spark Branch]
 

 Key: HIVE-8118
 URL: https://issues.apache.org/jira/browse/HIVE-8118
 Project: Hive
  Issue Type: Bug
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
  Labels: Spark-M1

 In the current implementation, both SparkMapRecordHandler and 
 SparkReduceRecorderHandler take only one result collector, which limits the 
 corresponding map or reduce task to a single child. It's very common in 
 multi-insert queries for a map/reduce task to have more than one child. A 
 query like the following has two map tasks as parents:
 {code}
 select name, sum(value) from dec group by name union all select name, value 
 from dec order by name
 {code}
 It's possible that in the future an optimization may be implemented so that a 
 map work is followed by two reduce works and then connected to a union work. 
 Thus, we should treat this as the general case. Tez currently provides a 
 collector for each child operator in the map-side or reduce-side operator 
 tree. We can take Tez as a reference.
 This is likely a big change, and subtasks are possible. 
 With this, we can have a simpler and cleaner multi-insert implementation. This 
 is also the problem observed in HIVE-7731.
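The requested change can be sketched roughly as follows (plain Python with hypothetical names, not Hive's real classes): the record handler is constructed with one collector per child branch instead of a single collector.

```python
# Hypothetical sketch of the requested change; names are illustrative.
# The handler is initialized with one collector per child of the
# map/reduce task, instead of a single result collector.
class Collector:
    def __init__(self):
        self.results = []

    def collect(self, key, value):
        self.results.append((key, value))

class MapRecordHandler:
    def __init__(self, collectors):
        # One collector per child operator branch.
        self.collectors = collectors

    def process(self, row):
        # Each branch may transform the row differently; here branch i
        # simply tags the row with its index for illustration.
        for i, collector in enumerate(self.collectors):
            collector.collect(i, row)

branch1, branch2 = Collector(), Collector()
handler = MapRecordHandler([branch1, branch2])
handler.process("row-0")
```

Each downstream reduce work would then read only from its own collector, rather than all children sharing one merged output.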



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8118) SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized with multiple result collectors [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135517#comment-14135517
 ] 

Xuefu Zhang commented on HIVE-8118:
---

Hi [~chengxiang li],

Thank you for your input. I'm not sure I understand your thought correctly. Let 
me clarify the problem by giving a SparkWork like this:
{code}
MapWork1 -> ReduceWork1
        \-> ReduceWork2
{code}
It means that MapWork1 will generate different datasets to feed ReduceWork1 
and ReduceWork2. In the multi-insert case, ReduceWork1 and ReduceWork2 will 
each have a FS operator. Inside MapWork1, there will be two operator branches 
consuming the same data and pushing different data sets to two RS operators. 
(ReduceWork1 and ReduceWork2 have different HiveReduceFunctions.)

However, the current implementation only takes the first data set and feeds it 
to both reduce works. The same problem can also occur if MapWork1 were a reduce 
work following another ReduceWork or MapWork.

With this problem, I'm not sure how we can get around it without letting 
MapWork1 generate two output RDDs, one for each following reduce work. 
Potentially, we could duplicate MapWork1 and have the following diagram:
{code}
MapWork11 -> ReduceWork1
MapWork12 -> ReduceWork2
{code}
where MapWork11 and MapWork12 consume the same input table (the input table as 
an RDD), and feed their output RDDs to ReduceWork1 and ReduceWork2 
respectively. This has its complexity, but more importantly, there will be 
wasted READ (unless Spark is smart enough to cache the input table, which is 
unlikely) and COMPUTATION (computing the data twice). I feel it's unlikely that 
we'll get such optimizations from the Spark framework in the near term.

Thus, I think we have to take into consideration that a map work or a reduce 
work might generate multiple RDDs, one feeding each of its children. Since 
SparkMapRecorderHandler and SparkReduceRecordHandler do the data processing on 
the map and reduce side, they need a way to generate multiple outputs.

Please correct me if I understood you wrong. Thanks.
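One way to picture the tagged multi-output idea is a plain-Python simulation with made-up data (no Spark involved): MapWork1 emits each output record with a branch tag, and each downstream reduce work receives only the records for its own branch, rather than a single merged data set.

```python
# Illustrative simulation of the SparkWork above, using invented data:
# MapWork1 has two operator branches ending in two RS (reduce sink)
# operators, each pushing a *different* data set. Tagging each output
# record with its branch lets ReduceWork1 and ReduceWork2 receive only
# their own data.
rows = [("alice", 3), ("bob", 5)]

def map_work1(row):
    name, value = row
    # Branch 1 (feeds ReduceWork1): key by name, e.g. for an aggregation.
    yield (1, (name, value))
    # Branch 2 (feeds ReduceWork2): key by value, a different shuffle key.
    yield (2, (value, name))

tagged = [out for row in rows for out in map_work1(row)]
# Route by tag: each reduce work sees only its own branch's records.
to_reduce_work1 = [rec for tag, rec in tagged if tag == 1]
to_reduce_work2 = [rec for tag, rec in tagged if tag == 2]
```

The current single-collector behavior would correspond to handing the whole `tagged` list to both reduce works, which is exactly the bug described above.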




[jira] [Commented] (HIVE-8118) SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized with multiple result collectors [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135849#comment-14135849
 ] 

Xuefu Zhang commented on HIVE-8118:
---

[~chengxiang li] and I had an offline discussion; there was just a little 
confusion in understanding the problem, and now we are on the same page. To 
summarize, the problem arises when a map work or reduce work is connected to 
multiple reduce works. Currently, a map work or reduce work is wired with only 
one collector, which collects all data regardless of the branch. That single 
data set then feeds all subsequent child reduce works.
 
I also noted that Tez provides a name-to-output-collector map to its record 
handlers. However, we may not be able to do the same, due to the limitations 
of Spark's RDD transformation APIs.
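For comparison, the Tez-style name-to-collector map mentioned above might look like this minimal Python sketch (all names are invented for illustration; this is not Tez's or Hive's actual API):

```python
# Hedged sketch of a name -> output collector map, as described for Tez:
# each child operator branch looks up its own collector by name instead
# of all branches sharing one collector. Names here are made up.
class Collector:
    def __init__(self):
        self.out = []

    def collect(self, record):
        self.out.append(record)

collectors = {"ReduceWork1": Collector(), "ReduceWork2": Collector()}

def emit(child_name, record):
    # An operator branch writes only to the collector registered
    # under its child's name.
    collectors[child_name].collect(record)

emit("ReduceWork1", ("alice", 1))
emit("ReduceWork2", ("bob", 2))
```

Whether this shape fits Hive on Spark depends on the RDD transformation API limitation noted above, since each HiveMapFunction/HiveReduceFunction produces a single output iterator.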

