[
https://issues.apache.org/jira/browse/PIG-162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586502#action_12586502
]
shravanmn edited comment on PIG-162 at 4/7/08 12:27 PM:
-----------------------------------------------------------------------------
After some thought, here is what I think. The solution that works in all cases
is to store the outputs of the split right away into separate HDFS files and
add suitable loads at the other end. However, if the split is in the map phase,
the store and load would probably be more expensive than creating multiple
pipelines, each a replica of the map-phase pipeline below the split, and
attaching each replica with the appropriate filter to the other end. The same
applies when there is a diamond structure. In fact, if the split occurs in the
reduce phase, we only need to replicate the pipeline below the split up to the
Map-Reduce boundary. But this solution would not do well in scenario one
pointed out by Alan: there is no pipeline on the other end for r1 when store q1
is executed, so in the worst case the last job in the job DAG that contains the
split might run again.
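To make the comparison concrete, here is a minimal sketch of the two plan
shapes. It is not Pig code: operators are plain strings, the function names,
the temporary file names and the "boundary" index are all hypothetical, and it
assumes that "the pipeline below the split" means the operators that feed the
split.

```python
from typing import List


def store_and_load(upstream: List[str], conditions: List[str]) -> List[List[str]]:
    """Run the upstream pipeline once and materialize every split output.

    Returns the writer pipeline followed by one reader pipeline per output.
    """
    writer = list(upstream)
    readers = []
    for i, cond in enumerate(conditions):
        tmp = f"tmp_split_{i}"  # hypothetical temporary HDFS file name
        writer.append(f"store[{cond}] -> {tmp}")
        readers.append([f"load {tmp}"])
    return [writer] + readers


def replicate(upstream: List[str], conditions: List[str], boundary: int = 0) -> List[List[str]]:
    """Copy the operators feeding the split once per output.

    `boundary` is the (assumed) index of the first operator after the last
    Map-Reduce boundary; only operators from there on are replicated, and
    each replica ends with the filter for its split output.
    """
    segment = upstream[boundary:]
    return [segment + [f"filter {cond}"] for cond in conditions]
```

With a split in the reduce phase, only the reduce-side operators would be
copied, for example replicate(["load A", "group", "foreach"], ["x > 0", "x <= 0"],
boundary=2) yields two pipelines that each start at "foreach". The store and
load variant always costs one extra write and read per split output.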
Also, the solution that checks for diamond structures and differentiates
between map and reduce occurrences will take time to implement, because the
Physical to MR translation code I wrote a few days back did not consider this
kind of optimization. It would probably take a couple of days to modify it.
I have attached a
[figure|https://issues.apache.org/jira/secure/attachment/12379589/split.png]
that shows the replication idea for a diamond structure with the split
occurring in the reduce phase.
So please suggest whether it is worth modifying the translation code now, or
implementing the store solution and pushing this optimization either to the
optimization layer or to a later point in time. Note that the current Pig trunk
code also uses the store and load approach.
> Rework mapreduce submission and monitoring
> ------------------------------------------
>
> Key: PIG-162
> URL: https://issues.apache.org/jira/browse/PIG-162
> Project: Pig
> Issue Type: Sub-task
> Environment: This bug tracks work to rework the submission and
> monitoring interface to map reduce as described in
> http://wiki.apache.org/pig/PigTypesFunctionalSpec
> Reporter: Alan Gates
> Assignee: Alan Gates
> Attachments: split.png
>
>