[ 
https://issues.apache.org/jira/browse/PIG-162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586502#action_12586502
 ] 

shravanmn edited comment on PIG-162 at 4/7/08 12:27 PM:
-----------------------------------------------------------------------------

After some thought, here is what I think. The solution that works for all cases 
is to store the output of the split right away into separate HDFS files and add 
suitable loads at the other end. However, if the split is in the map phase, the 
store and load would probably be more expensive than creating multiple 
pipelines, each a replica of the map-phase pipeline below the split, and 
attaching them with appropriate filters to the other end. The same applies if 
there is a diamond structure. In fact, if the split occurs in the reduce phase, 
we only need to replicate the pipeline below the split up to the Map-Reduce 
boundary. But this solution would not do well in scenario one pointed out by 
Alan: there is no pipeline on the other end for r1 when store q1 is executed, 
so in the worst case the last job in the job DAG that contains the split might 
run again.
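
To make the two options concrete, here is a minimal Python sketch. It is not 
Pig code: the function names, the in-memory "files", and the branch names 
`small` and `large` are all hypothetical, and it assumes that "the pipeline 
below the split" means the operator chain feeding the split within the same 
phase. It only illustrates that both strategies yield the same branch outputs 
while paying different costs.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, int]
Op = Callable[[Iterable[Record]], Iterable[Record]]
Cond = Callable[[Record], bool]


def run_pipeline(ops: List[Op], records: Iterable[Record]) -> List[Record]:
    """Apply each operator in order and materialize the result."""
    for op in ops:
        records = op(records)
    return list(records)


def split_via_store_and_load(
    pipeline: List[Op], conditions: Dict[str, Cond], source: List[Record]
) -> Dict[str, List[Record]]:
    """Run the shared pipeline once, 'store' one output per branch, then 'load' it."""
    rows = run_pipeline(pipeline, source)
    # Hypothetical stand-in for one HDFS file per split branch.
    stored = {name: [r for r in rows if cond(r)] for name, cond in conditions.items()}
    # The 'load' on the other end simply reads the stored rows back.
    return {name: list(branch_rows) for name, branch_rows in stored.items()}


def split_via_replication(
    pipeline: List[Op], conditions: Dict[str, Cond], source: List[Record]
) -> Dict[str, List[Record]]:
    """Give each branch its own replica of the pipeline, ending in a filter."""
    result = {}
    for name, cond in conditions.items():
        # Bind cond as a default argument so each filter keeps its own condition.
        branch_filter: Op = lambda rs, c=cond: (r for r in rs if c(r))
        result[name] = run_pipeline(list(pipeline) + [branch_filter], source)
    return result


# Toy input: the shared pipeline doubles x, then the split has two branches.
source = [{"x": 1}, {"x": 2}, {"x": 3}]
pipeline = [lambda rs: ({"x": r["x"] * 2} for r in rs)]
conditions = {"small": lambda r: r["x"] <= 4, "large": lambda r: r["x"] > 4}

stored = split_via_store_and_load(pipeline, conditions, source)
replicated = split_via_replication(pipeline, conditions, source)
```

In this sketch the replication variant recomputes the shared pipeline once per 
branch, while the store-and-load variant computes it once but pays for writing 
and re-reading every branch output. That is the trade-off discussed above.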

Also, implementing the solution where we check for diamond structures and 
differentiate between map and reduce occurrences will take time, as the code 
for the Physical to MR translation that I wrote a few days back did not 
consider this kind of optimization. It would probably take a couple of days to 
modify it.

I have attached a 
[figure|https://issues.apache.org/jira/secure/attachment/12379589/split.png] 
that shows the replication idea for a diamond structure with the split 
occurring in the reduce phase.

So please suggest whether it's worth modifying the translation now, or 
implementing the store solution and pushing this optimization either to the 
optimization layer or to a later point in time. Note that the current Pig 
trunk code also uses the store and load approach.

  
> Rework mapreduce submission and monitoring
> ------------------------------------------
>
>                 Key: PIG-162
>                 URL: https://issues.apache.org/jira/browse/PIG-162
>             Project: Pig
>          Issue Type: Sub-task
>         Environment: This bug tracks work to rework the submission and 
> monitoring interface to map reduce as described in 
> http://wiki.apache.org/pig/PigTypesFunctionalSpec
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: split.png
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
