[ 
https://issues.apache.org/jira/browse/PIG-162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583232#action_12583232
 ] 

Shravan Matthur Narayanamurthy commented on PIG-162:
----------------------------------------------------

What would be the best way to implement the Split operator? The problem 
with implementing it as an operator is the buffering required. Since we 
are following the single-threaded model, a blocking getNext by, say, a filter 
operator might actually read all the tuples from the split, which can very 
well be on the reduce side. Since the other branch of the split will execute 
only after the filter, there is no option but to buffer all the tuples.
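
To make the problem concrete, here is a minimal sketch in plain Java. It does 
not use Pig's operator classes; SharedInputDemo and drainTwoBranches are 
hypothetical names, and a java.util.Iterator stands in for the split's input. 
The first branch pulls until the input is exhausted, so the second branch 
sees nothing unless the tuples were buffered somewhere.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not part of Pig.
final class SharedInputDemo {
    // Returns {tuples seen by the first branch, tuples seen by the second}.
    static int[] drainTwoBranches(List<Integer> tuples) {
        // A single pull-based input shared by both branches of the split.
        Iterator<Integer> shared = tuples.iterator();

        // The first branch (e.g. a filter) runs to completion in a single thread.
        int first = 0;
        while (shared.hasNext()) {
            shared.next();
            first++;
        }

        // The second branch runs afterwards and finds the input exhausted.
        int second = 0;
        while (shared.hasNext()) {
            shared.next();
            second++;
        }
        return new int[] {first, second};
    }
}
```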

One way would be to replicate the pipeline during the logical to physical 
translation.
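
A rough sketch of what replication could look like, again in plain Java 
rather than against Pig's plan classes. ReplicatedSplit and newBranchInput 
are hypothetical names, and the Supplier is an assumption standing in for 
"rebuild the upstream pipeline". Each branch gets its own copy of the input, 
so nothing is buffered, but the upstream work is repeated once per branch.

```java
import java.util.Iterator;
import java.util.function.Supplier;

// Hypothetical illustration only; not part of Pig.
final class ReplicatedSplit<T> {
    // Assumption: the upstream pipeline can be re-created on demand.
    private final Supplier<Iterator<T>> upstream;

    ReplicatedSplit(Supplier<Iterator<T>> upstream) {
        this.upstream = upstream;
    }

    // Every branch of the split pulls from its own copy of the pipeline.
    Iterator<T> newBranchInput() {
        return upstream.get();
    }
}
```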

Another would be to construct a DataBag explicitly inside the Split and store 
all tuples from its input in the bag, then attach the bag's iterator to the 
split readers. But this doesn't sound very efficient to me.
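
The bag approach might look roughly like this. It is a sketch only: 
BufferingSplit and reader are hypothetical names, and an in-memory ArrayList 
stands in for the bag. The whole input is materialized up front, which is 
exactly the cost I am worried about.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not part of Pig.
final class BufferingSplit<T> {
    // Stand-in for the bag: holds every tuple from the split's input.
    private final List<T> bag = new ArrayList<>();

    BufferingSplit(Iterator<T> input) {
        // Drain the input completely before any reader runs.
        while (input.hasNext()) {
            bag.add(input.next());
        }
    }

    // Each split reader gets an independent iterator over the buffered tuples.
    Iterator<T> reader() {
        return bag.iterator();
    }
}
```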

Another one would be to differentiate the split processing in the map and 
reduce phases. On the map side, we can follow the above approach of using a 
bag, since the amount of data is restricted. On the reduce side, since we will 
have only one package, we can use plan folding. That is, make the plan that 
the split operator feeds an attribute plan of the split. getNext() on the 
split will read a tuple, attach it to the attribute plan, and return whatever 
the plan's root operator's getNext returns. The folded plan can be implemented 
as on the map side.
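
Plan folding could be sketched as follows, under some loud assumptions: 
FoldedSplit is a hypothetical class, the attribute plan is modelled as a 
plain function from one input tuple to an optional result (empty meaning the 
plan dropped the tuple), and an empty return from getNext means end of input. 
None of this is Pig's actual operator API.

```java
import java.util.Iterator;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical illustration only; not part of Pig.
final class FoldedSplit<T, R> {
    private final Iterator<T> input;
    // Assumption: the folded (attribute) plan processes one tuple at a time.
    private final Function<T, Optional<R>> attributePlan;

    FoldedSplit(Iterator<T> input, Function<T, Optional<R>> attributePlan) {
        this.input = input;
        this.attributePlan = attributePlan;
    }

    // Reads a tuple, attaches it to the folded plan, and returns what the
    // plan produces. Tuples the plan drops are skipped. No buffering needed.
    Optional<R> getNext() {
        while (input.hasNext()) {
            Optional<R> result = attributePlan.apply(input.next());
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty(); // end of input
    }
}
```

The point of the design is that the split never holds more than one tuple at 
a time; the price is that the downstream plan has to live inside the split.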

Any suggestions?

> Rework mapreduce submission and monitoring
> ------------------------------------------
>
>                 Key: PIG-162
>                 URL: https://issues.apache.org/jira/browse/PIG-162
>             Project: Pig
>          Issue Type: Sub-task
>         Environment: This bug tracks work to rework the submission and 
> monitoring interface to map reduce as described in  
> http://wiki.apache.org/pig/PigTypesFunctionalSpec
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
