[ https://issues.apache.org/jira/browse/PIG-162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583928#action_12583928 ]
Alan Gates commented on PIG-162:
--------------------------------
It seems to me that split can be used in two different ways:
1)
split p into q if $0 > 5, r if $0 <= 5;
q1 = do something to q;
r1 = do something to r;
store q1 into 'bla';
store r1 into 'bla';
2)
split p into q if $0 > 5, r if $0 <= 5;
q1 = do something to q;
r1 = do something to r;
s = cogroup q1 by $1, r1 by $1;
I suspect the first case is more likely than the second (though I don't
know that for certain). In the first case pig will have to read all of the data
through the split when it sees the first store. The user most likely expects
that pig will not repeat the processing above the split when it encounters
'store r1'. In this case it seems like pig should immediately write the r
relation to an HDFS file that can subsequently be read by the r1
transformation. There's no use trying to buffer it in memory, as it's most
likely too large, and the engine has a number of things to do (everything on
the q1 branch) before it can return to processing the tuples in r.
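
As a rough illustration of the first case, here is a minimal Python sketch
(not Pig code). It assumes the input is a plain iterator of tuples and uses a
local temporary file as a stand-in for the HDFS file. The names run_split,
process_q and process_r are hypothetical. The input is read exactly once: rows
for q are streamed through, rows for r are written out and read back later.

```python
import pickle
import tempfile


def run_split(rows, cond, process_q, process_r):
    """Read `rows` once; stream the q branch, spill the r branch to a file.

    Assumes process_q consumes its whole input (as a store would), so every
    r row has been written out before the r branch starts.
    """
    with tempfile.TemporaryFile() as spill:  # stand-in for an HDFS file
        def q_branch():
            for row in rows:
                if cond(row):
                    yield row                 # goes straight to the q pipeline
                else:
                    pickle.dump(row, spill)   # set aside for the r pipeline

        q_out = list(process_q(q_branch()))   # corresponds to "store q1"

        spill.seek(0)

        def r_branch():
            while True:
                try:
                    yield pickle.load(spill)
                except EOFError:
                    return

        r_out = list(process_r(r_branch()))   # corresponds to "store r1"
    return q_out, r_out
```

The point of the sketch is only that the work above the split runs once, and
that nothing from r is held in memory while the q1 branch runs.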
In the second case, we need the results from q1 and r1 simultaneously. Reading
the split may still get somewhat out of sync, as all the rows destined for q1
may happen to be read before any rows destined for r1. But it seems reasonable to hope
not. Perhaps the best solution here is to implement a QueueDataBag that
extends DefaultBag, the difference being that when a record has been read from
the iterator it is removed from the bag. This would provide spilling in the
case that the reads for the two sides go too far out of sync, but allow the
engine to keep everything in memory when possible.
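
To make the QueueDataBag idea concrete, here is a minimal Python sketch of the
behaviour described above. It is not Pig's DefaultBag and does not extend it.
The class name QueueBag, the max_in_memory parameter and the use of a local
temporary file for spilling are all assumptions made for illustration. Records
come out in the order they went in, a record is gone from the bag once read,
and anything beyond the memory limit is spilled to disk.

```python
import pickle
import tempfile
from collections import deque


class QueueBag:
    """FIFO bag: reading removes the record; overflow spills to a temp file."""

    def __init__(self, max_in_memory=1000):
        self._max = max_in_memory
        self._mem = deque()     # oldest unread records
        self._spill = None      # temp file with newer unread records
        self._spilled = 0       # unread records still in the file
        self._read_pos = 0      # file offset of the next unread record

    def add(self, rec):
        # Once anything is spilled, later records must follow it to keep order.
        if self._spilled == 0 and len(self._mem) < self._max:
            self._mem.append(rec)
            return
        if self._spill is None:
            self._spill = tempfile.TemporaryFile()
        self._spill.seek(0, 2)  # append at end of file
        pickle.dump(rec, self._spill)
        self._spilled += 1

    def pop(self):
        """Return and remove the oldest record; IndexError if the bag is empty."""
        if not self._mem and self._spilled:
            self._refill()
        return self._mem.popleft()

    def _refill(self):
        self._spill.seek(self._read_pos)
        while self._spilled and len(self._mem) < self._max:
            self._mem.append(pickle.load(self._spill))
            self._spilled -= 1
        self._read_pos = self._spill.tell()
        if self._spilled == 0:  # file fully consumed: reclaim the space
            self._spill.seek(0)
            self._spill.truncate()
            self._read_pos = 0

    def __len__(self):
        return len(self._mem) + self._spilled

    def __iter__(self):
        # Destructive iteration: each record is removed as it is yielded.
        while len(self):
            yield self.pop()
```

If the two sides of the cogroup stay roughly in step, the bag never touches
disk. The file is only used when one side runs far ahead of the other.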
> Rework mapreduce submission and monitoring
> ------------------------------------------
>
> Key: PIG-162
> URL: https://issues.apache.org/jira/browse/PIG-162
> Project: Pig
> Issue Type: Sub-task
> Environment: This bug tracks work to rework the submission and
> monitoring interface to map reduce as described in
> http://wiki.apache.org/pig/PigTypesFunctionalSpec
> Reporter: Alan Gates
> Assignee: Alan Gates
>