Pig currently allows implicit splits within the foreach block. An
example that illustrates this behaviour follows:
    A = load 'input1';
    B = load 'input2';
    C = cogroup A by $0, B by $0;
    D = foreach C do {
        XX = filter A by $0 > 5;
        -- at this point, there is an implicit split in the foreach plan
        XY = filter B by $0 > 5;
        -- here the generate needs to handle the merge, as its inputs
        -- are from XX and XY
        generate XX.$1, XY.$1;
    }
Notice that there is an implicit split in the foreach plan: each input
tuple from C has to be piped to both XX and XY. The generate then has
to handle the merge, since both XX and XY serve as its inputs. The
inputs to generate therefore form a DAG and not a tree.
      Generate
       /    \
     XX      XY
       \    /
      Foreach
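
To make the split and merge concrete, here is a rough Python sketch of
what the nested plan does for a single tuple of C. It is only a model
of the plan shape, not Pig code: the function names (filter_bag,
project, foreach_generate) and the sample tuple are made up for
illustration, and bags are modelled as plain lists of tuples.

```python
# Hypothetical model of one cogrouped tuple: (group key, bag from A, bag from B).
# None of these names are Pig APIs; they only illustrate the plan shape.

def filter_bag(bag, threshold=5):
    """Keep tuples whose first field ($0) exceeds the threshold."""
    return [t for t in bag if t[0] > threshold]

def project(bag, index):
    """Project one field out of every tuple in a bag, like XX.$1."""
    return [(t[index],) for t in bag]

def foreach_generate(cogrouped):
    """Run the nested plan for a single tuple of C."""
    group, bag_a, bag_b = cogrouped
    # Split: the same input tuple feeds both filter branches.
    xx = filter_bag(bag_a)
    xy = filter_bag(bag_b)
    # Merge: generate needs both branch results before it can emit a tuple.
    return (project(xx, 1), project(xy, 1))

# Made-up sample tuple of C for group key 7.
c_tuple = (7, [(7, "a1"), (7, "a2")], [(7, "b1")])
print(foreach_generate(c_tuple))
```

In this model the merge is trivial because both branches run to
completion in one function call. In a pipelined operator plan the
generate would instead have to pull from two upstream operators that
share one source, which is where the extra complexity comes from.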
This makes the execution pipeline fairly complex, because the generate
has to merge two branches that are fed from the same input tuple. Should
we restrict the usage so that DAGs are not allowed as input to the
generate?
Thoughts?
Thanks,
Santhosh