Analysis below.
Shravan M Narayanamurthy wrote:
Hi Guys,
I think we need to find a proper set of rules for the project's
schema. The following script covers more or less all the scenarios:
A = load 'a';
B = group A by $0;
C = foreach B {
C1 = filter A by $0>5;
C2 = distinct C1;
C3 = distinct A;
generate group, udf1(*), udf2(C2), udf3(C2.$1), udf4(C3), udf5(C3.$1);
}
I think we had not thought about the projection in the inner plan of
the filter. With this constraint, we need a new set of rules. Can you
post an algorithm that will set the return types of the projects?
Thanks & Regards,
--Shravan
<snip>
In this case, the foreach should have the following plans:
0 - proj(0)
1 - proj( * ) -> udf1
2 - proj (1) -> filter -> distinct -> proj( * ) -> udf2
3 - proj (1) -> filter -> distinct -> proj(1) -> udf3
4 - proj(1) -> distinct -> proj( * ) -> udf4
5 - proj(1) -> distinct -> proj(1) -> udf5
In plans 2 and 3, filter will have an inner plan of:
proj(0) -> gt, const(5) -> gt
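To make the plan shapes concrete, here is a toy walk-through of plans 0 and 3 in Python. It is only a sketch under my own assumptions: a tuple is modeled as a Python tuple, a bag as a list of tuples, and the helper names (proj, filter_gt, distinct, run_plan) and the sample data are made up for illustration. None of this is Pig's actual operator code.

```python
# Toy model (hypothetical, not Pig's implementation):
# a tuple is a Python tuple, a bag is a list of tuples.

def proj(col):
    """Project one column.

    Applied to a tuple, returns that field.
    Applied to a bag, returns a bag of one-field tuples.
    """
    def run(value):
        if isinstance(value, list):
            return [(t[col],) for t in value]
        return value[col]
    return run

def filter_gt(col, const):
    """Filter a bag; stands in for the inner plan proj(col) -> gt, const -> gt."""
    def run(bag):
        return [t for t in bag if t[col] > const]
    return run

def distinct(bag):
    """Drop duplicate tuples, keeping the first occurrence of each."""
    seen = set()
    out = []
    for t in bag:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

def run_plan(ops, value):
    """Feed the value through each operator of a plan in order."""
    for op in ops:
        value = op(value)
    return value

# One assumed output tuple of B = group A by $0: (group key, bag of A tuples).
grouped = (7, [(7, "x"), (7, "y"), (7, "x")])

plan0 = [proj(0)]                                       # 0 - proj(0)
plan3 = [proj(1), filter_gt(0, 5), distinct, proj(1)]   # 3 - feeds udf3

key = run_plan(plan0, grouped)        # 7
udf3_arg = run_plan(plan3, grouped)   # [("x",), ("y",)]
```

Note that this toy proj decides what to do by inspecting the value it receives, which is a shortcut for the sketch and not a statement about how the real project operator makes that decision.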
In discussing the scenario, Santhosh and I saw one issue: in plan 1,
the proj( * ) will incorrectly try to accumulate a bag for udf1, when
it should just pass the tuple through. Santhosh is going to fix that by
changing the project to check whether it has a predecessor and, if so,
whether that predecessor is a relational operator, instead of looking
at its input to see if it's a relational operator.
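Roughly, the difference between the two checks looks like the sketch below. This is Python for brevity, and everything in it is an assumption for illustration: the Op class, its fields, and both function names are hypothetical, and "input" is modeled as the predecessor when there is one, otherwise the operator feeding the enclosing foreach.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Op:
    """Hypothetical operator node; not Pig's actual class."""
    name: str
    relational: bool = False
    # Operator immediately before this one inside the same inner plan.
    predecessor: Optional["Op"] = None
    # Operator that feeds the enclosing foreach (assumed field).
    plan_input: Optional["Op"] = None

def accumulates_bag_old(project: Op) -> bool:
    """Old check: look at the project's input, wherever it comes from."""
    source = project.predecessor or project.plan_input
    return source is not None and source.relational

def accumulates_bag_new(project: Op) -> bool:
    """New check: only a relational predecessor in the same plan counts."""
    pred = project.predecessor
    return pred is not None and pred.relational

# Plan 1: proj(*) is the first operator, fed by the group (relational).
group = Op("group", relational=True)
proj_star = Op("proj(*)", plan_input=group)

# Plan 5: the trailing proj(1) follows a distinct (relational).
lead_proj = Op("proj(1)", plan_input=group)
dist = Op("distinct", relational=True, predecessor=lead_proj)
tail_proj = Op("proj(1)", predecessor=dist)

plan1_old = accumulates_bag_old(proj_star)   # True: the unwanted bag
plan1_new = accumulates_bag_new(proj_star)   # False: tuple passes through
plan5_new = accumulates_bag_new(tail_proj)   # True: still collects a bag
```

With the new check, a project with no predecessor simply hands its tuple on, while a project that follows a relational operator such as distinct keeps accumulating a bag.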
I didn't follow your comment on the issue with the project in the filter
plan. It looked fine to me.
Alan.