Analysis below.

Shravan M Narayanamurthy wrote:
Hi Guys,
I think we need to find a proper set of rules for the project's schema. The following script kind of covers all the scenarios:
A = load 'a';
B = group A by $0;
C = foreach B {
C1 = filter A by $0>5;
C2 = distinct C1;
C3 = distinct A;
generate group, udf1(*), udf2(C2), udf3(C2.$1), udf4(C3), udf5(C3.$1);
}

I think we had not thought about the projection in the inner plan of the filter. With this constraint, we need a new set of rules. Can you post an algorithm that will work to set the return types of the projects?

Thanks & Regards,
--Shravan

<snip>
In this case, the foreach should have the following plans:

0 - proj(0)

1 - proj( * ) -> udf1

2 - proj (1) -> filter -> distinct -> proj( * ) -> udf2

3 - proj (1) -> filter -> distinct -> proj(1) -> udf3

4 - proj(1) -> distinct -> proj( * ) -> udf4

5 - proj(1) -> distinct -> proj(1) -> udf5
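
To make the chains above concrete, here is a rough sketch in Python (not Pig's Java) that runs plans 0, 3 and 5 over one grouped tuple. Every function name here is hypothetical, bags are modeled as plain lists, and a column projection over a bag is assumed to yield one-field tuples; none of this is Pig's actual operator code.

```python
# Illustrative only: hypothetical helpers, not Pig's operator classes.

def proj(col):
    """proj(n) at the head of a plan: pick field n of the incoming tuple."""
    return lambda t: t[col]

def filt(pred):
    """filter: keep the tuples of a bag that satisfy pred."""
    return lambda bag: [t for t in bag if pred(t)]

def distinct(bag):
    """distinct: drop duplicate tuples, keeping first-seen order."""
    seen, out = set(), []
    for t in bag:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

def proj_bag(col):
    """proj(n) after a relational operator: assumed to give a bag of one-field tuples."""
    return lambda bag: [(t[col],) for t in bag]

def run(plan, value):
    """Feed value through each step of the plan in order."""
    for step in plan:
        value = step(value)
    return value

plan0 = [proj(0)]                                                 # group
plan3 = [proj(1), filt(lambda t: t[0] > 5), distinct, proj_bag(1)]  # input to udf3
plan5 = [proj(1), distinct, proj_bag(1)]                          # input to udf5

# One output tuple of "B = group A by $0": (group key, bag of A's tuples).
group_tuple = (7, [(7, "x"), (7, "x"), (7, "y")])

key = run(plan0, group_tuple)
udf3_input = run(plan3, group_tuple)
udf5_input = run(plan5, group_tuple)
```

Note that each plan starts from the grouped tuple, so the leading proj(1) is what hands the bag A to the relational operators that follow.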

In plans 2 and 3, filter will have an inner plan of:

proj(0) -> gt, const(5) -> gt
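
As a sketch of how that inner plan evaluates, the snippet below builds the same shape as a small expression tree in Python: proj(0) and const(5) both feed gt. The class names are hypothetical stand-ins, not Pig's classes, and the bag is again just a list of tuples.

```python
from dataclasses import dataclass
from typing import Any

# Illustrative only: hypothetical expression nodes, not Pig's classes.

@dataclass
class Proj:
    column: int
    def eval(self, row):
        # proj(n): pick field n of the tuple being tested
        return row[self.column]

@dataclass
class Const:
    value: Any
    def eval(self, row):
        # const(v): ignore the tuple, return the constant
        return self.value

@dataclass
class Gt:
    lhs: Any
    rhs: Any
    def eval(self, row):
        # gt: true when the left input exceeds the right input
        return self.lhs.eval(row) > self.rhs.eval(row)

# proj(0) -> gt, const(5) -> gt
inner_plan = Gt(Proj(0), Const(5))

def run_filter(bag, plan):
    """Apply the inner plan to every tuple of the bag and keep the matches."""
    return [row for row in bag if plan.eval(row)]

kept = run_filter([(3, "a"), (7, "b"), (9, "c")], inner_plan)
```

Here the proj(0) inside the filter sees one tuple of the bag at a time, which is different from the proj(1) at the head of plans 2 and 3 that sees the whole grouped tuple.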

In discussing the scenario, Santhosh and I saw one issue, which is that in plan 1, the proj( * ) will incorrectly try to accumulate a bag for udf1, when it should just pass the tuple. Santhosh is going to fix that by changing the project to check whether it has a predecessor, and if so whether that predecessor is a relational operator, instead of looking at its input to see if it's a relational operator.
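
A minimal sketch of that predecessor check, assuming a simplified operator model written in Python rather than Pig's Java. The classes, the `relational` flag and the `process` method are all hypothetical and only illustrate the decision; they are not the actual fix.

```python
# Illustrative only: hypothetical operator model, not Pig's implementation.

class Operator:
    def __init__(self, name, relational=False, predecessor=None):
        self.name = name
        self.relational = relational    # True for filter, distinct, etc.
        self.predecessor = predecessor  # operator feeding this one, or None

class Project(Operator):
    def __init__(self, column=None, predecessor=None):
        # column=None stands for proj( * )
        super().__init__("proj", relational=False, predecessor=predecessor)
        self.column = column

    def process(self, data):
        pred = self.predecessor
        if pred is not None and pred.relational:
            # Fed by a relational operator: data is a sequence of tuples,
            # so accumulate the projected tuples into a bag.
            if self.column is None:
                return list(data)
            return [(t[self.column],) for t in data]
        # No predecessor (or a non-relational one): data is a single tuple,
        # so pass it, or the requested field, straight through.
        return data if self.column is None else data[self.column]

# Plan 1: proj( * ) -> udf1. The project has no predecessor.
star_root = Project()
udf1_input = star_root.process((7, [(7, "x"), (7, "y")]))

# Plan 2: ... -> distinct -> proj( * ) -> udf2. The predecessor is relational.
distinct_op = Operator("distinct", relational=True)
star_after_distinct = Project(predecessor=distinct_op)
udf2_input = star_after_distinct.process(iter([(7, "x"), (7, "y")]))
```

With the decision keyed on the predecessor, the proj( * ) in plan 1 hands udf1 the tuple unchanged, while the same operator placed after distinct still builds a bag.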

I didn't follow your comment on the issue with the project in the filter plan. It looked fine to me.

Alan.
