pig-user  

Re: Nested expressions in FOREACH vs FILTER

Mridul Muralidharan
Fri, 28 Mar 2008 10:59:59 -0700



Hi Alan,

To clarify: consider the query we are discussing in the other mail:

A = load './a' USING PigStorage('\t') AS (id, name, source_id, target_id, score, atomic_target_val);

B = A;

C = FILTER B by ( ( score > '0' ) and (
        ( name eq 'Product_id' and atomic_target_val eq '1' and score > '0.5' )
     or ( name eq 'type' and atomic_target_val eq 'Product' )
     or ( name eq 'name' ) ) );

D = GROUP C by source_id;

E = FOREACH D {
    F1 = FILTER C by ( name eq 'Product_id' and atomic_target_val eq '1' and score > '0.5' );
    F2 = FILTER C by ( name eq 'type' and atomic_target_val eq 'Product' );
    GENERATE ( ( ( COUNT ( F1 ) > '0' ? '1' : '0' ) + ( COUNT ( F2 ) > '0' ? '1' : '0' ) ) == '2' ? $0 : -1 ), $1;
};



Note that each tuple in D contains a bag ($1) holding the tuples I want to filter on. The idea is: if each of the constraints (applied as F1, F2, etc.) is satisfied by some tuple in the bag (not necessarily the same tuple), the bag as a whole satisfies the constraints; otherwise it should be filtered out.


As an example input for the query above, we can have:

1 Product_id 1001 10001 1.0 1
2 type 1001 10002 1.0 Product
3 name 1001 10003 1.0 Milk
4 Product_id 1002 10001 1.0 2
5 type 1002 10002 1.0 Product
6 name 1002 10003 1.0 Egg
7 Product_id 1003 10001 1.0 1
8 name 1003 10002 1.0 Bread


In this case, the bags for 1002 and 1003 (after the group) do not satisfy the constraints, while the bag for 1001 does.
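
The grouping and per-bag check described above can be sanity-checked outside Pig. What follows is a minimal Python sketch (not Pig Latin) that models the intended semantics on the sample rows. It assumes scores are parsed as floats rather than compared as strings, and the function names (is_f1, is_f2, satisfying_sources) are made up for illustration. The initial score > '0' filter is left out because every sample row has a score of 1.0.

```python
from collections import defaultdict

# Sample rows: (id, name, source_id, target_id, score, atomic_target_val)
ROWS = [
    (1, "Product_id", 1001, 10001, 1.0, "1"),
    (2, "type", 1001, 10002, 1.0, "Product"),
    (3, "name", 1001, 10003, 1.0, "Milk"),
    (4, "Product_id", 1002, 10001, 1.0, "2"),
    (5, "type", 1002, 10002, 1.0, "Product"),
    (6, "name", 1002, 10003, 1.0, "Egg"),
    (7, "Product_id", 1003, 10001, 1.0, "1"),
    (8, "name", 1003, 10002, 1.0, "Bread"),
]


def is_f1(row):
    # Models: name eq 'Product_id' and atomic_target_val eq '1' and score > '0.5'
    _, name, _, _, score, val = row
    return name == "Product_id" and val == "1" and score > 0.5


def is_f2(row):
    # Models: name eq 'type' and atomic_target_val eq 'Product'
    _, name, _, _, _, val = row
    return name == "type" and val == "Product"


def group_by_source(rows):
    # Models: D = GROUP C by source_id (one bag of rows per source_id)
    bags = defaultdict(list)
    for row in rows:
        bags[row[2]].append(row)
    return bags


def satisfying_sources(rows):
    # A bag passes when some row matches F1 and some (possibly other) row matches F2.
    bags = group_by_source(rows)
    return sorted(
        sid for sid, bag in bags.items()
        if any(is_f1(r) for r in bag) and any(is_f2(r) for r in bag)
    )


print(satisfying_sources(ROWS))  # [1001]
```

Only 1001 survives: 1002 has no Product_id row with value 1, and 1003 has no type row.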


With the above script, I will need another step :

G = FILTER E by $0 != '-1';

to remove the entities which don't satisfy the constraints.


If FILTER supported the same level of functionality as FOREACH, the above could have been written as:


E = FILTER D {
    F1 = FILTER C by ( name eq 'Product_id' and atomic_target_val eq '1' and score > '0.5' );
    F2 = FILTER C by ( name eq 'type' and atomic_target_val eq 'Product' );
    BY ( ( ( COUNT ( F1 ) > '0' ? '1' : '0' ) + ( COUNT ( F2 ) > '0' ? '1' : '0' ) ) == '2' );
};



Hope this clarifies.
So any non-trivial filtering of relations (anything not expressible in a single statement) would result in a feature request of this form. In the above, I have restricted myself to not using UDFs, but you can imagine the constraints, COUNT, etc. being replaced by UDFs too.


Regards,
Mridul



Alan Gates wrote:
I'm not clear on the semantics you're proposing for filter. I think what you're saying is that pig cannot apply a relation level conditional (instead of record level conditional) in a natural way.

To be clear, pig can do a record level conditional like:

c = foreach b generate ($0 > '1' ? $1 : $2);

But if you instead want to apply the conditional to the entire relation, we have to do something contorted (like the workaround you suggest). You'd like to be able to do something like:

c = b generate (any $0 > '1' ? $1 : $2);

where the 'any' operator is applied to all of $0 instead of being applied a row at a time.
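
One way to read the difference between the two forms is the Python sketch below (not Pig). It is only a model: the 'any' operator is hypothetical, the function names are invented for illustration, and $0 is treated as an integer rather than compared as a string.

```python
def record_level(rows):
    # Models: c = foreach b generate ($0 > '1' ? $1 : $2);
    # The condition is evaluated separately for each row.
    return [r[1] if r[0] > 1 else r[2] for r in rows]


def relation_level_any(rows):
    # Models the hypothetical 'any': the condition is evaluated once over
    # every $0 in the relation, and that single result is used for all rows.
    cond = any(r[0] > 1 for r in rows)
    return [r[1] if cond else r[2] for r in rows]


b = [(0, "a1", "a2"), (2, "b1", "b2")]
print(record_level(b))        # ['a2', 'b1']
print(relation_level_any(b))  # ['a1', 'b1']
```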

Is that correct, or are you suggesting more than that? Or perhaps something altogether different?

Alan.

On Mar 27, 2008, at 3:37 PM, Mridul Muralidharan wrote:

Hi,

  FOREACH supports nested expressions of the form:
var1 = FOREACH var { <expr>'s; GENERATE <tuple> }

Similar functionality does not seem to be available with FILTER.
That is, slightly complex filter expressions are not possible, particularly when we need to process the bags/tuples contained in the tuples of the relation in question.

Mirroring FOREACH functionality, something like this would be great :

var1 = FILTER var {
  t1 = <expr>;
  t2 = <expr>;
  ...
  BY (conds);
}


The workaround for the immediate problem I am facing is to use FOREACH to generate something like $status, <tuple>, then FILTER on $status, followed by another FOREACH to remove the status.
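
The three-step workaround and the requested one-step form can be compared with a small Python model (not Pig). Everything here is an illustrative assumption: the helper names, the toy groups, and the has_two predicate, which stands in for whatever nested conditions the BY clause would hold.

```python
def tag(groups, predicate):
    # Step 1 (FOREACH): prepend a status computed from each group's bag.
    return [(predicate(bag), key, bag) for key, bag in groups]


def keep_passing(tagged):
    # Step 2 (FILTER): keep only the tuples whose status is true.
    return [t for t in tagged if t[0]]


def strip_status(tagged):
    # Step 3 (FOREACH): drop the status column again.
    return [(key, bag) for _, key, bag in tagged]


def nested_filter(groups, predicate):
    # The requested form: evaluate the bag-level condition inside the filter.
    return [(key, bag) for key, bag in groups if predicate(bag)]


def has_two(bag):
    # Hypothetical bag-level condition: at least two tuples in the bag.
    return len(bag) >= 2


groups = [("k1", [1, 2, 3]), ("k2", [4]), ("k3", [])]
workaround = strip_status(keep_passing(tag(groups, has_two)))
print(workaround)  # [('k1', [1, 2, 3])]
print(workaround == nested_filter(groups, has_two))  # True
```

Both paths give the same result; the nested form simply avoids materializing and then removing the status column.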

Regards,
Mridul