On Oct 15, 2010, at 7:38 PM, Alan Gates wrote:

> Basically it's a matter of clarity.  I agree that it creates a lot of  
> boiler plate, but we thought it made it more clear exactly what was  
> being passed in and out of the macro.  Especially in cases where a  
> macro returns multiple outputs (that is, you can't just look at the  
> last line and see what it is returning).  In the original proposal the  
> store would basically act as a return statement.  But perhaps we're  
> optimizing for the less common case.  If others agree that a more  
> terse (but less clear) syntax is better, I'm open to that.
> 
> One change I would want to make.  In your proposal, it isn't obvious  
> what is input and output without examining the macro.  In cases where  
> the macro is more than a few lines this will be hard to use.  This  
> could be addressed though by adding an 'out' keyword, so that it  
> becomes:
> 
> define bot_cleanser[X out, Y](user) {
>       X = filter Y by not is_a_bot($user);
> }
> 

I agree that explicitly declaring which aliases are produced and consumed is 
better.  I had thought about that but wanted my example to be simple since I 
wasn't sure if there were other reasons why the TempStorage intermediary was 
chosen.  

Additionally, there are issues with pulling schemas through macro expansion and 
avoiding field name ambiguity.


Let's use a more concrete example.

Here is the equivalent of "A join B on ((A.x = B.x) OR (A.y = B.y)) where 
A.time > B.time" in Pig.  This is a disjunctive join, and requires at least N + 
1 MapReduce passes where N is the number of columns to test for equality.


--------------
A = load 'Afoo' .....   as (x:chararray, y:chararray, time:long, i:int, j:int,
    k:int);
B = load 'Bfoo' .....   as (x:chararray, y:chararray, time:long, u:double,
    v:double);

... some work makes (x,y,time) unique for A and B, such as a group and some 
math. 


A_tmp = FILTER A BY x != '';
B_tmp = FILTER B BY x != '';
XMATCH = JOIN A_tmp BY x, B_tmp BY x;
XMATCHES = FILTER XMATCH BY A_tmp::time > B_tmp::time;
/* relabeling necessary for sanity in later cogroup, also 'project early and often' */
XMATCHES = FOREACH XMATCHES GENERATE A_tmp::x as x, A_tmp::y as y,
    A_tmp::time as time, i as i, j as j, k as k;

A_tmp = FILTER A BY y != '';
B_tmp = FILTER B BY y != '';
YMATCH = JOIN A_tmp BY y, B_tmp BY y;
YMATCHES = FILTER YMATCH BY A_tmp::time > B_tmp::time;
/* relabeling necessary for sanity in later cogroup, also 'project early and often' */
YMATCHES = FOREACH YMATCHES GENERATE A_tmp::x as x, A_tmp::y as y,
    A_tmp::time as time, u as u, v as v;

MATCHES_GROUP = COGROUP YMATCHES BY (x,y,time), XMATCHES BY (x,y,time);
MATCHES = FOREACH MATCHES_GROUP {
  CHOSEN = (IsEmpty(YMATCHES) ? XMATCHES.(i,j,k) : YMATCHES);
  /* project and re-label or else later pig code has to know parent alias names */
  GENERATE FLATTEN(CHOSEN.(x,y,i,j,k,u,v)) as (x,y,i,j,k,u,v);
}
------------------
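
For reference, the result the script above is aiming for can be sketched in plain Python. This is only an illustration of the join semantics, not of how Pig would plan it: rows are assumed to be dicts, and the function names (equi_join, disjunctive_join) and the keep callback are made up for this example. It runs one equi-join per column and then merges the matches so a pair that matched on both columns is reported once, which mirrors the N joins plus one final merge pass.

```python
from collections import defaultdict

def equi_join(a_rows, b_rows, key):
    """Yield (i, j) index pairs whose rows agree on `key`, skipping empty keys."""
    index = defaultdict(list)
    for j, b in enumerate(b_rows):
        if b[key] != '':
            index[b[key]].append(j)
    for i, a in enumerate(a_rows):
        if a[key] != '':
            for j in index.get(a[key], []):
                yield i, j

def disjunctive_join(a_rows, b_rows, keys, keep):
    """A join B on (A.k1 = B.k1 OR A.k2 = B.k2 ...) where keep(a, b) holds."""
    matched = set()
    for key in keys:  # one equi-join pass per column
        for i, j in equi_join(a_rows, b_rows, key):
            if keep(a_rows[i], b_rows[j]):
                matched.add((i, j))  # the set removes pairs found by several columns
    # final merge pass: emit each matching pair exactly once
    return [(a_rows[i], b_rows[j]) for i, j in sorted(matched)]

# Hypothetical sample data, only to show the call shape.
A = [{'x': 'p', 'y': 'q', 'time': 5}, {'x': '', 'y': 'r', 'time': 9}]
B = [{'x': 'p', 'y': 'q', 'time': 3}, {'x': 'z', 'y': 'r', 'time': 1},
     {'x': 'p', 'y': 'zz', 'time': 7}]
pairs = disjunctive_join(A, B, ['x', 'y'], lambda a, b: a['time'] > b['time'])
```

With this sample data the first A row matches the first B row on both x and y, but the pair appears only once in pairs.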

A function for disjunctive join would require several parameters:
The 'out' alias.
Two input aliases.
N pairs of fields to match.

It is the last that looks to complicate things a bit.  We should probably pass 
the fields visible to the macro as well as the aliases. 

That might look like:

------------------------
define disjunctive_join_filter[out RESULT, A.(a,b,c), B.(a,b,c)](filterStr) {
  inline join_filter[MATCH_1, A.(a,c), B.(a,c)]($filterStr);
  inline join_filter[MATCH_2, A.(b,c), B.(b,c)]($filterStr);
  -- not sure if it can avoid alias ambiguity here, would hate to have to project for it
  RESULT_GROUP = COGROUP MATCH_1 BY (a,b,c), MATCH_2 BY (a,b,c);
  RESULT = FOREACH RESULT_GROUP {
    CHOSEN = (IsEmpty(MATCH_1) ? MATCH_2 : MATCH_1);
    GENERATE FLATTEN(CHOSEN);
  }
}

define join_filter[out RESULT, A.(foo, bar), B.(foo, bar)](filterStr) {
  A_tmp = FILTER A BY foo != $filterStr;
  B_tmp = FILTER B BY foo != $filterStr;
  RESULT = JOIN A_tmp BY foo, B_tmp BY foo;
  RESULT = FILTER RESULT BY A_tmp::bar > B_tmp::bar;
}
--------------------
In alias declarations, we have to declare what fields are visible to the macro, 
and give them names for the inside of the macro.  Otherwise you can't write the 
join_filter macro to operate on different columns.  The ALIAS.(field1, field2) 
syntax declares what the macro can 'see'.

Let's inline the whole thing:

--------------------
A = load 'Afoo' .....   as (x:chararray, y:chararray, time:long, i:int, j:int,
    k:int);
B = load 'Bfoo' .....   as (x:chararray, y:chararray, time:long, u:double,
    v:double);

... some work makes (x,y,time) unique for A and B, such as a group and some 
math. 

inline disjunctive_join_filter[MATCHES, A.(x,y,time), B.(x,y,time)]('');
FOREACH MATCHES GENERATE x as x, y as y, i as i, j as j, k as k,
    u as u, v as v;
--------------------

I think the last line above, as well as the COGROUP line, would fail with 
ambiguous field references if a pure 'find and replace' macro expansion was 
done.  The expansion will have to be smart enough to re-label the 
PARENT::field aliases to the right scope.  There may also be problems 
projecting from the 'out' alias due to field ambiguity and unknown parent 
alias names.

The TempStorage approach in this case adds 8 extra lines, increasing LOC 
rather than decreasing it.  Also, to create join_filter using TempStorage, 
you would need to project all of A and B manually, and the caller and macro 
have to stay in sync on alias names.  That is, the caller has to know the 
internals of the macro in order to use it, and will break if the macro 
changes its internal names, although I suppose the macro can use positional 
indexes to avoid that.
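
To make the expansion idea concrete, here is a rough Python sketch of a 'find and replace' expansion that at least keeps undeclared aliases from colliding. It is not how Pig does or would implement this: the function name expand_macro, the call_id argument and the macro_<id>_ prefix are all invented for illustration. Aliases passed in by the caller are substituted, $parameters are filled in, and every other alias assigned inside the body gets a unique name.

```python
import re

def expand_macro(body, alias_args, params, call_id):
    """Textually expand a macro body (hypothetical helper, for illustration only).

    alias_args maps formal alias names to the caller's aliases,
    params maps $parameter names to replacement text,
    call_id makes the generated private alias names unique per call site.
    """
    # Treat any name assigned at the start of a line (NAME = ...) as an alias.
    defined = set(re.findall(r'^\s*([A-Za-z_]\w*)\s*=(?!=)', body, flags=re.MULTILINE))
    rename = dict(alias_args)
    for name in defined:
        if name not in rename:
            # Not shared with the caller, so give it a private name.
            rename[name] = f'macro_{call_id}_{name}'

    # Fill in $param references first.
    out = re.sub(r'\$(\w+)', lambda m: params[m.group(1)], body)

    # Rename whole-word identifiers, leaving single-quoted strings untouched.
    def sub_alias(m):
        if m.group(1) is not None:
            return m.group(1)
        return rename.get(m.group(2), m.group(2))

    return re.sub(r"('[^']*')|\b([A-Za-z_]\w*)\b", sub_alias, out)

body = ("A_tmp = FILTER A BY foo != $filterStr;\n"
        "RESULT = FILTER A_tmp BY bar > 0;")
expanded = expand_macro(body, {'A': 'events', 'RESULT': 'MATCHES'},
                        {'filterStr': "''"}, 1)
```

Note what this does not handle: a field that happens to share a name with an alias gets renamed too, and nothing here re-labels PARENT::field references in the output schema.  Those are the cases where a purely textual expansion falls short.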



-Scott

> Alan.
> 
> On Oct 15, 2010, at 5:58 PM, Scott Carey wrote:
> 
>> I'm most interested in the macro expansion and importing other files  
>> for shared common code.  I could be missing something, but is the  
>> TempStorage thing necessary?
>> 
>> bot_filter.pig:
>> --------------
>> define bot_cleanser(user) {
>>   A = load 'bc_input' using TempStorage();
>>   B = filter A by not is_a_bot($user);
>>   store B into 'bc_output' using TempStorage();
>> }
>> ----------------
>> main.pig:
>> -------------------
>> import bot_filter.pig;
>> 
>> A = load 'fact';
>> store A into 'bc_input' using TempStorage();
>> inline bot_cleanser('username');
>> B = load 'bc_output' using TempStorage();
>> C = group B by user;
>> ...
>> store Z into 'processed';
>> -----------------------
>> 
>> Couldn't we pass aliases in instead and remove lots of boilerplate?
>> 
>> bot_filter.pig:
>> --------------
>> define bot_cleanser[X,Y](user) {
>>   Y = filter X by not is_a_bot($user);
>> }
>> ----------------
>> main.pig:
>> -------------------
>> import bot_filter.pig;
>> 
>> A = load 'fact';
>> inline bot_cleanser[A,B]('username');
>> C = group B by user;
>> ...
>> store Z into 'processed';
>> -----------------------
>> 
>> The inline then would substitute A for X, B for Y, and 'username'  
>> for user.  Aliases are separated from other parameters because we  
>> may actually be declaring new aliases when inlining and it should be  
>> easier to deal with the semantic differences that way.  In  
>> particular, the [A, B] above are essentially declaring that the  
>> macro 'shares' these aliases, and all other aliases do not overlap.
>> 
>> Any aliases not declared up front are renamed as to not collide when  
>> inlined.  I look at the macro expansion and function examples and  
>> see tons of alias naming boilerplate that should IMO be implicit  
>> somehow.  Pig already has a lot of alias and field naming  
>> boilerplate, I would like to avoid introducing more.  Otherwise, I'm  
>> sure I'll use a preprocessor again to get rid of it :).
>> 
>> 
>> 
>> 
>> On Oct 15, 2010, at 4:39 PM, Alan Gates wrote:
>> 
>>> After several months of mulling things around Richard and I have put
>>> together a proposed design for adding control flow to Pig.  See 
>>> http://wiki.apache.org/pig/TuringCompletePig
>>> for complete details.  Please give us your feedback.
>>> 
>>> Alan.
>> 
> 
