Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 

The following page has been changed by PiSong:

  || Legacy Store || bag || file ||
  || LoadTuple || bag || tuple ||
  || StoreTuple || Tuple || Bag ||
- || LoadAtom || Tuple || Atom ||
- || StoreAtom || Atom || Tuple ||
+ || LoadField || Tuple || Field ||
+ || StoreField || Field || Tuple ||
   1. This makes it possible to create a consistent nested processing model.
   1. This gives more flexibility to the language beyond plans only consisting 
of bag-based operators. For example, suppose I have implemented a UDF called 
TupleMinMax of type Tuple -> Tuple that finds the MIN/MAX across all the 
elements. One way to try this out in Grunt would be:-
@@ -87, +87 @@

                        GreaterThan 0
-                        StoreAtom (Will be boolean)
+                        StoreField (Will be boolean)
  NOTE: This inner plan applies to each tuple in the input bag. The FILTER 
operator iterates through the bag, passes each tuple to the inner plan, and 
takes the output of the inner plan, which is a boolean atom in this example. 
FILTER then uses that atom to determine whether to forward the tuple to the 
output port.
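The FILTER behaviour described in the NOTE above can be sketched in a few lines of Python. This is only an illustration, not Pig code: `filter_bag` and `first_field_positive` are hypothetical names, and the choice of field 0 as the input to GreaterThan is an assumption, since the diff does not show which field the inner plan projects.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical stand-in for an inner plan: one tuple in, one boolean field out.
InnerPlan = Callable[[tuple], bool]


def filter_bag(bag: Iterable[tuple], inner_plan: InnerPlan) -> Iterator[tuple]:
    """Hand each tuple of the bag to the inner plan and forward the tuple
    to the output only when the inner plan yields True."""
    for tup in bag:
        if inner_plan(tup):
            yield tup


def first_field_positive(tup: tuple) -> bool:
    # Assumed inner plan: project field 0, then apply "GreaterThan 0".
    return tup[0] > 0


bag = [(3, "a"), (-1, "b"), (0, "c"), (7, "d")]
filtered = list(filter_bag(bag, first_field_positive))
print(filtered)
```

The outer operator only iterates and forwards; everything that decides the result lives in the inner plan, which is the separation the NOTE describes.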
@@ -99, +99 @@

                         /         \ 
                        /           \
-              AtomProject(0)   AtomProject(1)
+              FieldProject(0)   FieldProject(1)
                    /    |          |
       Const(2)--PLUS     \        /
                   |       \      / 
                   |        MULTIPLY
                   |          |
                   |          |
-             InsertAtom(1)  InsertAtom(0)
+             InsertField(1)  InsertField(0)
                   \          /
                    \        /
@@ -115, +115 @@

  NOTE: In the real implementation, using a separate inner plan for each output 
field might be simpler. For example, "GENERATE $1+$2, ($1+$2)*5" can be split 
into one plan for $1+$2 and another plan for ($1+$2)*5, so that we don't have 
to care about merging them. /!\ Open question /!\
  [shrav] Pig already kind of does what you are saying here; it just does it 
implicitly. The loadTuple is in fact what happens when a nested plan is 
processed. I guess the way to extend the language would be to just allow all 
the operators that we allow outside a nested plan inside of it. In fact, the 
execution side, that is the Physical side, already supports it. We just need 
to make the appropriate parser changes, and the hard part would be the type 
checking and the parsing itself.
+ [pi] I think that in the logical plan change, it would be better to have 
something that indicates the link between the outer and inner plans. I don't 
find any existing operator that fits this. I agree that it doesn't have to be 
explicit.
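As a rough illustration of the NOTE above about keeping one inner plan per output field, here is a Python sketch for "GENERATE $1+$2, ($1+$2)*5". The names `plan_sum`, `plan_sum_times_five` and `generate` are hypothetical, and modelling a plan as a plain function is an assumption made only for this sketch.

```python
def plan_sum(tup):
    # Inner plan for the first output field: $1 + $2 (positions are 0-based).
    return tup[1] + tup[2]


def plan_sum_times_five(tup):
    # Inner plan for the second output field: ($1 + $2) * 5.
    # It recomputes $1 + $2 instead of sharing it with plan_sum.
    return (tup[1] + tup[2]) * 5


def generate(tup, field_plans):
    """Run every per-field plan on the same input tuple and collect the
    results, in order, into the output tuple."""
    return tuple(plan(tup) for plan in field_plans)


result = generate((10, 2, 3), [plan_sum, plan_sum_times_five])
print(result)
```

The trade-off is visible here: the two plans never need to be merged, but the shared sub-expression $1+$2 is evaluated twice.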
  ==== More examples ====
  Given GENERATE: Tuple -> Tuple
@@ -125, +126 @@

              /        \
- AtomProject(1)       AtomProject(2)                   
+ FieldProject(1)       FieldProject(2)                   
     |\                      |                    
     | \                     |  --------Constant(5)
     |  \                    | /
@@ -135, +136 @@

     |      ----------------\|
     |                      MUL 
     |                       |
- InsertAtom(1)           InsertAtom(2)
+ InsertField(1)           InsertField(2)
       \                    /
        \ _____        ____/
               \      /
@@ -143, +144 @@

  Diagram B1
  [shrav] Are you saying that pig does not support this now?
+ [Pi] No, this wiki page is solely from my imagination.
  This looks similar to a common relational plan:-
@@ -234, +236 @@

  JOIN :      This can be constructed by COGroup
- GENERATE looks oversimplified to me. First the input need not just be a 
tuple, it can be a combination of tuple and bag and flatten in that case 
actually produces the cartesian product.
+ [shrav] GENERATE looks oversimplified to me. First the input need not just be 
a tuple, it can be a combination of tuple and bag and flatten in that case 
actually produces the cartesian product.
  Also in FOREACH, the function inside can be a full plan. So it can process 
bags as well and not just tuples.
+ [pi] f: Tuple x Tuple is the full plan inside ForEach. GENERATE in here is 
still ambiguous (and might not be correct) until we clearly separate 
responsibilities with its counterpart.
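To make [shrav]'s point about flatten concrete, the following Python sketch shows how a mix of tuples and bags could expand into a cartesian product. It is an assumption-laden model, not Pig's implementation: bags are represented as Python lists of tuples, and `flatten_generate` is a hypothetical helper.

```python
from itertools import product


def flatten_generate(items):
    """Each item is either a single tuple or a bag (modelled as a list of
    tuples). A tuple is treated as a one-element bag, and the output holds
    one row per combination of elements taken from every bag."""
    bags = [item if isinstance(item, list) else [item] for item in items]
    rows = []
    for combo in product(*bags):
        # Concatenate the fields of the chosen tuples into one output row.
        rows.append(tuple(field for tup in combo for field in tup))
    return rows


rows = flatten_generate([("k",), [(1,), (2,)], [("x",), ("y",)]])
for row in rows:
    print(row)
```

With one tuple and two bags of two elements each, the output has 1 * 2 * 2 = 4 rows, which is why GENERATE cannot be described as a plain Tuple -> Tuple mapping in that case.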
  == Problems with current Operators (5-May-2008) ==
@@ -260, +263 @@

  Seems like LOGenerate is not needed at all. GENERATE is more like just a part 
of FOREACH syntax (analogous to BY and FILTER)
  [shrav] I don't agree with this. In fact it is the other way round. The 
Foreach is a dummy while the generate does all the work. The foreach just takes 
each input and uses the generate specification to process the input tuple. The 
generate spec is the one that defines the transformation.
+ [pi] We can think about this in two ways: first, only one of them does all 
the work. Second, we split responsibilities. I'm confused about which it is. We 
should come up with a clear-cut division of responsibilities. Though, if you 
say "foreach just takes each input and uses", then it is not a dummy.
  ==== LOProject ====
  This operator is only for mapping input tuple to output tuple (eg. 
{A,B,C,D,E} ==> {A,C,D} ). Given the fact that we allow users to have fields in 
COGROUP, FILTER, FOREACH as expressions, LOProject then becomes just a special 
case when users merely specify direct mapping. Since we have agreed upon the 
concept of inner plans, I think LOProject is not needed.
  [shrav] Project is a consistent way of implementing these fields that the 
user mentions, without making the user bother about all the conversions he 
might need to do if we just passed the raw tuple to him. Also you can only 
project out one field and not multiple fields.
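The argument that a direct mapping is only a special case of expression-based inner plans can be sketched in Python as follows. This illustrates the proposal's view of {A,B,C,D,E} ==> {A,C,D} and not how LOProject actually behaves (which, per [shrav], projects a single field); `project` and `apply_expressions` are hypothetical names.

```python
def project(tup, positions):
    """Direct mapping: copy the listed input positions to the output tuple."""
    return tuple(tup[i] for i in positions)


def apply_expressions(tup, expressions):
    """General case: every output field is computed by its own expression."""
    return tuple(expr(tup) for expr in expressions)


row = ("A", "B", "C", "D", "E")

# The mapping {A,B,C,D,E} ==> {A,C,D} written as a projection.
projected = project(row, [0, 2, 3])

# The same mapping written as three expressions that only look up a column.
lookups = [lambda t: t[0], lambda t: t[2], lambda t: t[3]]
via_expressions = apply_expressions(row, lookups)

print(projected, via_expressions)
```

Both calls yield the same tuple, which is the sense in which a pure projection adds nothing beyond what per-field expressions already provide.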
