The following page has been changed by AlanGates:
http://wiki.apache.org/pig/PigTypesFunctionalSpec
[[Anchor(Pig_Types_Functional_Specification)]]
= Pig Types Functional Specification =



[[Anchor(Purpose)]]
== Purpose ==
This document proposes a functional specification for the Pig type system.
It reviews what has already been implemented and discusses changes that
would be made to the type system to complete it.  In addition to types,
expression operators will be discussed.

[[Anchor(Definitions)]]
== Definitions ==
 * arithmetic operator:  *, /, +, -
 * atomic data types:  Single valued types (e.g. int, float).
 * comparator:  boolean operations that test equality, inequality, or do another comparison (such as MATCHES).
 * complex data types:  Multi-valued types (e.g. bag).
 * datum:  A value, which may be of complex or atomic type.
 * expression operators:  Language constructs used in FILTER or FOREACH statements that evaluate expressions of data.  These include comparators, arithmetic operators, user defined functions, and language provided functions.
 * relational operators:  Top level language constructs such as FILTER, GROUP BY, etc.


[[Anchor(Supported_Types)]]
== Supported Types ==
Currently (pig release 1.2) pig has four data types:
 * bag:  A collection of tuples.
 * tuple:  An ordered set of data (each datum may be of any type).
 * map:  A set of key value pairs, where each key is an atom and each value a datum (of any type).
 * atom:  A single valued datum.  Currently these are always strings.

Pig will now have the following data types:
 * bag:  A collection of data (not necessarily tuples).
 * tuple:  An ordered set of data (no changes).
 * map:  A set of key value pairs, where each key is an atom and each value a datum (of any type).  This is not a change, except that atoms will no longer necessarily be strings.
 * int:  32 bit integer.
 * long:  64 bit integer.
 * float:  32 bit floating point.
 * double:  64 bit floating point.
 * chararray:  Character array.  Includes the concept of encoding (i.e., how the characters are encoded).  Initially supported encodings are UTF16 and none (no encoding known or implied; the data is simply handled as a series of bytes).
 * unknown:  The type of this data is unknown or unspecified.

Eventually, user defined types (UDTs) may be added.  They are not specified
at this time.  However, nothing should be done in the implementation of types
or operators that would preclude the addition of UDTs at a later time.

[[Anchor(Type_Specification)]]
== Type Specification ==
There is currently no way to specify types.  The system simply assumes types
based on how the user makes use of a field (e.g. if # is applied to the
field, it must be a map).

With these changes, users will be able to specify data types for incoming
data.  Currently, it is possible to define column names via AS.  This will be
extended to allow defining a type for the column.  The syntax will be:
{{{
LOAD 'myfile' AS (colname [type], ...)
}}}

This will only be legal on AS in LOAD.  Once this is done the types will be
inferable from the output of other operations.  For example, given the script:
{{{
a = LOAD 'myfile' AS (name chararray, salary double, underlings long);
b = GROUP a ALL;
c = FOREACH b GENERATE AVG(salary), AVG(underlings);
}}}
The system will be able to track that a.salary is a double and a.underlings is
a long without the user supplying that information.

Users will not be required to define the type of a given datum.  If the type
is not provided, the datum will be assigned the type unknown.  Most expression
evaluations can still be used with unknown types (see
[#Expression_Operators expression operators below] for details).  Once an
unknown is coerced into a type by applying an expression evaluation, the
field will retain that new type rather than revert to unknown.  Leaving data
as type unknown allows users to work with data where they either do not want
to provide datatypes or do not know them.  However, there is a performance
and functionality cost.  Certain performance optimizations are only possible
if the types are known (for example, computing a SUM as a long value instead
of a double).
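As a sketch of how this coercion might look in practice (field names here are hypothetical):
{{{
-- No types declared, so both fields are of type unknown.
a = LOAD 'myfile' AS (name, count);
-- Comparing count to an int constant coerces it to int...
b = FILTER a BY count > 5;
-- ...and it retains type int in later expressions rather than
-- reverting to unknown.
c = FOREACH b GENERATE count + 1;
}}}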

When defining complex types, users have two choices.  They can define them
completely.  For example, column a is a bag of tuples, each tuple having size
3, with types (long, int, long).  They can also define a complex type to
contain type unknown.  Thus column a could be described as a bag of unknown,
or a bag of tuples of unknown.

The complete syntax for describing types will be:
|| '''type declaration''' || '''comments''' ||
|| bag(type) || ||
|| tuple(...) || indicates contents of tuple are not known (but not necessarily unknown) ||
|| tuple(type [, type]) || ||
|| map(type) || type indicates key type; value types are always unknown ||
|| int || ||
|| long || ||
|| float || ||
|| double || ||
|| chararray[(encoding)] || valid encodings are UTF16 and none; if not specified, defaults to none ||
|| unknown || ||

For example:
{{{
a = LOAD 'myfile' AS (garage bag(unknown), links bag(chararray(UTF16)), page bag(tuple(...)), coordinates bag(tuple(float, chararray)));
}}}

Users must also be able to declare types for their user defined functions
(UDFs).  The syntax for this will follow the specification given in
PigExternalFunctionDev.  Specifically, `DEFINE` will be changed to have the
following syntax:

{{{
DEFINE alias '=' type funcspec '(' arglist ')' ';'

alias: [a-zA-Z_][a-zA-Z_0-9]+

type: bag | tuple | map | int | long | float | double | chararray['(' encoding ')'] | unknown

funcspec: pathelement
          funcspec '.' pathelement

pathelement: [a-zA-Z_][a-zA-Z_0-9]+

arglist: type [varname]
         arglist ',' type [varname]

varname: [a-zA-Z_][a-zA-Z_0-9]+
}}}

As with other types, users will not be required to declare the signature of
their UDFs, in which case any arguments can be passed to the UDF and the
UDF's return value will be treated as type unknown.
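For illustration, a declaration conforming to the grammar above might look like the following (the function and package names are hypothetical):
{{{
DEFINE myavg = double com.mycompany.myudfs.MyAvg(bag b);
}}}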

If the function signature is defined but the function is then used in a way
that does not match that signature, the result will be a compile time error.
For example, the following will result in an error, because myfunc is
declared with two arguments but invoked with three:

{{{
DEFINE myfunc = int com.mycompany.myudfs.myfunc(int i, chararray c);
...
processed = foreach raw generate myfunc($0, $1, $2);
}}}

Also, if a function signature is defined, the parser will do type checking on
the function's arguments to ensure they match the definition given for the
function.

Function overloading, default arguments, and indeterminate number of arguments
(i.e. myfunc(int, ...)) will not be supported at this time.  Since users can
alias a given function as many times as they want using DEFINE, if they wish
to use the same function with different arguments they can alias it multiple
times to different names.
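As a sketch of that aliasing workaround (all names hypothetical), the same function could be declared twice with different signatures:
{{{
DEFINE myfunc_ii = int com.mycompany.myudfs.myfunc(int i, int j);
DEFINE myfunc_cc = chararray com.mycompany.myudfs.myfunc(chararray a, chararray b);
}}}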

[[Anchor(Constants)]]
== Constants ==
Currently, the only supported constants are strings, which must be enclosed in
single quotes.

String constants will be changed to be assumed to be of type chararray(none).
Users can specify these constants as ASCII characters (e.g. `x MATCHES 'abc'` )
or as octal constants (e.g. `x MATCHES '\0141\0142\0143'` ).  This allows
users to use non-ASCII characters in expressions.  Casts to type
chararray(UTF16) will be supported (see [#Cast_Operators casts] below).

Support will be added for numeric constants.  By default, any numeric
constant matching the regular expression [0-9]+ will be assigned a type of
int.  Any numeric constant matching the regular expression [0-9]+[lL] will be
assigned a type of long.  Any numeric constant matching the regular
expression [0-9]*\.[0-9]+([eE][+-]?[0-9]+)? will be assigned a type of
double.  Any numeric constant matching the regular expression
[0-9]*\.[0-9]+([eE][+-]?[0-9]+)?[fF] will be assigned a type of float.
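To illustrate how the rules above would classify constants (the field names here are hypothetical):
{{{
b = FILTER a BY count > 30;      -- 30 is an int
c = FILTER a BY bytes > 1024L;   -- 1024L is a long
d = FILTER a BY rate > 0.25f;    -- 0.25f is a float
e = FILTER a BY score > 2.5e1;   -- 2.5e1 is a double
}}}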

Support will also be added for complex constants.  The syntax for complex
constants will be:
{{{
datalist: datum
          datalist ',' datum

key_value_pair_list: key_value_pair
                     key_value_pair_list ',' key_value_pair

key_value_pair: atomic_datum '#' datum

bag: '{' datalist '}'

tuple: '(' datalist ')'

map: '[' key_value_pair_list ']'
}}}

The use of `()` for tuples, `{}` for bags, and `[]` for maps was chosen to
match the current output style of the dump command.
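For illustration, constants conforming to the syntax above might look like the following (the values are hypothetical):
{{{
b = FOREACH a GENERATE (1, 2.5, 'three');          -- a tuple
c = FOREACH a GENERATE {(1), (2), (3)};            -- a bag of tuples
d = FOREACH a GENERATE ['name'#'fred', 'age'#35];  -- a map
}}}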

[[Anchor(Nulls)]]
== Nulls ==
The concept of NULL is not currently supported in pig.  This
forces errors in some cases where we would prefer not to error out (such as
divide by 0 and failed UDF calls).  The basic approach to null values in pig
is detailed on the page NullValuesInPig.  While that document introduces the
concept of a multi-null (to indicate that a bag did not finish processing when
an error occurred and there should possibly be more elements in the bag), that
concept will not be implemented in pig at this time.

All existing operators will be changed to work with nulls.  Those changes
will conform to SQL NULL semantics.  The following table describes how each
operator will handle a null value:

|| '''Operator''' || '''Interaction''' ||
|| Equality operators || If either sub-expression is null, the result of the equality comparison will be null ||
|| Matches || If either the string being matched against or the string defining the match is null, the result will be null ||
|| Is null || Returns true if the tested value is null ||
|| Arithmetic operators, concat || If either sub-expression is null, the resulting expression is null ||
|| Size || If the tested object is null, size will return null ||
|| Dereference of a map or tuple || If the dereferenced map or tuple is null, the result will be null ||
|| Cast || Casting a null from one type to another will result in a null ||
|| Aggregate || Aggregate functions will ignore nulls, just as they do in SQL.  This requires changes to the existing implementations. ||

'''Generating Nulls'''

In addition to the table above, which indicates when operators will generate nulls, the following actions will generate a null:
   * Division by zero.
   * A user defined function returning an error for a given row without returning a fatal error.
   * A null value in the input data.
   * Dereference of a map key that does not exist in the map.  For example, given the map info containing [name#fred, phone#5551212], if the user does info#address, a null will be returned.
   * Dereference of a tuple into a non-existent field.  While this may be surprising, in that an invalid dereference would seem to be an error, pig's free flowing data model does not require that all tuples in a relation have the same number of entries, which may lead to some tuple dereferences into non-existent fields.
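As a sketch of the behavior described above (the field and key names are hypothetical):
{{{
a = LOAD 'myfile' AS (num int, denom int, info map(chararray));
-- num / denom is null for rows where denom is 0, and info#address
-- is null for rows whose map contains no address key; neither
-- condition aborts the script.
b = FOREACH a GENERATE num / denom, info#address;
}}}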

[[Anchor(Expression_Operators)]]
== Expression Operators ==
Expression operators do not currently define what operands they support,
though the use of some operands causes run time errors.  For example, a query
such as
{{{
a = load '/user/pig/tests/data/singlefile/studenttab10k';
b = group a by $0;
c = foreach b generate $0 * $1;
}}}

will run through the map stage and then produce an error in the reduce stage.
Ideally the concept of multiplying a bag by a scalar would either be well
defined or produce an error before the map reduce tasks begin.

Pig will change to clearly define which operators work with which types.
Using an operator with an unsupported type will result in an error at compile
time rather than during run time.

Expression operators are currently forced to use the lowest common
denominator for certain calculations.  For example, all numeric operations
are done as doubles, since the system cannot discriminate between floating
point types and integer types.  This is inefficient.  It also breaks the law
of least astonishment, in that anyone who has significant exposure to
computer languages expects integer arithmetic operations to result in integer
values, not floating point values.

For each operator, a table is provided showing how the operator will interact
with the various data types.  The following terms will be used in the tables:
 * error:  It is not valid to use this operator with this/these type(s).
 * not yet:  The use of this operator with this/these types is valid, but will not be initially implemented.
 * yes:  This operator is supported with this/these types.

[[Anchor(Comparitors)]]
=== Comparators ===
Comparators are currently "perlish" in that they require the user to define
the type of the operands.  If the user wishes to compare two operands
numerically, `==` is used, whereas `eq` is used for comparing two operands as
strings.  The existing string comparators `eq`, `ne`, `lt`, `lte`, `gt`,
`gte` will continue to be supported, but they will only be necessary when
working with data of type unknown.  It will now be legal to use `==`, `!=`,
etc. with data of type chararray.
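A sketch of the change (the field names are hypothetical):
{{{
a = LOAD 'myfile' AS (name chararray, nickname);
b = FILTER a BY name == 'fred';      -- now legal: name is a typed chararray
c = FILTER a BY nickname eq 'fred';  -- nickname is unknown; eq forces the
                                     -- comparison to be done as chararrays
}}}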

[[Anchor(Numeric_Equals_and_Notequals)]]
==== Numeric Equals and Notequals ====
|| || '''bag''' || '''tuple''' || '''map''' || '''int''' || '''long''' || '''float''' || '''double''' || '''chararray''' || '''unknown''' ||
|| '''bag''' || error || error || error || error || error || error || error || error || error ||
|| '''tuple''' || || Tuple A is equal to tuple B iff they have the same size s, and for all 0 <= i < s, A[i] == B[i] || error || error || error || error || error || error || error ||
|| '''map''' || || || Map A is equal to map B iff A and B have the same number of entries, and for every key k1 in A with a value of v1, there is a key k2 in B with a value of v2, such that k1 == k2 and v1 == v2 || error || error || error || error || error || error ||
|| '''int''' || || || || yes || yes || yes || yes || error || as int ||
|| '''long''' || || || || || yes || yes || yes || error || as long ||
|| '''float''' || || || || || || yes || yes || error || as float ||
|| '''double''' || || || || || || || yes || error || as double ||
|| '''chararray''' || || || || || || || || Two chararrays can be compared if they have the same encoding.  Given a chararray A with encoding X (where X is not none) and chararray B with encoding none, chararray B can be encoded using X, and the comparison done. || as chararray with encoding none ||
|| '''unknown''' || || || || || || || || || as double ||

[[Anchor(String_equality_and_Inequality_Comparators.)]]
==== String Equality and Inequality Comparators ====
These include `eq`, `ne`, `lt`, `lte`, `gt`, `gte`.

These are only valid for use with chararray and unknown types.  They are only
required when comparing two unknowns and the user wishes to force the
comparison to be done as chararrays.

|| || '''chararray''' || '''unknown''' ||
|| '''chararray''' || uses `==`, etc. || as chararray with encoding none ||
|| '''unknown''' || || as chararray with encoding none ||


[[Anchor(Numeric_Inequality_Operators_Except_Notequals)]]
==== Numeric Inequality Operators Except Notequals ====
|| || '''bag''' || '''tuple''' || '''map''' || '''int''' || '''long''' || '''float''' || '''double''' || '''chararray''' || '''unknown''' ||
|| '''bag''' || error || error || error || error || error || error || error || error || error ||
|| '''tuple''' || || error || error || error || error || error || error || error || error ||
|| '''map''' || || || error || error || error || error || error || error || error ||
|| '''int''' || || || || yes || yes || yes || yes || error || as int ||
|| '''long''' || || || || || yes || yes || yes || error || as long ||
|| '''float''' || || || || || || yes || yes || error || as float ||
|| '''double''' || || || || || || || yes || error || as double ||
|| '''chararray''' || || || || || || || || Two chararrays can be compared if they have the same encoding.  Given a chararray A with encoding X (where X is not none) and chararray B with encoding none, chararray B can be encoded using X, and the comparison done. || as chararray ||
|| '''unknown''' || || || || || || || || || as double ||


[[Anchor(MATCHES)]]
==== MATCHES ====
|| || '''chararray''' || '''unknown''' ||
|| '''chararray''' || Two chararrays can be compared if they have the same encoding.  Given a chararray A with encoding X (where X is not none) and chararray B with encoding none, chararray B can be encoded using X, and the comparison done. || as chararray with encoding none ||
|| '''unknown''' || || as chararray with encoding none ||

[[Anchor(IS_NULL_)]]
==== IS NULL  ====
This is a new operator.

This operator will be added to check whether a datum is null.  For SQL
compatibility, the syntax IS NOT NULL will also be supported.  It can be
applied to any data type.
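A sketch of usage (the field name is hypothetical):
{{{
b = FILTER a BY address IS NOT NULL;
}}}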

[[Anchor(Binary_Operators)]]
=== Binary Operators ===

[[Anchor(Multiplication_and_Division)]]
==== Multiplication and Division ====
These are the operators `*` and `/` .

|| || '''bag''' || '''tuple''' || '''map''' || '''int''' || '''long''' || '''float''' || '''double''' || '''chararray''' || '''unknown''' ||
|| '''bag''' || error || error || error || not yet || not yet || not yet || not yet || error || as double, not yet ||
|| '''tuple''' || || error || error || not yet || not yet || not yet || not yet || error || error ||
|| '''map''' || || || error || error || error || error || error || error || error ||
|| '''int''' || || || || yes || yes || yes || yes || error || as int ||
|| '''long''' || || || || || yes || yes || yes || error || as long ||
|| '''float''' || || || || || || yes || yes || error || as float ||
|| '''double''' || || || || || || || yes || error || as double ||
|| '''chararray''' || || || || || || || || error || error ||
|| '''unknown''' || || || || || || || || || as double ||

[[Anchor(Modulo)]]
==== Modulo ====
This is a new operator, `%` .

|| || '''bag''' || '''tuple''' || '''map''' || '''int''' || '''long''' || '''float''' || '''double''' || '''chararray''' || '''unknown''' ||
|| '''bag''' || error || error || error || error || error || error || error || error || error ||
|| '''tuple''' || || not yet || error || error || error || error || error || error || as tuple, not yet ||
|| '''map''' || || || error || error || error || error || error || error || error ||
|| '''int''' || || || || yes || yes || error || error || error || as int ||
|| '''long''' || || || || || yes || error || error || error || as long ||
|| '''float''' || || || || || || error || error || error || error ||
|| '''double''' || || || || || || || error || error || error ||
|| '''chararray''' || || || || || || || || error || error ||
|| '''unknown''' || || || || || || || || || as long ||

We may choose not to implement the mod operator right away, as there are no
immediate user requests for it.


[[Anchor(Addition_and_Subtraction)]]
==== Addition and Subtraction ====
These are the operators `+` and `-` .

|| || '''bag''' || '''tuple''' || '''map''' || '''int''' || '''long''' || '''float''' || '''double''' || '''chararray''' || '''unknown''' ||
|| '''bag''' || error || error || error || error || error || error || error || error || error ||
|| '''tuple''' || || not yet || error || error || error || error || error || error || as tuple, not yet ||
|| '''map''' || || || error || error || error || error || error || error || error ||
|| '''int''' || || || || yes || yes || yes || yes || error || as int ||
|| '''long''' || || || || || yes || yes || yes || error || as long ||
|| '''float''' || || || || || || yes || yes || error || as float ||
|| '''double''' || || || || || || || yes || error || as double ||
|| '''chararray''' || || || || || || || || error || error ||
|| '''unknown''' || || || || || || || || || as double ||

[[Anchor(Concat)]]
=== Concat ===
This is a new operator.

A new operator, concat, will be added for chararrays.
|| || '''chararray''' || '''unknown''' ||
|| '''chararray''' || Two chararrays can be concatenated if they have the same encoding.  Given a chararray A with encoding X (where X is not none) and chararray B with encoding none, chararray B can be encoded using X, and the concatenation then done.  The resulting chararray will have encoding X. || as chararray ||
|| '''unknown''' || || as chararray(none) ||


[[Anchor(Unary_Operators)]]
=== Unary Operators ===
[[Anchor(Negation)]]
==== Negation ====
This is the unary operator `-` .

|| '''bag'''           || error ||
|| '''tuple'''         || error ||
|| '''map'''           || error ||
|| '''int'''           || yes   ||
|| '''long'''          || yes   ||
|| '''float'''         || yes   ||
|| '''double'''        || yes   ||
|| '''chararray'''     || error ||
|| '''unknown'''       || as double ||

[[Anchor(Size_)]]
==== Size  ====
This is a new operator.

A new operator, size, will be added that returns the size of a datum.
|| '''bag'''           || returns number of elements in bag ||
|| '''tuple'''         || returns arity of the tuple ||
|| '''map'''           || returns number of key/value pairs in map ||
|| '''int'''           || returns 1 ||
|| '''long'''          || returns 1 ||
|| '''float'''         || returns 1 ||
|| '''double'''        || returns 1 ||
|| '''chararray'''     || returns number of characters in the array ||
|| '''unknown'''       || as chararray(none) ||

[[Anchor(Tuple_Dereference_Operator)]]
==== Tuple Dereference Operator ====
This is the operator `.` .

For tuples, dereferencing into the tuple via `.` will continue to be
supported.  This dereferencing can be done via name or position.  For
example, `mytuple.myfield` and `mytuple.$0` are both valid dereferences.  If
the dot operator is applied to an unknown type, the type will be assumed to
be a tuple.

[[Anchor(Map_Dereference_Operator)]]
==== Map Dereference Operator ====
This is the operator `#` .

For maps, dereferencing into the map via `#` will continue to be supported.
This dereferencing must be done by key, for example `mymap#mykey`.  If the
pound operator is applied to an unknown type, the type will be assumed to be
a map.

[[Anchor(Cast_Operators)]]
=== Cast Operators ===
Casts will be added to the language.  Casts will only be supported between
atomic types.

C/Java-like syntax will be used for casts (e.g. `(int)mydouble`).

Casts apply only to atomic types:
|| '''from \ to''' || '''int''' || '''long''' || '''float''' || '''double''' || '''chararray''' || '''unknown''' ||
|| '''int''' || || yes || yes || yes || yes || error ||
|| '''long''' || yes || || yes || yes || yes || error ||
|| '''float''' || yes || yes || || yes || yes || error ||
|| '''double''' || yes || yes || yes || || yes || error ||
|| '''chararray''' || yes || yes || yes || yes || Casts between different character encodings || error ||
|| '''unknown''' || yes || yes || yes || yes || yes || ||

Downcasts may cause loss of data (i.e., casting from long to int may drop bits).
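A sketch of the proposed syntax (the field names are hypothetical):
{{{
a = LOAD 'myfile' AS (salary double, count);
-- an explicit downcast of a double, and a cast of an unknown to int
b = FOREACH a GENERATE (long)salary, (int)count;
}}}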

[[Anchor(Aggregate_Functions)]]
=== Aggregate Functions ===
The aggregate functions take a bag of tuples and an indicator of which field
in the tuple to compute on.

In the following chart, the entry `error` indicates that applying that
aggregate function to a field of this type is an error.  A data type (e.g.
long) indicates that applying that aggregate function to a field of this
type returns the indicated data type.
|| || '''int''' || '''long''' || '''float''' || '''double''' || '''chararray''' || '''unknown''' ||
|| COUNT || long || long || long || long || long || long ||
|| SUM || long || long || double || double || error || as double ||
|| AVG || long || long || double || double || error || as double ||
|| MIN || int || long || float || double || chararray || as double ||
|| MAX || int || long || float || double || chararray || as double ||
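To sketch the result types above in use (the schema here is hypothetical):
{{{
a = LOAD 'myfile' AS (dept chararray, salary float, headcount int);
b = GROUP a BY dept;
-- per the table above: SUM over a float field yields a double, SUM over
-- an int field yields a long, and COUNT always yields a long
c = FOREACH b GENERATE group, SUM(salary), SUM(headcount), COUNT(a);
}}}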

[[Anchor(Argument_Construction_for_Functions)]]
=== Argument Construction for Functions ===
Because of the way pig breaks up grouping and cogrouping separately from the
application of aggregation functions and flattening, people who are used to
SQL struggle to understand how to use expression operators in an aggregate
function.  Consider the following SQL and pig queries:

{{{
-- SQL query 1
SELECT name, sum(salary * bonus_multiplier)
FROM employee
GROUP BY name;
}}}

{{{
# pig latin query 1
employee = LOAD 'employee' AS (name, salary, bonus_multiplier);
grouped = GROUP employee BY name;
total_compensation = FOREACH grouped GENERATE group, SUM(salary * bonus_multiplier);
}}}

At first glance, it would appear that these two are functionally equivalent.
But they are not.  The schema of `grouped` in the pig query is
{group, {[name, salary, bonus_multiplier]}} (where {} indicates a bag and []
a tuple).  The FOREACH in the last line serializes the outer bag, so that
each tuple in it is processed separately.  The function SUM() is then passed
two bags, one containing a tuple for each of the entries in
`grouped.employee:salary` and one containing a tuple for each of the entries
in `grouped.employee:bonus_multiplier`.  Multiplying these two bags is not
what the user intends in this case, and is not a well defined mathematical
operation.

The proposed solution is to change the semantics of pig, so that expression 
evaluation on function arguments is done before the arguments are
constructed as bags of tuples, rather than afterwards.  This means that the 
semantics would change so that `SUM(salary * bonus_multiplier)` means
that for each tuple in `grouped`, the fields `grouped.employee:salary` and 
`grouped.employee:bonus_multiplier` will be multiplied and the
result formed into tuples that are placed in a bag to be passed to the function 
`SUM()`.  Note that this will not resolve all problems, such as the following:

{{{
-- SQL query 2
SELECT name, sum(e.salary * b.multiplier)
FROM employee JOIN bonuses USING (name)
GROUP BY name;
}}}

This could still not be replicated in pig latin as
{{{
# pig latin query 2
employee = LOAD 'employee' AS (name, salary);
bonuses = LOAD 'bonuses' AS (name, multiplier);
grouped = COGROUP employee BY name, bonuses BY name;
total_compensation = FOREACH grouped GENERATE group, SUM(employee:salary * bonuses:multiplier);
}}}

This would not work because `grouped.employee:salary` and 
`grouped.bonuses:multiplier` are in different bags.  There is thus no way to
intelligently construct the bag of tuples `salary * multiplier`.  This would 
instead need to be recast as:
{{{
# pig latin query 2.1
employee = LOAD 'employee' AS (name, salary);
bonuses = LOAD 'bonuses' AS (name, multiplier);
grouped = COGROUP employee BY name, bonuses BY name;
flattened = FOREACH grouped GENERATE group, FLATTEN(employee), FLATTEN(bonuses);
grouped_again = GROUP flattened BY group;
total_compensation = FOREACH grouped_again GENERATE group, SUM(employee:salary * bonuses:multiplier);
}}}

This double grouping is unfortunate.  But given that the user wishes to do 
arithmetic evaluation across bags, this seems unavoidable.  The
language will need to detect this and error out when a query such as query 2 is 
given.
