The following page has been changed by OlgaN:
=== Pig Latin Schemas ===
==== Defining a schema in a LOAD statement ====
The basic grammar for schema definition is taken from the JSON/Python
tuple/list/map definition, and is as follows:
field1 = Atom alias
name : (f1, f2, ...) = Tuple alias and schema
name : [f1, f2, ...] = Bag alias and schema
So the schema:
(time, query : (display, normalized), results : [url, title, summary])
would define a Tuple where the first field is an Atom called "time",
the second field is a Tuple called "query" with the Atom fields
"display" and "normalized", and the third field is a Bag called
"results", which contains tuples that have three Atom fields
"url", "title" and "summary".
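To make the grammar concrete, here is a minimal Python sketch of a parser for it. It is an illustrative model only, not Pig's own parser: the function names and the `(name, kind, children)` output shape are assumptions made for this example. It treats `( )` as a Tuple and `[ ]` as a Bag, as described above.

```python
# Illustrative model of the schema grammar; not Pig's actual parser.

def tokenize(text):
    """Split a schema string into names and the punctuation ( ) [ ] , :"""
    tokens, name = [], ""
    for ch in text:
        if ch.isalnum() or ch == "_":
            name += ch
            continue
        if name:
            tokens.append(name)
            name = ""
        if ch in "()[],:":
            tokens.append(ch)
        elif not ch.isspace():
            raise ValueError(f"unexpected character {ch!r}")
    if name:
        tokens.append(name)
    return tokens


def parse_fields(tokens, pos, closer):
    """Parse 'field, field, ...' starting at pos, up to the bracket `closer`.

    Returns (fields, position just after the closing bracket).
    """
    fields = []
    while True:
        if pos >= len(tokens) or tokens[pos] in "()[],:":
            raise ValueError("expected a field name")
        name = tokens[pos]
        pos += 1
        if pos < len(tokens) and tokens[pos] == ":":
            # "name : ( ... )" is a Tuple, "name : [ ... ]" is a Bag.
            opener = tokens[pos + 1] if pos + 1 < len(tokens) else None
            if opener == "(":
                kind, inner_closer = "tuple", ")"
            elif opener == "[":
                kind, inner_closer = "bag", "]"
            else:
                raise ValueError(f"expected '(' or '[' after '{name} :'")
            children, pos = parse_fields(tokens, pos + 2, inner_closer)
            fields.append((name, kind, children))
        else:
            fields.append((name, "atom", None))
        if pos >= len(tokens):
            raise ValueError(f"missing closing {closer!r}")
        if tokens[pos] == ",":
            pos += 1
        elif tokens[pos] == closer:
            return fields, pos + 1
        else:
            raise ValueError(f"unexpected token {tokens[pos]!r}")


def parse_schema(text):
    """Parse a whole schema such as '(time, query : (display, normalized))'."""
    tokens = tokenize(text)
    if not tokens or tokens[0] != "(":
        raise ValueError("a schema must start with '('")
    fields, pos = parse_fields(tokens, 1, ")")
    if pos != len(tokens):
        raise ValueError("trailing input after schema")
    return fields
```

Calling `parse_schema` on the example schema above yields three entries: an atom `time`, a tuple `query` with two atom children, and a bag `results` with three atom children.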
The "AS" keyword on a LOAD statement allows you to define a schema for a
particular alias. For example,
A = load 'input1' as (tstamp, cookie, query);
B = load 'input2' as (query, url, rank);
associates schemas with A and B.
==== Schema Propagation ====
The system will do its best to infer the schema for a derived alias based on
the schemas of the input aliases.
Continuing with our running example, suppose we have
C = cogroup A by query, B by query;
Then C will be assigned the schema `(group, A: [tstamp, cookie, query], B: [query, url, rank])`.
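The propagation rule for this cogroup can be sketched in a few lines of Python. This is only a model of the rule as described here (a `group` field followed by one bag per input, each carrying that input's fields); the function name `cogroup_schema` is hypothetical and nothing here reflects how Pig implements inference internally.

```python
# Illustrative model of schema propagation through COGROUP; not Pig code.

def cogroup_schema(inputs):
    """Build the schema string of a cogroup result.

    inputs: list of (alias, [field names]) in the order given to COGROUP.
    The result starts with `group`, followed by one bag per input alias.
    """
    parts = ["group"]
    for alias, fields in inputs:
        parts.append(f"{alias}: [{', '.join(fields)}]")
    return "(" + ", ".join(parts) + ")"


a_fields = ["tstamp", "cookie", "query"]   # schema of A from the LOAD example
b_fields = ["query", "url", "rank"]        # schema of B from the LOAD example
c_schema = cogroup_schema([("A", a_fields), ("B", b_fields)])
```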
==== Referring to Nested Fields, i.e., Nested Projection ====
You can refer to fields up to 1 level below in the nesting. Thus, in the above
example, you can say,
foreach C generate group, A.cookie
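As a rough picture of what this projection computes, the Python sketch below models one row of C as `(group, A_bag, B_bag)`, with each bag a list of tuples. It assumes that `A.cookie` yields a bag of one-field tuples holding just the `cookie` values. The helper names and the sample row are hypothetical, made up for illustration.

```python
# Illustrative model of nested projection on one row of C; not Pig code.

def project_bag(bag, bag_fields, name):
    """Keep only the field `name` from every tuple in the bag."""
    index = bag_fields.index(name)
    return [(row[index],) for row in bag]


def generate_group_and_field(c_row, a_fields, name):
    """Model of `foreach C generate group, A.<name>` for a single row."""
    group, a_bag, _b_bag = c_row
    return (group, project_bag(a_bag, a_fields, name))


# Hypothetical sample row of C for the query "pig".
sample_row = (
    "pig",
    [(1, "c1", "pig"), (2, "c2", "pig")],   # tuples of A: tstamp, cookie, query
    [("pig", "http://example.com/1", 1)],    # tuples of B: query, url, rank
)
projected = generate_group_and_field(sample_row, ["tstamp", "cookie", "query"], "cookie")
```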
==== Name Ambiguity Resolution ====
Sometimes, when using FLATTEN, there might be name ambiguities in schemas from
two different inputs. Thus, if in the above example, we write
D = foreach C generate flatten(A), flatten(B);
There will be a name ambiguity since both flatten(A) and flatten(B) have the
field `query`. To avoid ambiguity in such cases, fields can be referred to by
`<outer-alias>::fieldName`. Thus for D, we can refer to either `A::query` or
`B::query` but not to `query`.
However, unambiguous fields can be accessed either by name or by
`<outer-alias>::fieldName`. Thus for D, both `url` and `B::url` access
the same field.
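The lookup rule can be modeled with a short Python sketch. It assumes the flattened result keeps each field's outer alias alongside its name; the `resolve` function and the list-of-pairs representation are hypothetical, chosen only to illustrate the rule.

```python
# Illustrative model of field-name resolution after FLATTEN; not Pig code.

def resolve(name, fields):
    """Return the position of `name` in `fields`.

    fields: list of (outer_alias, field_name) pairs in output order.
    `name` is either a bare field name or `<outer-alias>::fieldName`.
    """
    if "::" in name:
        alias, field = name.split("::", 1)
        matches = [i for i, pair in enumerate(fields) if pair == (alias, field)]
    else:
        matches = [i for i, (_, field) in enumerate(fields) if field == name]
    if not matches:
        raise KeyError(f"unknown field {name!r}")
    if len(matches) > 1:
        raise KeyError(f"ambiguous field {name!r}; use <outer-alias>::{name}")
    return matches[0]


# Fields produced by flatten(A), flatten(B) in the running example.
d_fields = [
    ("A", "tstamp"), ("A", "cookie"), ("A", "query"),
    ("B", "query"), ("B", "url"), ("B", "rank"),
]
```

With these fields, `resolve("url", d_fields)` and `resolve("B::url", d_fields)` return the same position, while `resolve("query", d_fields)` raises a `KeyError` because two inputs contribute a `query` field.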
==== Assigning Names to Individual Items in GENERATE ====
Just like in SQL where you can give names to individual items in the select
list, we can name individual items in the generate clause using AS. Thus,
E = foreach D generate (cookie eq 'null' ? 'null' : url) as nullifiedUrl, rank as myRank;
assigns the schema `(nullifiedUrl, myRank)` to E.
==== Schemas of Functions ====
Eval functions can specify their own output schema by overriding the
outputSchema() method. The builtin function SUM specifies that its output is
called `sum`. Thus,
F = foreach C generate group, SUM(A.tstamp);
F gets assigned the schema `(group, sum)`. This can of course be overridden,
e.g., `generate group, SUM(A.tstamp) as sumTstamp`.
==== Last Resort: Overriding system-inferred schemas ====
Sometimes the system cannot infer a schema (e.g., for binconds, or for eval
functions that don't specify one). In these cases, and in any others where you
want to override the system-inferred schema, you can do so using the AS clause.
Thus, you can write
C = (cogroup A by query, B by query) as (group, foo, bar);
and C would be assigned the schema `(group, foo, bar)`.