What is a relation?

Alan Gates Fri, 05 Dec 2008 18:05:35 -0800

All,

A question on types in pig.  When you say:


A = load 'myfile';

what exactly is A? For the moment let us call A a relation, since it is a set of records, and we can pass it to a relational operator, such as FILTER, ORDER, etc.

To clarify the question, is a relation equivalent to a bag? In some ways it seems to be in our current semantics. Certainly you can turn a relation into a bag:


A = load 'myfile';
B = group A all;

The schema of the relation B at this point is <group, A>, where A is a bag. This does not necessarily mean that a relation is a bag, because an operation had to occur to turn the relation into a bag (the group all).

But bags can be turned into relations, and then treated again as if they were bags:


C = foreach B {
       C1 = filter A by $0 > 0;
       generate COUNT(C1);
}

Here the bag A created in the previous grouping step is being treated as if it were a relation and passed to a relational operator, and the resulting relation (C1) is treated as a bag to be passed to COUNT. So at a very minimum it seems that a bag is a type of relation, even if not all relations are bags.
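
For comparison, the same count can be written with the filter applied at the top level, before the group. This is only a sketch: the aliases P, G and D are made up for illustration, 'myfile' is the same placeholder as above, and the two forms may not agree in edge cases such as no record passing the filter.

```pig
A = load 'myfile';
-- filter the top-level relation first
P = filter A by $0 > 0;
-- then collect the surviving records into a single bag
G = group P all;
-- inside G, P is a bag and can be passed to COUNT
D = foreach G generate COUNT(P);
```

In this form the filter operates on a top-level relation, whereas in the nested form it operates on a bag inside one record.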

But, if top level (non-nested) relations are bags, why isn't it legal to do:


A = load 'myfile';
B = A.$0;

The second statement would be legal nested inside a foreach, but is not legal at the top level.
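
For reference, the same projection can be written at the top level with foreach ... generate. A minimal sketch, reusing the 'myfile' placeholder:

```pig
A = load 'myfile';
-- top-level projection of the first field, in place of B = A.$0
B = foreach A generate $0;
```

So the operation itself is available at the top level; only the A.$0 syntax is restricted to the nested context.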

We have been aware of this discrepancy for a while, and lived with it. But I believe it is time to resolve it. We've noticed that some parts of pig assume an equivalence between bag and relation (e.g. the typechecker) and other parts do not (e.g. the syntax example above). This inconsistency is confusing to users and developers alike. As Pig Latin matures we need to strive to make it a logically coherent and complete language.


So, thoughts on how it ought to be?

The advantage I see for saying a relation is equivalent to a bag is simplicity of the language. There is no need to introduce another data type. And it allows full relational operations to occur at both the top level and nested inside foreach.

But this simplicity also seems to me the downside. Are we decoupling the user so far from the underlying implementation that he will not be able to see side effects of his actions? A top level relation is presumably spread across many chunks and any operation on it will require one or more map reduce jobs, whereas a relation nested in a foreach is contained on one node. This also makes pig much more complex, because while it may hide this level of detail from the user, it clearly has to understand the difference between top level and nested operations and handle both cases.


Alan.
