I think the question isn't about feature or problem, or about how hard it is to implement. It's about what level we want the language at. The advantage of making a relation equivalent to a bag, as Ted and others point out, is ease of comprehension on the part of the users. They are not required to maintain an artificial distinction between what's happening in parallel and what's happening on a single node. To put Pradeep's point a slightly different way, by hiding this distinction we make it harder for the users to understand how Pig will process their scripts. Moving Pig Latin further from execution will mean that users will, at times, make less optimal choices in writing their scripts, because they may not realize that counting a bag at the top level has a very different cost than counting a bag nested in a foreach. So the choice is between a higher level abstraction that is easier to think about (e.g. Python) and a lower level abstraction that forces the user to think more like the machine and thus hopefully make better choices (e.g. C). It sounds like most of the community is voting for the higher abstraction.
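
As a rough sketch of that cost difference, using the schema from Pradeep's example quoted below (the file name 'bla' and the aliases are only illustrative, and this is existing syntax, not a proposal):

```pig
-- Top level: counting a whole relation needs an explicit group all,
-- which means a map reduce boundary.
A = load 'bla' as (bg:{t:(x:int, y, z)}, str:chararray);
B = group A all;
C = foreach B generate COUNT(A);

-- Nested: bg is a bag held inside each record of A, so it is counted
-- per record with no grouping step.
D = foreach A generate COUNT(bg);
```

If relations and bags become equivalent, a direct COUNT(A) at the top level would presumably have to expand to the first form under the covers.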

To respond to Pradeep's statement about the filter, namely that

B = filter A.bg by x < 100;

(which, if I understood correctly, would mean keeping only the records of bg where x < 100) should be legal if we say all relations are bags: I think this is incorrect. The filter in this statement is acting on A, not A.bg. The correct way to write the above statement would be:

B = foreach A {
        A1 = filter bg by x < 100;
        generate A1;
}

I believe this holds whatever we say about relations being bags. So the semantics are that a relational operator always applies only to the relation/bag it is given, never to a bag nested inside it. To reach the elements of a relation/bag, the foreach operator is provided.


On Dec 11, 2008, at 12:09 PM, Ted Dunning wrote:

All of what you say sounds like a feature to me rather than a problem.

Yes, the implementor needs to do it right, but that kind of goes with the

On Thu, Dec 11, 2008 at 11:32 AM, Pradeep Kamath <prade...@yahoo-inc.com> wrote:

I find it somewhat inconsistent that we treat both relations and bags
the same.

SIZE(A) where A is a real bag will be different in implementation than
SIZE(A) where A is a relation. For the former, all the data is already
in a container and one can just inspect the size. For the latter, you
have to do a group ALL followed by COUNT. This would be very confusing
from a backend implementation point of view.

If we do treat relations and bags as equivalent, then all statements
which currently work on relations should work on bags (say in my input
data). Here is an example:
A = load 'bla' as (bg:{t:(x:int, y, z)}, str:chararray);
B = filter A.bg by x < 100; -- Directly access the bag "bg" inside A
-- (which is supposed to be a bag too) and filter on it. Likewise, other
-- operations possible on relations should work.

Also A = load 'bla'; B = COUNT(A); will have to be supported (implicitly
by a map reduce boundary doing a group ALL followed by COUNT). This will
be done under the covers, and it may not be obvious to a user that an
explicit group ALL followed by COUNT and a direct COUNT(A) are the same.


-----Original Message-----
From: Olga Natkovich [mailto:ol...@yahoo-inc.com]
Sent: Thursday, December 11, 2008 11:12 AM
To: pig-dev@hadoop.apache.org
Subject: RE: What is a relation?

I think we should consider bags and relations to be the same, so that we
can handle processing in the outer script and inside a nested foreach
the same way, and make it easier to extend the set of operators
allowed inside a foreach block.


-----Original Message-----
From: Alan Gates [mailto:ga...@yahoo-inc.com]
Sent: Friday, December 05, 2008 6:04 PM
To: pig-dev@hadoop.apache.org
Subject: What is a relation?


A question on types in pig.  When you say:

A = load 'myfile';

what exactly is A?  For the moment let us call A a relation,
since it is a set of records, and we can pass it to a
relational operator, such as FILTER, ORDER, etc.

To clarify the question, is a relation equivalent to a bag?
In some ways it seems to be in our current semantics.
Certainly you can turn a relation into a bag:

A = load 'myfile';
B = group A all;

The schema of the relation B at this point is <group, A>,
where A is a bag.  This does not necessarily mean that a
relation is a bag, because an operation had to occur to turn
the relation into a bag (the group all).

But bags can be turned into relations, and then treated again
as if they were bags:

C = foreach B {
        C1 = filter A by $0 > 0;
        generate COUNT(C1);
}

Here the bag A created in the previous grouping step is being
treated as if it were a relation and passed to a relational
operator, and the resulting relation (C1) is treated as a bag to
be passed to COUNT.  So at a very minimum it seems that a bag is
a type of relation, even if not all relations are bags.

But, if top level (non-nested) relations are bags, why isn't
it legal to do:

A = load 'myfile';
B = A.$0;

The second statement would be legal nested inside a foreach,
but is not legal at the top level.
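
For comparison, a minimal sketch of how the same projection is written at the top level today (the alias names are illustrative):

```pig
A = load 'myfile';
-- Projecting the first field of a top level relation goes through foreach.
B = foreach A generate $0;
```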

We have been aware of this discrepancy for a while, and lived
with it.  But I believe it is time to resolve it.  We've
noticed that some parts of pig assume an equivalence between
bag and relation (e.g. the typechecker) and other parts do
not (e.g. the syntax example above).
This inconsistency is confusing to users and developers
alike.  As Pig Latin matures we need to strive to make it a
logically coherent and complete language.

So, thoughts on how it ought to be?

The advantage I see for saying a relation is equivalent to a
bag is simplicity of the language.  There is no need to
introduce another data type.  And it allows full relational
operations to occur at both the top level and nested inside foreach.

But this simplicity also seems to me the downside.  Are we
decoupling the user so far from the underlying implementation
that he will not be able to see the side effects of his actions?
A top level relation is presumably spread across many chunks,
and any operation on it will require one or more map reduce
jobs, whereas a relation nested in a foreach is contained on
one node.  This also makes pig much more
complex, because while it may hide this level of detail from
the user, it clearly has to understand the difference between
top level and nested operations and handle both cases.


Ted Dunning, CTO
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
650-324-0110, ext. 738
858-414-0013 (m)
