All of what you say sounds like a feature to me rather than a problem.

Yes, the implementor needs to do it right, but that kind of goes with the
territory.

On Thu, Dec 11, 2008 at 11:32 AM, Pradeep Kamath <prade...@yahoo-inc.com> wrote:

> I find it somewhat inconsistent that we treat both relations and bags
> the same.
>
> SIZE(A) where A is a real bag will be different in implementation than
> SIZE(A) where A is a relation - For the former, all the data is already
> in a container and one can just inspect the size. For the latter, you
> have to do a group ALL-COUNT - this would be very confusing from a
> backend implementation point of view.
>
> If we do treat relations and bags as equivalent, then all statements
> which currently work on relations should work on bags (say in my input
> data). Here is an example:
> A = load 'bla' as (bg:{t:(x:int, y, z)}, str:chararray);
> B = filter A.bg by x < 100;
> This directly accesses the bag "bg" inside A (which is supposed to be a
> bag too) and filters on it. Likewise, other operations possible on
> relations should work.
>
> Also A = load 'bla'; B = COUNT(A); will have to be supported (implicitly
> by a map reduce boundary doing a group ALL - COUNT). This will be done
> under the covers and it may not be obvious to a user that an explicit
> group ALL - COUNT and a direct COUNT(A) are the same.
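>
> For reference, a minimal sketch of the explicit form, assuming the same
> input 'bla' (the aliases G and B are illustrative only):
> A = load 'bla';
> G = group A all;                  -- a single group whose bag A holds every record
> B = foreach G generate COUNT(A);  -- count the records in that bag
> Under the proposal, a direct B = COUNT(A); would presumably have to
> produce the same result.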
>
>
> Thanks,
> Pradeep
>
> -----Original Message-----
> From: Olga Natkovich [mailto:ol...@yahoo-inc.com]
> Sent: Thursday, December 11, 2008 11:12 AM
> To: pig-dev@hadoop.apache.org
> Subject: RE: What is a relation?
>
> I think we should consider bags and relations to be the same, so that
> we can handle processing in the outer script and inside a nested
> foreach the same way, and make it easier to extend the set of
> operators allowed inside a foreach block.
>
> Olga
>
> > -----Original Message-----
> > From: Alan Gates [mailto:ga...@yahoo-inc.com]
> > Sent: Friday, December 05, 2008 6:04 PM
> > To: pig-dev@hadoop.apache.org
> > Subject: What is a relation?
> >
> > All,
> >
> > A question on types in pig.  When you say:
> >
> > A = load 'myfile';
> >
> > what exactly is A?  For the moment let us call A a relation,
> > since it is a set of records, and we can pass it to a
> > relational operator, such as FILTER, ORDER, etc.
> >
> > To clarify the question, is a relation equivalent to a bag?
> > In some ways it seems to be in our current semantics.
> > Certainly you can turn a relation into a bag:
> >
> > A = load 'myfile';
> > B = group A all;
> >
> > The schema of the relation B at this point is <group, A>,
> > where A is a bag.  This does not necessarily mean that a
> > relation is a bag, because an operation had to occur to turn
> > the relation into a bag (the group all).
> >
> > But bags can be turned into relations, and then treated again
> > as if they were bags:
> >
> > C = foreach B {
> >         C1 = filter A by $0 > 0;
> >         generate COUNT(C1);
> > }
> >
> > Here the bag A created in the previous grouping step is being
> > treated as if it were a relation and passed to a relational
> > operator, and the resulting relation (C1) is treated as a bag
> > to be passed to COUNT.  So at a very minimum it seems that a
> > bag is a type of relation, even if not all relations are bags.
> >
> > But, if top level (non-nested) relations are bags, why isn't
> > it legal to do:
> >
> > A = load 'myfile';
> > B = A.$0;
> >
> > The second statement would be legal nested inside a foreach,
> > but is not legal at the top level.
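> >
> > As a sketch of the contrast (the aliases B and C are
> > illustrative), the same projection is accepted once A has
> > become a bag and is referenced inside a foreach:
> >
> > A = load 'myfile';
> > B = group A all;
> > C = foreach B generate A.$0;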
> >
> > We have been aware of this discrepancy for a while, and lived
> > with it.  But I believe it is time to resolve it.  We've
> > noticed that some parts of pig assume an equivalence between
> > bag and relation (e.g. the typechecker) and other parts do
> > not (e.g. the syntax example above).  This inconsistency is
> > confusing to users and developers alike.  As Pig Latin
> > matures we need to strive to make it a logically coherent and
> > complete language.
> >
> > So, thoughts on how it ought to be?
> >
> > The advantage I see for saying a relation is equivalent to a
> > bag is simplicity of the language.  There is no need to
> > introduce another data type.  And it allows full relational
> > operations to occur at both the top level and nested inside foreach.
> >
> > But this simplicity also seems to me the downside.  Are we
> > decoupling the user so far from the underlying implementation
> > that he will not be able to see side effects of his actions?
> > A top level relation is presumably spread across many chunks
> > and any operation on it will require one or more map reduce
> > jobs, whereas a relation nested in a foreach is contained on
> > one node.  This also makes pig much more complex, because
> > while it may hide this level of detail from the user, it
> > clearly has to understand the difference between top level
> > and nested operations and handle both cases.
> >
> > Alan.
> >
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
