All of what you say sounds like a feature to me rather than a problem. Yes, the implementor needs to do it right, but that kind of goes with the territory.
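For concreteness, here is a minimal sketch of the two forms compared in the message quoted below: the explicit group ALL plus COUNT that works today, and the direct top-level COUNT(A) that would have to compile to the same plan if relations and bags were one type. The second form is only the proposal under discussion, not current syntax, so it is left commented out; 'bla' is the placeholder file name from the thread.

```pig
-- Explicit form: collapse the relation into a single bag, then count it.
A = load 'bla';
B = group A all;
C = foreach B generate COUNT(A);

-- Proposed implicit form (hypothetical, not legal at the top level today);
-- it would need the same group ALL + COUNT done under the covers:
-- C = COUNT(A);
```

Either way the same map reduce boundary is needed; the only question is whether the user writes it or the compiler inserts it.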
On Thu, Dec 11, 2008 at 11:32 AM, Pradeep Kamath <[email protected]> wrote:

> I find it somewhat inconsistent that we treat both relations and bags
> the same.
>
> SIZE(A) where A is a real bag will be different in implementation than
> SIZE(A) where A is a relation. For the former, all the data is already
> in a container and one can just inspect the size. For the latter, you
> have to do a group ALL-COUNT. This would be very confusing from a
> backend implementation point of view.
>
> If we do treat relations and bags as equivalent, then all statements
> which currently work on relations should work on bags (say in my input
> data). Here is an example:
> A = load 'bla' as (bg:{t:(x:int, y, z)}, str:chararray);
> B = filter A.bg by x < 100; -- Directly access the bag "bg" inside A
> (which is supposed to be a bag too) and filter on it; likewise, other
> operations possible on relations should work.
>
> Also A = load 'bla'; B = COUNT(A); will have to be supported (implicitly
> by a map reduce boundary doing a group ALL-COUNT). This will be done
> under the covers and it may not be obvious to a user that an explicit
> group ALL-COUNT and a direct COUNT(A) are the same.
>
>
> Thanks,
> Pradeep
>
> -----Original Message-----
> From: Olga Natkovich [mailto:[email protected]]
> Sent: Thursday, December 11, 2008 11:12 AM
> To: [email protected]
> Subject: RE: What is a relation?
>
> I think we should consider bags and relations to be the same so that we
> can handle processing in the outer script as well as inside of nested
> foreach the same and make it easier to extend the set of operators
> allowed inside of a foreach block.
>
> Olga
>
> > -----Original Message-----
> > From: Alan Gates [mailto:[email protected]]
> > Sent: Friday, December 05, 2008 6:04 PM
> > To: [email protected]
> > Subject: What is a relation?
> >
> > All,
> >
> > A question on types in pig. When you say:
> >
> > A = load 'myfile';
> >
> > what exactly is A? For the moment let us call A a relation,
> > since it is a set of records, and we can pass it to a
> > relational operator, such as FILTER, ORDER, etc.
> >
> > To clarify the question, is a relation equivalent to a bag?
> > In some ways it seems to be in our current semantics.
> > Certainly you can turn a relation into a bag:
> >
> > A = load 'myfile';
> > B = group A all;
> >
> > The schema of the relation B at this point is <group, A>,
> > where A is a bag. This does not necessarily mean that a
> > relation is a bag, because an operation had to occur to turn
> > the relation into a bag (the group all).
> >
> > But bags can be turned into relations, and then treated again
> > as if they were bags:
> >
> > C = foreach B {
> >     C1 = filter A by $0 > 0;
> >     generate COUNT(C1);
> > }
> >
> > Here the bag A created in the previous grouping step is being
> > treated as if it were a relation and passed to a relational
> > operator, and the resulting relation (C1) treated as a bag to
> > be passed to COUNT. So at a very minimum it seems that a bag is
> > a type of relation, even if not all relations are bags.
> >
> > But, if top level (non-nested) relations are bags, why isn't
> > it legal to do:
> >
> > A = load 'myfile';
> > B = A.$0;
> >
> > The second statement would be legal nested inside a foreach,
> > but is not legal at the top level.
> >
> > We have been aware of this discrepancy for a while, and lived
> > with it. But I believe it is time to resolve it. We've
> > noticed that some parts of pig assume an equivalence between
> > bag and relation (e.g. the typechecker) and other parts do not
> > (e.g. the syntax example above).
> > This inconsistency is confusing to users and developers
> > alike. As Pig Latin matures we need to strive to make it a
> > logically coherent and complete language.
> >
> > So, thoughts on how it ought to be?
> >
> > The advantage I see for saying a relation is equivalent to a
> > bag is simplicity of the language. There is no need to
> > introduce another data type. And it allows full relational
> > operations to occur at both the top level and nested inside foreach.
> >
> > But this simplicity also seems to me the downside. Are we
> > decoupling the user so far from the underlying implementation
> > that he will not be able to see side effects of his actions?
> > A top level relation is presumably spread across many chunks
> > and any operation on it will require one or more map reduce
> > jobs, whereas a relation nested in a
> > foreach is contained on one node. This also makes pig much more
> > complex, because while it may hide this level of detail from
> > the user, it clearly has to understand the difference between
> > top level and nested operations and handle both cases.
> >
> > Alan.
> >
>

--
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
