pi song
Fri, 06 Jun 2008 18:01:36 -0700
We should update the Pig Wiki to reflect this. Even I have always
thought that our semantics were bag == multiset. The only operation that
results in an "ordered bag" is ORDER, and ordered bags are not closed under
the other operators: an operation applied to an ordered bag does not
necessarily produce an ordered bag. For example:
B = ORDER A BY $0 ;
C = FILTER B BY $0 == 0 ;
The FILTER operator doesn't preserve ordered-bag closure, so its output is
only a plain (unordered) bag.
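
To make the distinction concrete, here is a minimal sketch in Python (not Pig code) of the two readings. It assumes a bag can be modelled as a Counter (multiset) and an ordered bag as a list; the relation A and the helper as_bag are made up for illustration.

```python
from collections import Counter

def as_bag(tuples):
    """Model a bag as a multiset: duplicates count, order does not."""
    return Counter(tuples)

# Hypothetical input relation A (illustration only).
A = [(0, "x"), (2, "y"), (0, "z"), (1, "w")]

# B = ORDER A BY $0 ; modelled as a sorted list (an "ordered bag").
B = sorted(A, key=lambda t: t[0])

# C = FILTER B BY $0 == 0 ; two physical outputs holding the same tuples.
C_one = [t for t in B if t[0] == 0]
C_two = list(reversed(C_one))

# Under bag == multiset the two outputs are the same value;
# under "ordered bag" semantics they are different values.
same_as_bags = as_bag(C_one) == as_bag(C_two)
same_as_lists = C_one == C_two
print(same_as_bags, same_as_lists)
```

If FILTER only promises a bag, both C_one and C_two are valid answers, which is why the ordering produced by ORDER cannot be relied on downstream.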
Also, here is what I discussed with Santhosh earlier regarding:
A = FOREACH B {
GENERATE B.$0 * B.$1 ;
} ;
I think this is inappropriate because the operation is non-deterministic
by definition unless we have a notion of order on B. (Besides the fact that
we also don't have definitions for Bag x Bag operations like this.)
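
A small Python sketch of the problem, under an assumption of mine: that Bag x Bag multiplication would pair elements by position. Pig does not define this; positional_product and the sample columns are hypothetical.

```python
from collections import Counter

def positional_product(col0, col1):
    # Assumed (hypothetical) meaning of Bag x Bag: pair elements by position.
    return [a * b for a, b in zip(col0, col1)]

col0 = [1, 2, 3]            # stands in for B.$0
col1 = [10, 20, 30]         # stands in for B.$1, one enumeration order
col1_other = [30, 10, 20]   # the same bag, enumerated in another order

r1 = positional_product(col0, col1)
r2 = positional_product(col0, col1_other)

# The inputs are equal as bags, yet the results differ even as bags.
inputs_equal_as_bags = Counter(col1) == Counter(col1_other)
results_equal_as_bags = Counter(r1) == Counter(r2)
print(inputs_equal_as_bags, results_equal_as_bags)
```

So with unordered bags the result depends on an enumeration order that the semantics never fixes.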
Pi
On Sat, Jun 7, 2008 at 8:09 AM, Chris Olston <[EMAIL PROTECTED]> wrote:
> Yes, that's right -- it was *not* a typo. Pig "bags" are ordered.
>
> By the way, the word "table" is also problematic because Pig does not
> require uniform schemas across tuples. Usually "table" implies that all
> member tuples adhere to a given table-level schema.
>
> Bottom line is that conceptually there is one data type that encompasses
> what we currently refer to as "bag" and "table". As for a good name for this
> type, there has been much discussion but no satisfactory outcome. Perhaps
> "TupleList", but that doesn't have a nice ring to it :). Or we could leave
> it as "table" and add an asterisk explaining that it may have a nonuniform
> schema (the common case is probably that there *is* schema uniformity -- I
> would expect irregular schemas to be rare). Or ... ?
>
> -Chris
>
>
>
> On Jun 6, 2008, at 12:22 PM, Ted Dunning wrote:
>
>
>> I think bags are ordered as well, just as he said.
>>
>> The sentence you are mentioning is explaining why Chris thinks the word
>> bag is a bad one (because it implies unordered while the implementation is
>> ordered).
>>
>>
>> -----Original Message-----
>> From: Santhosh Srinivasan [EMAIL PROTECTED]
>> Sent: Fri 6/6/2008 10:23 AM
>> To: pig-user@incubator.apache.org
>> Subject: RE: Dealing with empty data bags
>>
>> Chris,
>>
>> Did you mean unordered when you said "A bag is an ordered multiset of
>> tuples"? Further down you say "because "bag" implies unordered".
>>
>> Santhosh
>>
>> -----Original Message-----
>> From: Chris Olston [EMAIL PROTECTED]
>> Sent: Friday, June 06, 2008 10:19 AM
>> To: pig-user@incubator.apache.org
>> Subject: Re: Dealing with empty data bags
>>
>> Prashanth,
>>
>> You bring up a very good point about bags vs. tables.
>>
>> A bag is an ordered multiset of tuples. A table is an ordered
>> multiset of tuples. (Ordered multiset is a fancy way of saying
>> "list", unless I'm overlooking something :)
>>
>> To my knowledge there is no difference between the two, semantically.
>>
>> In our *implementation* we have a special name for bags at the
>> outermost level of nesting: tables. And we treat tables differently
>> from nested bags in our implementation (at present, we parallelize
>> operations over tables, but do not parallelize operations over nested
>> bags.)
>>
>> The fact that the table/bag distinction percolated up to the user
>> level is probably a mistake --- there should only be 3 user-visible
>> types: table, tuple, atom.
>>
>> (I prefer the name "table" over "bag", because "bag" implies
>> unordered, when in fact in Pig our collections are ordered.)
>>
>> Anyone disagree?
>>
>> -Chris
>>
>>
>> On Jun 5, 2008, at 6:36 PM, Prashanth Pappu wrote:
>>
>>> Thanks Chris for the response.
>>>
>>> That brings me to a set of questions regarding empty and null tables/bags
>>> that I've been struggling with and hopefully one of you can resolve them
>>> for me.
>>>
>>> (a) I read that PIG has four data types - atom, tuple, bag, map. But, what
>>> is a table? Is it the same as bag? How are they different?
>>>
>>> (b) What is the result data type when we first load data into a variable?
>>> For example,
>>>
>>> a = load 'xyz' as (x,y,z);
>>> dump a;
>>>
>>> (1, 2, 3)
>>> (2, 4, 5)
>>>
>>> What is the data type of a? Is it a bag as in a = {(1,2,3), (2,4,5)}? Or is
>>> it just a set of tuples (a table) but not a bag? And, we have a
>>> representation for an empty bag (= {}), and an empty 'set of tuples' is
>>> simply null/empty?
>>>
>>> (c) I'm trying to understand the differences between bags and tables and
>>> verifying if we have defined the semantics to deal with them 'consistently'
>>> irrespective of whether they are empty or not. For example, reference my
>>> earlier email about an implementation 'bug' in PIG execution engine when
>>> using SPLIT on an empty table.
>>>
>>> Thanks in advance!
>>> Prashanth
>>>
>>> On Thu, Jun 5, 2008 at 4:08 PM, Chris Olston <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> It's not "buggy" or "incorrect", it's just different from the semantics
>>>> that you were hoping for. Group and COUNT each have simple, well-defined,
>>>> and correctly-implemented semantics. If you feed an empty table into group
>>>> it produces an empty table; Count over an empty table produces an empty
>>>> table -- hence their composition produces an empty tuple when given an
>>>> empty table.
>>>>
>>>> The question is whether one can construct a Pig program that gives the
>>>> semantics you want. Unfortunately off the top of my head the answer seems
>>>> to be 'no'. If that's the case we need to look at what needs to be
>>>> added/changed in the language to enable testing for empty outermost tables.
>>>> (If I'm overlooking something I'm sure one of my colleagues will chime in :)
>>>>
>>>> -Chris
>>>>
>>>>
>>>>
>>>> On Jun 5, 2008, at 3:31 PM, Prashanth Pappu wrote:
>>>>
>>>>> (a) I see a lot of places where PIG doesn't correctly deal with
>>>>> results that are empty bags.
>>>>>
>>>>> Here's an example - Counting Tuples. Let's say I want to count the number
>>>>> of tuples in 'b' (a subset of 'a'). I can do the following -
>>>>>
>>>>> a = load 'xyz' as (x,y,z);
>>>>> b = filter a by x==X;
>>>>> c = group b all;
>>>>> d = foreach c generate COUNT(b);
>>>>>
>>>>> Ideally, we want d to be (0) if b has no tuples and non-zero otherwise.
>>>>> Unfortunately, if b is empty, c is also empty! This is buggy because it
>>>>> causes d to be empty or null and not (0).
>>>>>
>>>>> Whereas, if b is empty, c should ideally be, c = (all, {}). Which will
>>>>> make d = (0).
>>>>>
>>>>> (b) Is there a different way of computing the number of tuples in b that
>>>>> will always (irrespective of whether b is empty or not) give the correct
>>>>> answer?
>>>>>
>>>>> (c) I also see that PIG supports data maps. But I haven't seen any
>>>>> examples that illustrate how to create or manipulate data maps. Is there
>>>>> any such documentation?
>>>>>
>>>>> thanks,
>>>>> Prashanth
>>>>>
>>>>>
>>>> --
>>>> Christopher Olston, Ph.D.
>>>> Sr. Research Scientist
>>>> Yahoo! Research