pig-user  

Re: Dealing with empty data bags

pi song
Fri, 06 Jun 2008 18:16:57 -0700

Question??

Is there any particular reason why we need a global notion of "order" at the
top level? I think most SQL users are already familiar with the idea that
their tables are not ordered. Relaxing the notion of order at the top level
has two benefits:

 - Plans at every level would follow the same rules, with no top-level
special case, which simplifies the implementation.
 - We can easily support N levels of nesting if we want to.

If users want "order", they just do "ORDER", but any operation after that
is not guaranteed to preserve the order. This is also consistent with the
SQL model. Whether we parallelize the job/sub-job or not should always be
based on the problem size (easily measured by input size).
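
To illustrate what I mean, here is a rough Python sketch (not Pig code). It
assumes a bag is modeled as a multiset (collections.Counter) and an ordered
result as a list; the function names order and filter_ are made up for the
example.

```python
from collections import Counter

def order(bag, key):
    """ORDER: the only operation whose result carries an order (a list)."""
    return sorted(bag.elements(), key=key)

def filter_(rows, pred):
    """FILTER: accepts a bag or an ordered list, always returns a plain bag."""
    items = rows.elements() if isinstance(rows, Counter) else rows
    return Counter(t for t in items if pred(t))

# A bag with a duplicate tuple: a multiset, no order implied.
a = Counter([(2, "x"), (0, "y"), (0, "y"), (1, "z")])
b = order(a, key=lambda t: t[0])      # ordered: a list
c = filter_(b, lambda t: t[0] == 0)   # back to an unordered bag
```

Under this model a user who needs order after the FILTER simply applies
ORDER again, which is what SQL users already expect.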

Pi


On Sat, Jun 7, 2008 at 11:01 AM, pi song <[EMAIL PROTECTED]> wrote:

> We should update the Pig Wiki to reflect this. I, for one, have always
> thought that our semantics were bag == multiset. The only operation that
> results in an "ordered bag" is "ORDER", and operations on an ordered bag
> are not closed over ordered bags. For example:
>
> B = ORDER A BY $0 ;
> C = FILTER B BY $0 == 0 ;
>
> The "FILTER" operator doesn't preserve ordered bag closure and outputs only
> a bag.
>
> Also, here is something I discussed with Santhosh earlier:
>
> A = FOREACH B {
>         GENERATE B.$0 * B.$1 ;
> } ;
>
> I think this is inappropriate because the operation is non-deterministic
> by definition unless we have a notion of order on B. (Besides the fact
> that we also don't have definitions for Bag x Bag operations like this.)
>
> Pi
>
>
>
> On Sat, Jun 7, 2008 at 8:09 AM, Chris Olston <[EMAIL PROTECTED]> wrote:
>
>> Yes, that's right -- it was *not* a typo. Pig "bags" are ordered.
>>
>> By the way, the word "table" is also problematic because Pig does not
>> require uniform schemas across tuples. Usually "table" implies that all
>> member tuples adhere to a given table-level schema.
>>
>> Bottom line is that conceptually there is one data type that encompasses
>> what we currently refer to as "bag" and "table". As for a good name for this
>> type, there has been much discussion but no satisfactory outcome. Perhaps
>> "TupleList", but that doesn't have a nice ring to it :). Or we could leave
>> it as "table" and add an asterisk explaining that it may have a nonuniform
>> schema (the common case is probably that there *is* schema uniformity -- I
>> would expect irregular schemas to be rare). Or ... ?
>>
>> -Chris
>>
>>
>>
>> On Jun 6, 2008, at 12:22 PM, Ted Dunning wrote:
>>
>>
>>> I think bags are ordered as well, just as he said.
>>>
>>> The sentence you are mentioning is explaining why Chris thinks the word
>>> bag is a bad one (because it implies unordered while the implementation is
>>> ordered).
>>>
>>>
>>> -----Original Message-----
>>> From: Santhosh Srinivasan [EMAIL PROTECTED]
>>> Sent: Fri 6/6/2008 10:23 AM
>>> To: pig-user@incubator.apache.org
>>> Subject: RE: Dealing with empty data bags
>>>
>>> Chris,
>>>
>>> Did you mean unordered when you said "A bag is an ordered multiset of
>>> tuples." Further down you say "because "bag" implies unordered".
>>>
>>> Santhosh
>>>
>>> -----Original Message-----
>>> From: Chris Olston [EMAIL PROTECTED]
>>> Sent: Friday, June 06, 2008 10:19 AM
>>> To: pig-user@incubator.apache.org
>>> Subject: Re: Dealing with empty data bags
>>>
>>> Prashanth,
>>>
>>> You bring up a very good point about bags vs. tables.
>>>
>>> A bag is an ordered multiset of tuples. A table is an ordered
>>> multiset of tuples. (Ordered multiset is a fancy way of saying
>>> "list", unless I'm overlooking something :)
>>>
>>> To my knowledge there is no difference between the two, semantically.
>>>
>>> In our *implementation* we have a special name for bags at the
>>> outermost level of nesting: tables. And we treat tables differently
>>> from nested bags in our implementation (at present, we parallelize
>>> operations over tables, but do not parallelize operations over nested
>>> bags.)
>>>
>>> The fact that the table/bag distinction percolated up to the user
>>> level is probably a mistake --- there should only be 3 user-visible
>>> types: table, tuple, atom.
>>>
>>> (I prefer the name "table" over "bag", because "bag" implies
>>> unordered, when in fact in Pig our collections are ordered.)
>>>
>>> Anyone disagree?
>>>
>>> -Chris
>>>
>>>
>>> On Jun 5, 2008, at 6:36 PM, Prashanth Pappu wrote:
>>>
>>>> Thanks Chris for the response.
>>>>
>>>> That brings me to a set of questions regarding empty and null tables/bags
>>>> that I've been struggling with and hopefully one of you can resolve them
>>>> for me.
>>>>
>>>> (a) I read that PIG has four data types - atom, tuple, bag, map. But, what
>>>> is a table? Is it the same as bag? How are they different?
>>>>
>>>> (b) What is the result data type when we first load data into a variable?
>>>> For example,
>>>>
>>>> a = load 'xyz' as (x,y,z);
>>>> dump a;
>>>>
>>>> (1, 2, 3)
>>>> (2, 4, 5)
>>>>
>>>> What is the data type of a? Is it a bag as in a = {(1,2,3), (2,4,5)}? Or is
>>>> it just a set of tuples (a table) but not a bag? And, we have a
>>>> representation for an empty bag (= {}), and an empty 'set of tuples' is
>>>> simply null/empty?
>>>>
>>>> (c) I'm trying to understand the differences between bags and tables and
>>>> verifying if we have defined the semantics to deal with them 'consistently'
>>>> irrespective of whether they are empty or not. For example, reference my
>>>> earlier email about an implementation 'bug' in PIG execution engine when
>>>> using SPLIT on an empty table.
>>>>
>>>> Thanks in advance!
>>>> Prashanth
>>>>
>>>> On Thu, Jun 5, 2008 at 4:08 PM, Chris Olston <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> It's not "buggy" or "incorrect", it's just different from the semantics
>>>>> that you were hoping for. GROUP and COUNT each have simple, well-defined,
>>>>> and correctly-implemented semantics. If you feed an empty table into GROUP
>>>>> it produces an empty table; COUNT over an empty table produces an empty
>>>>> table -- hence their composition produces an empty table when given an
>>>>> empty table.
>>>>>
>>>>> The question is whether one can construct a Pig program that gives the
>>>>> semantics you want. Unfortunately off the top of my head the answer seems
>>>>> to be 'no'. If that's the case we need to look at what needs to be
>>>>> added/changed in the language to enable testing for empty outermost
>>>>> tables. (If I'm overlooking something I'm sure one of my colleagues will
>>>>> chime in :)
>>>>>
>>>>> -Chris
>>>>>
>>>>>
>>>>>
>>>>> On Jun 5, 2008, at 3:31 PM, Prashanth Pappu wrote:
>>>>>
>>>>>> (a) I see a lot of places where PIG doesn't correctly deal with results
>>>>>> that are empty bags.
>>>>>>
>>>>>> Here's an example - Counting Tuples. Let's say I want to count the number
>>>>>> of tuples in 'b' (a subset of 'a'). I can do the following -
>>>>>>
>>>>>> a = load 'xyz' as (x,y,z);
>>>>>> b =  filter a by x==X;
>>>>>> c = group b all;
>>>>>> d = foreach c generate COUNT(b);
>>>>>>
>>>>>> Ideally, we want d to be (0) if b has no tuples and non-zero otherwise.
>>>>>> Unfortunately, if b is empty, c is also empty! This is buggy because it
>>>>>> causes d to be empty or null and not (0).
>>>>>>
>>>>>> Whereas, if b is empty, c should ideally be c = (all, {}), which will
>>>>>> make d = (0).
>>>>>>
>>>>>> (b) Is there a different way of computing the number of tuples in b that
>>>>>> will always (irrespective of whether b is empty or not) give the correct
>>>>>> answer?
>>>>>>
>>>>>> (c) I also see that PIG supports data maps. But I haven't seen any
>>>>>> examples that illustrate how to create or manipulate data maps. Is there
>>>>>> any such documentation?
>>>>>>
>>>>>> thanks,
>>>>>> Prashanth
>>>>>>
>>>>>>
>>>>> --
>>>>> Christopher Olston, Ph.D.
>>>>> Sr. Research Scientist
>>>>> Yahoo! Research
>>>>>
>>>>>
>>>>>
>>>>>
>
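
To make the empty-table case at the bottom of the thread concrete, here is a
rough Python model (not Pig code) of "group b all" followed by COUNT. It
assumes a table is a plain list of tuples. group_all mirrors the behaviour
described in the thread (empty in, empty out), while group_all_keep_empty is
a hypothetical variant that always emits (all, {}). All three function names
are invented for the illustration.

```python
def group_all(rows):
    """Behaviour described in the thread: one ('all', bag) tuple,
    or no tuples at all when the input is empty."""
    return [("all", list(rows))] if rows else []

def group_all_keep_empty(rows):
    """Hypothetical variant: always emit ('all', bag), even for an empty bag."""
    return [("all", list(rows))]

def count_each(groups):
    """FOREACH group GENERATE COUNT(bag): one count tuple per group."""
    return [(len(bag),) for _key, bag in groups]

empty_b = []
current = count_each(group_all(empty_b))              # no output tuples
wanted = count_each(group_all_keep_empty(empty_b))    # a single (0,) tuple
```

In this model COUNT itself is fine. The zero is lost because GROUP emits no
tuple for COUNT to run over, which is why the fix would have to be in how
empty outermost tables are grouped or tested.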