pig-user  

Re: Dealing with empty data bags

Prashanth Pappu
Fri, 06 Jun 2008 12:07:12 -0700

Chris,

Thanks for the clarification. I think one reason users notice the
distinction between the outer-most tables and other tables is the
difference in representation.

- a tuple is always enclosed in '(' and ')'.
- a map is always enclosed in '[' and ']'
- an inner table is always enclosed in '{' and '}'

but an outer table has no enclosing braces!
I think enclosing even the outer-most tables in '{' and '}' will make it
clear that all tables are identical, at least semantically.

For example,
> a = load '/xy' as (x,y);
> dump a;
(1,2) ===> should be {(1,2)}

and
> b = filter a by x==3;
> dump b;
[Nothing] ===> should be {}
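
As a rough illustration of the proposed representation (plain Python, not
Pig code; the `render` name and the `key#value` form used for map entries
are assumptions made for this sketch), every table, outer or inner, would
print inside braces:

```python
def render(value):
    """Render a nested value with the bracket conventions listed above."""
    if isinstance(value, tuple):   # tuple -> ( ... )
        return "(" + ",".join(render(v) for v in value) + ")"
    if isinstance(value, dict):    # map -> [ ... ]; key#value is an assumed entry format
        return "[" + ",".join(f"{k}#{render(v)}" for k, v in value.items()) + "]"
    if isinstance(value, list):    # table (outer or inner) -> { ... }
        return "{" + ",".join(render(v) for v in value) + "}"
    return str(value)              # atom


a = [(1, 2)]   # stands in for the loaded table
b = []         # stands in for the filtered, empty table
print(render(a))   # {(1,2)}
print(render(b))   # {}
```

With one rendering rule for all tables, an empty result is visibly an empty
table rather than no output at all.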

This will definitely make things a lot easier to understand. And this also
raises a second question -

Why are all functions defined over tables like COUNT, SUM, AVG etc. usable
only from a FOREACH statement?
For example, to count the number of tuples in a table, we currently use  -

> a = load '/xy' as (x,y);
> b = group a all;
> c = foreach b generate COUNT(a);
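
A minimal model of this three-step pattern, in plain Python rather than Pig
Latin, also shows the empty-table behaviour described earlier in this
thread. The function names `group_all` and `foreach_count` are hypothetical,
and the model only mirrors the semantics as stated in the thread (one output
tuple per group, so no groups means no output); it is not Pig's
implementation.

```python
def group_all(table):
    """Model of `group ... all`: emits one (key, bag) tuple per group."""
    groups = {}
    for row in table:
        groups.setdefault("all", []).append(row)
    # An empty input creates no group, so the result is an empty table.
    return [(key, bag) for key, bag in groups.items()]


def foreach_count(grouped):
    """Model of `foreach ... generate COUNT(bag)`: one output tuple per input tuple."""
    return [(len(bag),) for _key, bag in grouped]


a = [(1, 2), (3, 4)]
print(foreach_count(group_all(a)))    # [(2,)]
print(foreach_count(group_all([])))   # []
```

Under this model the count of an empty table is an empty table, not (0),
because the foreach never sees a tuple to generate from.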

Now that we know that a is a table like any other, I'm sure many users
wonder why we can't simply use

> a = load '/xy' as (x,y);
> c = COUNT(a);

And, I think I now understand the reason - because operations over
outer-most tables are parallelized and operations over inner tables are not.
So, the above operation would be ok, if we figure out a way to automatically
parallelize table operations (like COUNT(a)).
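
One way such automatic parallelization could look, sketched in plain Python
under stated assumptions: the table is split into partitions, each partition
is counted independently, and the partial counts are summed. The names
`partition` and `parallel_count`, the round-robin split, and the use of a
thread pool are all choices made for this illustration; this is not how Pig
executes anything.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(table, n):
    """Split a table (a list of tuples) into n round-robin partitions."""
    return [table[i::n] for i in range(n)]


def parallel_count(table, n_partitions=4):
    """Count tuples by counting each partition separately, then summing."""
    parts = partition(table, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(len, parts))
    return sum(partial_counts)


print(parallel_count([(1, 2), (3, 4), (5, 6)]))   # 3
print(parallel_count([]))                         # 0
```

Because the final step sums partial counts, an empty table yields 0 in this
sketch instead of an empty result.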

But I agree, the fact that table operations (like COUNT, AVG etc.) cannot be
used on outer-most tables (at least currently) shouldn't stop us from
thinking that even outer-most tables are simply tables. The change in
representation for outer-most tables will help clear up the confusion.

Prashanth

On Fri, Jun 6, 2008 at 10:18 AM, Chris Olston <[EMAIL PROTECTED]> wrote:

> Prashanth,
>
> You bring up a very good point about bags vs. tables.
>
> A bag is an ordered multiset of tuples. A table is an ordered multiset of
> tuples. (Ordered multiset is a fancy way of saying "list", unless I'm
> overlooking something :)
>
> To my knowledge there is no difference between the two, semantically.
>
> In our *implementation* we have a special name for bags at the outermost
> level of nesting: tables. And we treat tables differently from nested bags
> in our implementation (at present, we parallelize operations over tables,
> but do not parallelize operations over nested bags.)
>
> The fact that the table/bag distinction percolated up to the user level is
> probably a mistake --- there should only be 3 user-visible types: table,
> tuple, atom.
>
> (I prefer the name "table" over "bag", because "bag" implies unordered,
> when in fact in Pig our collections are ordered.)
>
> Anyone disagree?
>
> -Chris
>
>
>
> On Jun 5, 2008, at 6:36 PM, Prashanth Pappu wrote:
>
>> Thanks, Chris, for the response.
>>
>> That brings me to a set of questions regarding empty and null tables/bags
>> that I've been struggling with; hopefully one of you can resolve them for
>> me.
>>
>> (a) I read that PIG has four data types - atom, tuple, bag, map. But, what
>> is a table? Is it the same as bag? How are they different?
>>
>> (b) What is the result data type when we first load data into a variable?
>> For example,
>>
>> > a = load 'xyz' as (x,y,z);
>> > dump a;
>>
>> (1, 2, 3)
>> (2, 4, 5)
>>
>> What is the data type of a? Is it a bag, as in a = {(1,2,3), (2,4,5)}? Or is
>> it just a set of tuples (a table) but not a bag? And if we have a
>> representation for an empty bag (= {}), is an empty 'set of tuples'
>> simply null/empty?
>>
>> (c) I'm trying to understand the differences between bags and tables, and
>> to verify that we have defined the semantics to deal with them
>> 'consistently', irrespective of whether they are empty or not. For example,
>> see my earlier email about an implementation 'bug' in the PIG execution
>> engine when using SPLIT on an empty table.
>>
>> Thanks in advance!
>> Prashanth
>>
>> On Thu, Jun 5, 2008 at 4:08 PM, Chris Olston <[EMAIL PROTECTED]>
>> wrote:
>>
>>> It's not "buggy" or "incorrect", it's just different from the semantics
>>> that you were hoping for. Group and COUNT each have simple, well-defined,
>>> and correctly-implemented semantics. If you feed an empty table into group
>>> it produces an empty table; COUNT over an empty table produces an empty
>>> table -- hence their composition produces an empty table when given an
>>> empty table.
>>>
>>> The question is whether one can construct a Pig program that gives the
>>> semantics you want. Unfortunately, off the top of my head, the answer seems
>>> to be 'no'. If that's the case we need to look at what needs to be
>>> added/changed in the language to enable testing for empty outermost tables.
>>> (If I'm overlooking something I'm sure one of my colleagues will chime in :)
>>>
>>> -Chris
>>>
>>>
>>>
>>> On Jun 5, 2008, at 3:31 PM, Prashanth Pappu wrote:
>>>
>>>> (a) I see a lot of places where PIG doesn't correctly deal with results
>>>> that are empty bags.
>>>>
>>>> Here's an example - Counting Tuples. Let's say I want to count the number of
>>>> tuples in 'b' (a subset of 'a'). I can do the following -
>>>>
>>>> a = load 'xyz' as (x,y,z);
>>>> b = filter a by x==X;
>>>> c = group b all;
>>>> d = foreach c generate COUNT(b);
>>>>
>>>> Ideally, we want d to be (0) if b has no tuples and non-zero otherwise.
>>>> Unfortunately, if b is empty, c is also empty! This is buggy because it
>>>> causes d to be empty or null and not (0).
>>>>
>>>> Whereas, if b is empty, c should ideally be c = (all, {}), which will make
>>>> d = (0).
>>>>
>>>> (b) Is there a different way of computing the number of tuples in b that
>>>> will always (irrespective of whether b is empty or not) give the correct
>>>> answer?
>>>>
>>>> (c) I also see that PIG supports data maps. But I haven't seen any examples
>>>> that illustrate how to create or manipulate data maps. Is there any such
>>>> documentation?
>>>>
>>>> thanks,
>>>> Prashanth
>>>>
>>>>
>>> --
>>> Christopher Olston, Ph.D.
>>> Sr. Research Scientist
>>> Yahoo! Research
>>>
>>>
>>>
>>>
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research