pig-user  

Re: Dealing with empty data bags

Prashanth Pappu
Thu, 05 Jun 2008 18:36:43 -0700

Thanks Chris for the response.

That brings me to a set of questions regarding empty and null tables/bags
that I've been struggling with and hopefully one of you can resolve them for
me.

(a) I read that PIG has four data types - atom, tuple, bag, map. But, what
is a table? Is it the same as bag? How are they different?

(b) What is the result data type when we first load data into a variable?
For example,

> a = load 'xyz' as (x,y,z);
> dump a;
(1, 2, 3)
(2, 4, 5)

What is the data type of a? Is it a bag, as in a = {(1,2,3), (2,4,5)}? Or is
it just a set of tuples (a table) but not a bag? And if we have a
representation for an empty bag (= {}), is an empty 'set of tuples'
simply null/empty?

(c) I'm trying to understand the differences between bags and tables, and
to verify whether we have defined the semantics to deal with them
'consistently' irrespective of whether they are empty or not. For example,
see my earlier email about an implementation 'bug' in the PIG execution
engine when using SPLIT on an empty table.
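
To make the question concrete, here is a rough Python model (not Pig code) of
the behaviour as I understand it from the exchange quoted below. A relation is
modelled as a plain Python list of tuples, and all three function names are
made up for illustration; group_all_keep_empty is a hypothetical alternative,
not something PIG provides.

```python
# Toy model of the semantics discussed in this thread; this is NOT Pig code.
# A relation ("table") is modelled as a plain Python list of tuples.

def group_all(relation):
    """Model of `group b all` as described: empty input gives empty output."""
    if not relation:
        return []                      # no input tuples, so no output tuple
    return [("all", list(relation))]   # one tuple: (group key, bag of tuples)

def count_each(grouped):
    """Model of `foreach c generate COUNT(b)`: one count per input tuple."""
    return [(len(bag),) for _key, bag in grouped]

def group_all_keep_empty(relation):
    """Hypothetical alternative: always emit ('all', bag), even if bag is empty."""
    return [("all", list(relation))]

b = [(1, 2, 3), (2, 4, 5)]
print(count_each(group_all(b)))              # [(2,)]
print(count_each(group_all([])))             # []
print(count_each(group_all_keep_empty([])))  # [(0,)]
```

In this model the count step never runs on an empty input because the grouping
step emits no tuple at all, which is why the result is empty rather than (0).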

Thanks in advance!
Prashanth

On Thu, Jun 5, 2008 at 4:08 PM, Chris Olston <[EMAIL PROTECTED]> wrote:

> It's not "buggy" or "incorrect", it's just different from the semantics
> that you were hoping for. Group and COUNT each have simple, well-defined,
> and correctly-implemented semantics. If you feed an empty table into group
> it produces an empty table; COUNT over an empty table produces an empty
> table -- hence their composition produces an empty table when given an empty
> table.
>
> The question is whether one can construct a Pig program that gives the
> semantics you want. Unfortunately off the top of my head the answer seems to
> be 'no'. If that's the case we need to look at what needs to be
> added/changed in the language to enable testing for empty outermost tables.
> (If I'm overlooking something I'm sure one of my colleagues will chime in :)
>
> -Chris
>
>
>
> On Jun 5, 2008, at 3:31 PM, Prashanth Pappu wrote:
>
>  (a) I see a lot of places where PIG doesn't correctly deal with
>> results that are empty bags.
>>
>> Here's an example - Counting Tuples. Let's say I want to count the number of
>> tuples in 'b' (a subset of 'a'). I can do the following -
>>
>> a = load 'xyz' as (x,y,z);
>> b =  filter a by x==X;
>> c = group b all;
>> d = foreach c generate COUNT(b);
>>
>> Ideally, we want d to be (0) if b has no tuples and non-zero otherwise.
>> Unfortunately, if b is empty, c is also empty! This is buggy because it
>> causes d to be empty or null and not (0).
>>
>> Whereas, if b is empty, c should ideally be c = (all, {}), which would
>> make d = (0).
>>
>> (b) Is there a different way of computing the number of tuples in b that
>> will always (irrespective of whether b is empty or not) give the correct
>> answer?
>>
>> (c) I also see that PIG supports data maps. But I haven't seen any
>> examples that illustrate how to create or manipulate data maps. Is there
>> any such documentation?
>>
>> thanks,
>> Prashanth
>>
>
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
>
>
>