pig-user  

Re: Dealing with empty data bags

Chris Olston
Thu, 05 Jun 2008 16:09:12 -0700

It's not "buggy" or "incorrect"; it's just different from the semantics that you were hoping for. GROUP and COUNT each have simple, well-defined, and correctly implemented semantics. If you feed an empty table into GROUP, it produces an empty table; COUNT over an empty table produces an empty table. Hence their composition produces an empty table when given an empty table.
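
To make the composition concrete, here is a rough Python model of the semantics described above. It is only a sketch, not how Pig is implemented: the function names `group_all` and `foreach_count` and the sample rows are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def group_all(rows):
    """Model of `group b all`: one (key, bag) tuple per distinct key.

    With `all`, every row falls under the single key "all", so a non-empty
    input yields exactly one output tuple and an empty input yields none.
    """
    groups = defaultdict(list)
    for row in rows:
        groups["all"].append(row)
    return list(groups.items())


def foreach_count(grouped):
    """Model of `foreach c generate COUNT(b)`: one output tuple per input tuple."""
    return [(len(bag),) for _key, bag in grouped]


# Hypothetical sample data standing in for the filtered table b.
non_empty = [(1, "p", "q"), (1, "r", "s")]

print(foreach_count(group_all(non_empty)))  # [(2,)]
print(foreach_count(group_all([])))         # []
```

In this model nothing special happens for empty input: grouping emits no tuple because there is no key to emit, and the foreach then has nothing to iterate over, so the result is an empty table rather than (0).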

The question is whether one can construct a Pig program that gives the semantics you want. Unfortunately, off the top of my head, the answer seems to be 'no'. If that's the case, we need to look at what needs to be added or changed in the language to enable testing for empty outermost tables. (If I'm overlooking something, I'm sure one of my colleagues will chime in :)

-Chris


On Jun 5, 2008, at 3:31 PM, Prashanth Pappu wrote:

(a) I see a lot of places where Pig doesn't correctly deal with
results that are empty bags.

Here's an example: counting tuples. Let's say I want to count the number of
tuples in 'b' (a subset of 'a'). I can do the following:

a = load 'xyz' as (x,y,z);
b = filter a by x==X;
c = group b all;
d = foreach c generate COUNT(b);

Ideally, we want d to be (0) if b has no tuples and non-zero otherwise. Unfortunately, if b is empty, c is also empty! This is buggy because it
causes d to be empty or null and not (0).

Instead, if b is empty, c should ideally be (all, {}), which would make
d = (0).

(b) Is there a different way of computing the number of tuples in b that will always give the correct answer, whether or not b is empty?

(c) I also see that Pig supports data maps, but I haven't seen any examples that illustrate how to create or manipulate them. Is there any such
documentation?

thanks,
Prashanth

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research