pig-user  

Re: Dealing with empty data bags

Chris Olston
Fri, 06 Jun 2008 15:10:55 -0700

Yes, that's right -- it was *not* a typo. Pig "bags" are ordered.

By the way, the word "table" is also problematic because Pig does not require uniform schemas across tuples. Usually "table" implies that all member tuples adhere to a given table-level schema.

Bottom line is that conceptually there is one data type that encompasses what we currently refer to as "bag" and "table". As for a good name for this type, there has been much discussion but no satisfactory outcome. Perhaps "TupleList", but that doesn't have a nice ring to it :). Or we could leave it as "table" and add an asterisk explaining that it may have a nonuniform schema (the common case is probably that there *is* schema uniformity -- I would expect irregular schemas to be rare). Or ... ?

-Chris


On Jun 6, 2008, at 12:22 PM, Ted Dunning wrote:


I think bags are ordered as well, just as he said.

The sentence you are mentioning is explaining why Chris thinks the word bag is a bad one (because it implies unordered while the implementation is ordered).


-----Original Message-----
From: Santhosh Srinivasan [EMAIL PROTECTED]
Sent: Fri 6/6/2008 10:23 AM
To: pig-user@incubator.apache.org
Subject: RE: Dealing with empty data bags

Chris,

Did you mean "unordered" when you said "A bag is an ordered multiset of
tuples"? Further down you say 'because "bag" implies unordered'.

Santhosh

-----Original Message-----
From: Chris Olston [EMAIL PROTECTED]
Sent: Friday, June 06, 2008 10:19 AM
To: pig-user@incubator.apache.org
Subject: Re: Dealing with empty data bags

Prashanth,

You bring up a very good point about bags vs. tables.

A bag is an ordered multiset of tuples. A table is an ordered
multiset of tuples. (Ordered multiset is a fancy way of saying
"list", unless I'm overlooking something :)

To my knowledge there is no difference between the two, semantically.
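
A minimal Python sketch of the "ordered multiset is a list" point above. Python is used only for illustration and is not Pig, and the tuple values are made up. A plain list keeps duplicates and distinguishes orderings, whereas an unordered multiset (here `collections.Counter`) ignores order.

```python
from collections import Counter

# Hypothetical tuples; the values are illustrative only.
bag = [(1, 2, 3), (2, 4, 5), (1, 2, 3)]        # duplicates kept, order kept
reordered = [(2, 4, 5), (1, 2, 3), (1, 2, 3)]  # same tuples, different order

# Viewed as unordered multisets, the two collections are equal...
same_multiset = Counter(bag) == Counter(reordered)

# ...but viewed as ordered multisets (lists), they are different values.
same_list = bag == reordered
```

Under this reading, "bag" and "table" both correspond to the list view, not the `Counter` view.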

In our *implementation* we have a special name for bags at the
outermost level of nesting: tables. And we treat tables differently
from nested bags in our implementation (at present, we parallelize
operations over tables, but do not parallelize operations over nested
bags.)

The fact that the table/bag distinction percolated up to the user
level is probably a mistake --- there should only be 3 user-visible
types: table, tuple, atom.

(I prefer the name "table" over "bag", because "bag" implies
unordered, when in fact in Pig our collections are ordered.)

Anyone disagree?

-Chris


On Jun 5, 2008, at 6:36 PM, Prashanth Pappu wrote:

Thanks Chris for the response.

That brings me to a set of questions regarding empty and null tables/bags
that I've been struggling with and hopefully one of you can resolve them
for me.

(a) I read that PIG has four data types - atom, tuple, bag, map. But what
is a table? Is it the same as a bag? How are they different?

(b) What is the result data type when we first load data into a variable?
For example,

a = load 'xyz' as (x,y,z);
dump a;
(1, 2, 3)
(2, 4, 5)

What is the data type of a? Is it a bag, as in a = {(1,2,3), (2,4,5)}? Or
is it just a set of tuples (a table) but not a bag? And is it the case that
we have a representation for an empty bag (= {}), while an empty 'set of
tuples' is simply null/empty?

(c) I'm trying to understand the differences between bags and tables and
to verify whether we have defined the semantics to deal with them
'consistently', irrespective of whether they are empty or not. For example,
see my earlier email about an implementation 'bug' in the PIG execution
engine when using SPLIT on an empty table.

Thanks in advance!
Prashanth

On Thu, Jun 5, 2008 at 4:08 PM, Chris Olston <[EMAIL PROTECTED]> wrote:

It's not "buggy" or "incorrect", it's just different from the semantics
that you were hoping for. GROUP and COUNT each have simple, well-defined,
and correctly-implemented semantics. If you feed an empty table into GROUP
it produces an empty table; COUNT over an empty table produces an empty
table -- hence their composition produces an empty table when given an
empty table.
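
The composition described above can be modeled in a few lines of Python. This is only a sketch of the semantics as stated in this message, not Pig's implementation; the function names `group_all` and `foreach_count` are hypothetical.

```python
def group_all(table):
    """Model of 'group b all' as described in this thread: a non-empty
    input yields one ('all', bag) tuple; an empty input yields nothing."""
    return [("all", list(table))] if table else []


def foreach_count(grouped):
    """Model of 'foreach c generate COUNT(b)': emits one output tuple per
    input tuple, so an empty input gives an empty output."""
    return [(len(bag),) for _key, bag in grouped]


b_nonempty = [(1, 2, 3), (1, 4, 5)]  # hypothetical filter result
b_empty = []                         # filter matched nothing

d_nonempty = foreach_count(group_all(b_nonempty))
d_empty = foreach_count(group_all(b_empty))
```

In this model `d_nonempty` holds a single count tuple, while `d_empty` is an empty table: there is no row from which a zero could be reported.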

The question is whether one can construct a Pig program that gives the
semantics you want. Unfortunately, off the top of my head the answer seems
to be 'no'. If that's the case we need to look at what needs to be
added/changed in the language to enable testing for empty outermost tables.
(If I'm overlooking something I'm sure one of my colleagues will chime in :)

-Chris



On Jun 5, 2008, at 3:31 PM, Prashanth Pappu wrote:

(a) I see a lot of places where PIG doesn't correctly deal with results
that are empty bags.

Here's an example - counting tuples. Let's say I want to count the number
of tuples in 'b' (a subset of 'a'). I can do the following -

a = load 'xyz' as (x,y,z);
b = filter a by x==X;
c = group b all;
d = foreach c generate COUNT(b);

Ideally, we want d to be (0) if b has no tuples and non-zero otherwise.
Unfortunately, if b is empty, c is also empty! This is buggy because it
causes d to be empty or null and not (0).

Whereas, if b is empty, c should ideally be c = (all, {}), which will make
d = (0).
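
For comparison, here is a small Python model of the behavior requested above, under the assumption that grouping with 'all' always emits exactly one ('all', bag) tuple, even for empty input. The function names are hypothetical, and according to this thread Pig does not behave this way.

```python
def group_all_always(table):
    """Hypothetical variant of 'group b all' that always emits exactly one
    ('all', bag) tuple, with an empty bag when the input is empty."""
    return [("all", list(table))]


def foreach_count(grouped):
    """Model of 'foreach c generate COUNT(b)': one count per input tuple."""
    return [(len(bag),) for _key, bag in grouped]


c = group_all_always([])  # the single tuple ('all', empty bag)
d = foreach_count(c)      # one tuple holding the count of the empty bag
```

With this variant the count of an empty input is reported as a zero row instead of disappearing.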

(b) Is there a different way of computing the number of tuples in b that
will always (irrespective of whether b is empty or not) give the correct
answer?

(c) I also see that PIG supports data maps. But I haven't seen any examples
that illustrate how to create or manipulate data maps. Is there any such
documentation?

thanks,
Prashanth


--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research



