Ted Dunning
Thu, 05 Jun 2008 18:45:34 -0700
I don't think that this is the correct semantics if Pig intends to be
set-theoretically correct.
What would the key be on this one record? An empty bag? But what if two input
records have an empty bag as key?
There are NO correct members of an empty set. It is empty. The empty set of
records has no records in it, not one.
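To make the point concrete, here is a minimal Python sketch (a model of the semantics, not Pig itself; `group_by` is a hypothetical helper). It treats grouping as a partition of the input, so a group exists only if at least one input record maps to it. Under that reading an empty input yields zero groups, not one group holding an empty bag.

```python
from collections import defaultdict

def group_by(records, key_fn):
    """Partition records by key; a group exists only if some record maps to it."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_fn(rec)].append(rec)
    return dict(groups)

# GROUP ALL modelled (as an assumption) as grouping on a constant key.
non_empty = group_by([(1, 'a'), (2, 'b')], lambda rec: 'all')
empty = group_by([], lambda rec: 'all')

print(non_empty)  # {'all': [(1, 'a'), (2, 'b')]}
print(empty)      # {}
```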
-----Original Message-----
From: Chris Olston [EMAIL PROTECTED]
Sent: Thu 6/5/2008 6:06 PM
To: pig-user@incubator.apache.org
Subject: Re: Dealing with empty data bags
Probably the best fix is to redefine GROUP ALL so that in all cases
it outputs a table with exactly one record. In the case of an empty
input table it would produce an output record containing an empty
bag. Is that what you have in mind, Olga?
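A minimal sketch of this proposal in Python (a model of the intended semantics, not Pig code; `group_all` and `count` are hypothetical stand-ins for GROUP ALL and COUNT):

```python
def group_all(records):
    """Proposed GROUP ALL: always return exactly one (key, bag) record."""
    return [('all', list(records))]

def count(bag):
    """Stand-in for COUNT: number of tuples in the bag."""
    return len(bag)

# With an empty input, the single output record carries an empty bag,
# so the count is 0 rather than missing.
d_empty = [(count(bag),) for _key, bag in group_all([])]
d_full = [(count(bag),) for _key, bag in group_all([(1, 2, 3), (4, 5, 6)])]

print(d_empty)  # [(0,)]
print(d_full)   # [(2,)]
```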
-Chris
On Jun 5, 2008, at 4:05 PM, Olga Natkovich wrote:
> I agree with you about the group. Could you please open a JIRA
> about it?
> I don't think there is a workaround for this issue.
>
> Pig does have limited support for maps. None of the existing
> expressions/operators create a map. The only way to get a map is to
> have them in your input data or for your UDF to generate them. If you
> do have a map, you can retrieve individual values as follows:
>
> A = load 'data' as (map);
> B = foreach A generate map#'key1', map#'key2' ...
>
> where key1 and key2 are keys in the map.
>
> Olga
>
>> -----Original Message-----
>> From: [EMAIL PROTECTED]
>> [EMAIL PROTECTED] On Behalf Of Prashanth Pappu
>> Sent: Thursday, June 05, 2008 3:31 PM
>> To: pig-user@incubator.apache.org
>> Subject: Dealing with empty data bags
>>
>> (a) I see a lot of places where PIG doesn't correctly
>> deal with results that are empty bags.
>>
>> Here's an example - Counting Tuples. Let's say I want to
>> count the number of tuples in 'b' (a subset of 'a'). I can do
>> the following -
>>
>> a = load 'xyz' as (x,y,z);
>> b = filter a by x==X;
>> c = group b all;
>> d = foreach c generate COUNT(b);
>>
>> Ideally, we want d to be (0) if b has no tuples and non-zero
>> otherwise.
>> Unfortunately, if b is empty, c is also empty! This is buggy
>> because it causes d to be empty or null and not (0).
>>
>> Whereas, if b is empty, c should ideally be, c = (all, {}).
>> Which will make d = (0).
>>
>> (b) Is there a different way of computing the number of
>> tuples in b that will always (irrespective of whether b is
>> empty or not) give the correct answer?
>>
>> (c) I also see that PIG supports data maps. But I haven't
>> seen any examples that illustrate how to create or manipulate
>> data maps. Is there any such documentation?
>>
>> thanks,
>> Prashanth
>>
--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research