Prashanth Pappu
Thu, 05 Jun 2008 18:56:56 -0700
>> I don't think that this is the correct semantics if pig intends to be
>> set-theoretically correct. What would the key be on this one record?

Wouldn't it just be the atom 'all'?
> dump a;
(1,2)
> b = group a all;
> dump b;
(all, {(1,2)})
> dump a;
[Empty]
> b = group a all;
> dump b;
(all, {}) =====> Consistent irrespective of whether a is empty or not.
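
To make the difference concrete, here is a rough Python model of the two behaviours (this is not Pig code; the function names group_all_current, group_all_proposed and count_per_group are made up for illustration, and the "current" behaviour is only what was reported in this thread):

```python
# Toy model of GROUP ... ALL followed by COUNT; this is not Pig code.
# All function names here are hypothetical and exist only for illustration.

def group_all_current(rows):
    """Behaviour reported in this thread: an empty input gives an empty output."""
    rows = list(rows)
    return [("all", rows)] if rows else []


def group_all_proposed(rows):
    """Proposed behaviour: always exactly one record, keyed by the atom 'all'."""
    return [("all", list(rows))]


def count_per_group(grouped):
    """Stand-in for: d = foreach c generate COUNT(b);"""
    return [(len(bag),) for _key, bag in grouped]


print(group_all_proposed([(1, 2)]))                   # [('all', [(1, 2)])]
print(group_all_proposed([]))                         # [('all', [])]
print(count_per_group(group_all_proposed([])))        # [(0,)]
print(count_per_group(group_all_current([])))         # []
```

With the proposed semantics the count over an empty input comes out as (0); with the behaviour reported here it comes out as no record at all.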
Prashanth
>
>
>
> -----Original Message-----
> From: Chris Olston [EMAIL PROTECTED]
> Sent: Thu 6/5/2008 6:06 PM
> To: pig-user@incubator.apache.org
> Subject: Re: Dealing with empty data bags
>
> Probably the best fix is to redefine GROUP ALL so that in all cases
> it outputs a table with exactly one record. In the case of an empty
> input table it would produce an output record containing an empty
> bag. Is that what you have in mind, Olga?
>
> -Chris
>
>
> On Jun 5, 2008, at 4:05 PM, Olga Natkovich wrote:
>
> > I agree with you about the group. Could you please open a JIRA
> > about it?
> > I don't think there is a workaround for this issue.
> >
> > Pig does have limited support for maps. None of the existing
> > expressions/operators create a map. The only way to get a map is to
> > have it in your input data or for your UDF to generate it. If you do
> > have a map, you can retrieve individual values as follows:
> >
> > A = load 'data' as (map);
> > B = foreach A generate map#'key1', map#'key2' ...
> >
> > where key1 and key2 are keys in the map.
> >
> > Olga
> >
> >> -----Original Message-----
> >> From: [EMAIL PROTECTED]
> >> [EMAIL PROTECTED] On Behalf Of Prashanth Pappu
> >> Sent: Thursday, June 05, 2008 3:31 PM
> >> To: pig-user@incubator.apache.org
> >> Subject: Dealing with empty data bags
> >>
> >> (a) I see a lot of places where PIG doesn't correctly
> >> deal with results that are empty bags.
> >>
> >> Here's an example - Counting Tuples. Let's say I want to
> >> count the number of tuples in 'b' (a subset of 'a'). I can do
> >> the following -
> >>
> >> a = load 'xyz' as (x,y,z);
> >> b = filter a by x==X;
> >> c = group b all;
> >> d = foreach c generate COUNT(b);
> >>
> >> Ideally, we want d to be (0) if b has no tuples and non-zero
> >> otherwise.
> >> Unfortunately, if b is empty, c is also empty! This is buggy
> >> because it causes d to be empty or null and not (0).
> >>
> >> Instead, if b is empty, c should ideally be (all, {}),
> >> which would make d = (0).
> >>
> >> (b) Is there a different way of computing the number of
> >> tuples in b that will always (irrespective of whether b is
> >> empty or not) give the correct answer?
> >>
> >> (c) I also see that PIG supports data maps. But I haven't
> >> seen any examples that illustrate how to create or manipulate
> >> data maps. Is there any such documentation?
> >>
> >> thanks,
> >> Prashanth
> >>
>
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
>