pig-user  

RE: Spaces (' ') in data atoms

Ted Dunning
Thu, 05 Jun 2008 11:50:29 -0700

Is it possible to ask cogroup to group on a function such as a cast to integer?

I think that cogroup is pretty strongly wedded to eq, but I would also think 
that if there is some sort of way to coerce the group key to a desired type, 
then you can force eq to give you the behavior you want.
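
To make the idea concrete, here is a rough sketch in Python rather than Pig Latin. It is only a model of the behavior discussed in this thread, not Pig's implementation: the cogroup() helper, its coerce parameter, and the sample rows (reconstructed from the dump output quoted below) are all assumptions made for illustration.

```python
from collections import defaultdict

def cogroup(left, left_key, right, right_key, coerce=lambda v: v):
    """Group two lists of tuples by key, comparing keys with plain equality
    after passing them through coerce() (hypothetical helper)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[coerce(row[left_key])][0].append(row)
    for row in right:
        groups[coerce(row[right_key])][1].append(row)
    return dict(groups)

# Every atom arrives as a string; some carry leading/trailing spaces.
a = [("1", " 2 "), ("2", " 3 "), ("3", "4"), ("4", "5")]
b = list(a)

# String keys: ' 2 ' and '2' land in different groups.
as_strings = cogroup(a, 0, b, 1)

# Integer keys: int() ignores surrounding whitespace, so the groups merge.
as_ints = cogroup(a, 0, b, 1, coerce=int)

# Keys present on both sides, i.e. what an INNER cogroup would keep.
matched = sorted(k for k, (l, r) in as_ints.items() if l and r)
```

With string keys only '4' appears on both sides; with the keys coerced to integers, 2, 3 and 4 all match.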

Pig isn't the first language to have confusion between objects and their string 
representation.  I pretty commonly have unit tests that fail with messages like 
"Expected 5 but got 5" where one 5 is 5 and the other is "5". 

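A minimal Python illustration of that kind of failure message. The check_equal helpers are hypothetical and not taken from any particular test framework; the point is only that formatting with str() hides the type difference while repr() shows it.

```python
def check_equal(expected, actual):
    """Return None on success, else a message formatted with str()."""
    if expected == actual:
        return None
    return "Expected %s but got %s" % (expected, actual)

def check_equal_repr(expected, actual):
    """Same check, but the message uses repr() so strings show their quotes."""
    if expected == actual:
        return None
    return "Expected %r but got %r" % (expected, actual)

msg = check_equal(5, "5")            # "Expected 5 but got 5"
msg_repr = check_equal_repr(5, "5")  # "Expected 5 but got '5'"
```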

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Prashanth Pappu
Sent: Thu 6/5/2008 10:54 AM
To: pig-user@incubator.apache.org; [EMAIL PROTECTED]
Subject: Re: Spaces (' ') in data atoms
 
Explicit type definitions will definitely help. But I still think that in
this example -

grunt> b1 = filter b by y==2;
grunt> dump b1;
(1,  2 )

Either we should throw an error or return no results, because the '=='
operator is implicitly stripping the spaces in ' 2 ' and concluding that
' 2 ' == 2! (The result of (' 2 ' == 2) should be the same as that of, say,
('X2X' == 2).) Meanwhile, cogroup isn't removing the spaces, and hence
' 2 ' ne 2.

Overall, where we do not have explicit type declarations, the behavior of
FILTER, COGROUP, etc. should be consistent: either they all remove the spaces
or none of them do. The rest, definitely, is users' responsibility.
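
The three behaviors in question can be sketched as follows. This is Python standing in for Pig, and the function names numeric_eq, string_eq and strict_numeric_eq are made up for illustration. Python's float() happens to ignore surrounding whitespace, which makes it a convenient model of a lenient '==', but nothing here claims that Pig parses numbers this way.

```python
def numeric_eq(atom, number):
    """Lenient '==': parse the atom as a number, ignoring surrounding spaces.
    Unparseable atoms simply do not match."""
    try:
        return float(atom) == number
    except ValueError:
        return False

def string_eq(atom, number):
    """'eq'-style comparison: compare the raw string representations."""
    return atom == str(number)

def strict_numeric_eq(atom, number):
    """Strict '==': refuse atoms with leading/trailing spaces instead of
    silently stripping them."""
    if atom != atom.strip():
        raise ValueError("atom %r has leading/trailing whitespace" % atom)
    return float(atom) == number

lenient = numeric_eq(" 2 ", 2)    # True: the spaces are silently dropped
garbage = numeric_eq("X2X", 2)    # False: not a number at all
as_text = string_eq(" 2 ", 2)     # False: ' 2 ' is not the string '2'
```

Under the strict variant, ' 2 ' == 2 raises an error, while the string comparison treats ' 2 ' like 'X2X' and returns no match. Either would be consistent; mixing the lenient and string behaviors is what produces the surprise.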

Prashanth

On Thu, Jun 5, 2008 at 5:47 AM, pi song <[EMAIL PROTECTED]> wrote:

> The main theme in Pig development at the moment is a new type system,
> which means you will be able to attach type information to your data.
> Basically, the problem you mention will not happen if you specify all
> the metadata. However, if you don't, this might happen again, and I think
> dealing with whitespace is users' responsibility.
>
> Do you think this will resolve your problem?
>
> Pi
>
> On Thu, Jun 5, 2008 at 9:37 AM, Prashanth Pappu <[EMAIL PROTECTED]>
> wrote:
>
> > I think there are a few unresolved issues due to the lack of explicit type
> > declarations in Pig. Especially with data atoms that have leading or
> > trailing spaces (' '), implicit typecasts into strings and integers/floats
> > can lead to unexpected results.
> >
> > Consider the following example -
> >
> > grunt> a = load '/test' using PigStorage(',') as (x,y);
> > grunt> dump a;
> > (1,  2 )
> > (2,  3 )
> > (3, 4)
> > (4, 5)
> >
> > grunt> b = load '/test' using PigStorage(',') as (x,y);
> > grunt> dump b;
> > (1,  2 )
> > (2,  3 )
> > (3, 4)
> > (4, 5)
> >
> > grunt> a1 = filter a by x==2;
> > grunt> dump a1;
> > (2,  3 )
> >
> > grunt> b1 = filter b by y==2;
> > grunt> dump b1;
> > (1,  2 )
> >
> > So, both a and b have tuples that can be filtered with x==2 and y==2
> > respectively.
> > But, what do we get when we cogroup them?
> >
> > grunt> c = cogroup a by (x) INNER, b by (y) INNER;
> > grunt> dump c;
> >
> > [NOTHING!]
> >
> > This is because COGROUP is treating x and y as strings but FILTER was
> > explicitly asked to treat them as integers/floats with the '==' comparator.
> > This can be quite confusing since COGROUP seems to be using the comparator
> > 'eq' by default!
> >
> > Hence the following result when we cogroup without the keyword 'INNER':
> >
> > grunt> c2 = cogroup a by (x), b by (y);
> > grunt> dump c2;
> > (1, {(1,  2 )}, {})
> > (2, {(2,  3 )}, {})
> > (3, {(3, 4)}, {})
> > (4, {(4, 5)}, {(3, 4)})
> > (5, {}, {(4, 5)})
> > ( 2 , {}, {(1,  2 )})
> > ( 3 , {}, {(2,  3 )})
> >
> > I think that we can avoid the confusion in one of two ways:
> >
> > (a) Forcing the rule that all data atoms will be stripped of leading and
> > trailing spaces (' '). This is important because data atoms have no
> > explicit type declaration.
> > (b) Or at least throwing an error when a string with a leading/trailing
> > space is typecast (implicitly or explicitly) to an integer or float. I.e.,
> > only the output of STRING(FLOAT) or STRING(INTEGER) should be allowed as
> > the input of FLOAT(STRING) or INTEGER(STRING) conversions.
> >
> > I see that, if unresolved, this can be an annoying problem as fields in
> > input files often tend to have leading/trailing spaces that are not part
> > of the field separator.
> >
> > Prashanth
> >
>