pig-user  

Re: Spaces (' ') in data atoms

Prashanth Pappu
Thu, 05 Jun 2008 10:55:05 -0700

Explicit type definitions will definitely help. But I still think that in
this example -

grunt> b1 = filter b by y==2;
grunt> dump b1;
(1,  2 )

Either we should throw an error or return no results, because the '=='
operator is implicitly stripping the spaces in ' 2 ' and deciding that
' 2 ' == 2! (The result of (' 2 ' == 2) should be the same as that of, say,
('X2X' == 2).) COGROUP, on the other hand, isn't removing the spaces, and
hence ' 2 ' ne 2.
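
To make the mismatch concrete, here is a minimal sketch in Python (not Pig's
actual implementation). The helper names numeric_eq and string_eq are
hypothetical stand-ins for '==' and 'eq', and the rows mirror the '/test'
data from the quoted message below. Python's float() happens to tolerate
surrounding whitespace, which is the lenient behavior '==' appears to have.

```python
# Rows as PigStorage(',') would hand them over: untyped string atoms.
rows = [("1", " 2 "), ("2", " 3 "), ("3", "4"), ("4", "5")]

def numeric_eq(atom, number):
    """Lenient numeric comparison (stand-in for '=='): float() ignores
    leading/trailing whitespace, so ' 2 ' compares equal to 2."""
    try:
        return float(atom) == number
    except ValueError:
        return False  # e.g. 'X2X' is not a number at all

def string_eq(left, right):
    """Exact string comparison (stand-in for 'eq'): spaces are significant."""
    return left == right

# FILTER b BY y == 2 keeps the row whose y is ' 2 '.
filtered = [r for r in rows if numeric_eq(r[1], 2)]

# An inner join of a.x with b.y on exact strings never pairs '2' with ' 2 '.
matches = [(a, b) for a in rows for b in rows if string_eq(a[0], b[1])]

print(filtered)
print(string_eq("2", " 2 "))
```

The same atom is accepted by one comparison and rejected by the other,
which is the inconsistency described above.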

Overall, where we do not have explicit type declarations, the behavior of
FILTER, COGROUP etc should be consistent - either they all remove the spaces
or none of them do. The rest, definitely, is users' responsibility.
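
As a rough illustration of what "consistent" could mean, the sketch below
(Python, purely hypothetical, not how Pig is built) routes both operators
through one normalize() function. The names normalize, filter_eq and cogroup
and the strip flag are my own assumptions for the example.

```python
from collections import defaultdict

rows = [("1", " 2 "), ("2", " 3 "), ("3", "4"), ("4", "5")]

def normalize(atom, strip=True):
    """Single place that decides whether surrounding spaces are significant."""
    return atom.strip() if strip else atom

def filter_eq(data, idx, value, strip=True):
    """Keep rows whose field idx equals value under the shared rule."""
    return [r for r in data if normalize(r[idx], strip) == value]

def cogroup(left, left_idx, right, right_idx, strip=True):
    """Group both inputs by key, using the same rule as filter_eq."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[normalize(row[left_idx], strip)][0].append(row)
    for row in right:
        groups[normalize(row[right_idx], strip)][1].append(row)
    return dict(groups)

# With stripping on, the filter and the cogroup agree that ' 2 ' is '2'.
print(filter_eq(rows, 1, "2"))
print(cogroup(rows, 0, rows, 1)["2"])

# With stripping off, both agree that it is not.
print(filter_eq(rows, 1, "2", strip=False))
print(cogroup(rows, 0, rows, 1, strip=False)["2"])
```

Either setting is defensible; the point is only that both operators follow
the same one.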

Prashanth

On Thu, Jun 5, 2008 at 5:47 AM, pi song <[EMAIL PROTECTED]> wrote:

> The main theme in Pig development at the moment is a new type system,
> which means you will be able to attach type information to your data.
> Basically, the problem you mention will not happen if you specify all
> the metadata. However, if you don't, this might happen again, and I think
> dealing with whitespace is users' responsibility.
>
> Do you think this will resolve your problem?
>
> Pi
>
> On Thu, Jun 5, 2008 at 9:37 AM, Prashanth Pappu <[EMAIL PROTECTED]>
> wrote:
>
> > I think there are a few unresolved issues due to the lack of explicit type
> > declarations in Pig. Especially with data atoms that have leading or
> > trailing spaces (' '), implicit typecasts to strings and integers/floats
> > can lead to unexpected results.
> >
> > Consider the following example -
> >
> > grunt> a = load '/test' using PigStorage(',') as (x,y);
> > grunt> dump a;
> > (1,  2 )
> > (2,  3 )
> > (3, 4)
> > (4, 5)
> >
> > grunt> b = load '/test' using PigStorage(',') as (x,y);
> > grunt> dump b;
> > (1,  2 )
> > (2,  3 )
> > (3, 4)
> > (4, 5)
> >
> > grunt> a1 = filter a by x==2;
> > grunt> dump a1;
> > (2,  3 )
> >
> > grunt> b1 = filter b by y==2;
> > grunt> dump b1;
> > (1,  2 )
> >
> > So, both a and b have tuples that can be filtered with x==2 and y==2
> > respectively.
> > But, what do we get when we cogroup them?
> >
> > grunt> c = cogroup a by (x) INNER, b by (y) INNER;
> > grunt> dump c;
> >
> > [NOTHING!]
> >
> > This is because COGROUP is treating x and y as strings but FILTER was
> > explicitly asked to treat them as integers/floats with the '=='
> > comparator. This can be quite confusing since COGROUP seems to be using
> > the comparator 'eq' by default!
> >
> > Hence the following result when we cogroup without the 'INNER' keyword:
> >
> > grunt> c2 = cogroup a by (x), b by (y);
> > grunt> dump c2;
> > (1, {(1,  2 )}, {})
> > (2, {(2,  3 )}, {})
> > (3, {(3, 4)}, {})
> > (4, {(4, 5)}, {(3, 4)})
> > (5, {}, {(4, 5)})
> > ( 2 , {}, {(1,  2 )})
> > ( 3 , {}, {(2,  3 )})
> >
> > I think that we can avoid the confusion in one of two ways:
> >
> > (a) Forcing the rule that all data atoms will be stripped of leading and
> > trailing spaces (' '). This is important because data atoms have no
> > explicit type declaration.
> > (b) Or at least throwing an error when a string with a leading/trailing
> > space is typecast (implicitly or explicitly) to an integer or float, i.e.,
> > only the output of STRING(FLOAT) or STRING(INTEGER) should be allowed as
> > the input of FLOAT(STRING) or INTEGER(STRING) conversions.
> >
> > I see that, if unresolved, this can be an annoying problem as fields in
> > input files often tend to have leading/trailing spaces that are not part
> > of the field separator.
> >
> > Prashanth
> >
>
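
Option (b) from the original message quoted above, accepting only strings
that are exactly what an integer-to-string conversion would produce, could
be sketched in Python as a round-trip check. This is only an illustration
under that assumption; strict_int is a hypothetical name, not a Pig
function.

```python
def strict_int(atom):
    """Convert atom to int only if it is exactly the canonical text of
    that integer; anything else (spaces, '+2', '02') is rejected."""
    try:
        value = int(atom)  # Python's int() would accept ' 2 ' on its own
    except ValueError:
        raise ValueError(f"not an integer: {atom!r}")
    if str(value) != atom:
        raise ValueError(f"non-canonical integer text: {atom!r}")
    return value

print(strict_int("2"))
for bad in (" 2 ", "X2X"):
    try:
        strict_int(bad)
    except ValueError as err:
        print("rejected:", err)
```

Under this rule ' 2 ' == 2 fails loudly instead of silently matching, so a
FILTER and a COGROUP over the same atom can no longer disagree unnoticed.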