pi song
Thu, 05 Jun 2008 05:48:05 -0700
The main theme in Pig development at the moment is a new type system,
which means you will be able to attach type information to your data.
Basically, the problem you mention will not happen if you specify all the
metadata. However, if you don't, this might happen again, and I think
dealing with whitespace is the user's responsibility.
Do you think this will resolve your problem?
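
To make the whitespace point concrete, here is a minimal sketch in plain
Python (not Pig) of how grouping on raw string keys misses matches that a
numeric comparison finds, and how trimming the fields first restores them.
The rows and the helper name cogroup_keys are made up for illustration, and
this is not a description of how Pig implements COGROUP.

```python
# Hypothetical rows: fields are raw strings as read from a comma-separated
# file, with the second field padded by spaces.
a = [("1", " 2 "), ("2", " 3 ")]
b = [("1", " 2 "), ("2", " 3 ")]


def cogroup_keys(left, right, left_idx, right_idx, normalize=lambda s: s):
    """Return the sorted keys present on both sides (an inner cogroup's keys)."""
    left_keys = {normalize(row[left_idx]) for row in left}
    right_keys = {normalize(row[right_idx]) for row in right}
    return sorted(left_keys & right_keys)


# Numeric comparison tolerates the padding: float(" 2 ") parses as 2.0.
numeric_match = [row for row in b if float(row[1]) == 2]

# Raw string keys: "2" (from a's first field) never equals " 2 " (from b's second).
raw_keys = cogroup_keys(a, b, 0, 1)

# Trimming each key first makes the string keys line up.
trimmed_keys = cogroup_keys(a, b, 0, 1, normalize=str.strip)

print(numeric_match, raw_keys, trimmed_keys)
```

So a filter that compares numerically and a grouping that compares strings
can disagree on the same data unless the fields are trimmed up front.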
Pi
On Thu, Jun 5, 2008 at 9:37 AM, Prashanth Pappu <[EMAIL PROTECTED]>
wrote:
> I think there are a few unresolved issues due to lack of explicit type
> declarations in PIG. Especially with data atoms that have leading or
> trailing spaces (' '), implicit typecasts into strings and integers/floats
> can lead to unexpected results.
>
> Consider the following example -
>
> grunt> a = load '/test' using PigStorage(',') as (x,y);
> grunt> dump a;
> (1, 2 )
> (2, 3 )
> (3, 4)
> (4, 5)
>
> grunt> b = load '/test' using PigStorage(',') as (x,y);
> grunt> dump b;
> (1, 2 )
> (2, 3 )
> (3, 4)
> (4, 5)
>
> grunt> a1 = filter a by x==2;
> grunt> dump a1;
> (2, 3 )
>
> grunt> b1 = filter b by y==2;
> grunt> dump b1;
> (1, 2 )
>
> So, both a and b have tuples that can be filtered with x==2 and y==2
> respectively.
> But, what do we get when we cogroup them?
>
> grunt> c = cogroup a by (x) INNER, b by (y) INNER;
> grunt> dump c;
>
> [NOTHING!]
>
> This is because COGROUP is treating x and y as strings but FILTER was
> explicitly asked to treat them as integers/floats with the '==' comparator.
> This can be quite confusing since COGROUP seems to be using the comparator
> 'eq' by default!
>
> Hence the following result when we cogroup without the keyword 'INNER':
>
> grunt> c2 = cogroup a by (x), b by (y);
> grunt> dump c2;
> (1, {(1, 2 )}, {})
> (2, {(2, 3 )}, {})
> (3, {(3, 4)}, {})
> (4, {(4, 5)}, {(3, 4)})
> (5, {}, {(4, 5)})
> ( 2 , {}, {(1, 2 )})
> ( 3 , {}, {(2, 3 )})
>
> I think that we can avoid the confusion in one of two ways:
>
> (a) Forcing the rule that all data atoms will be stripped of leading and
> trailing spaces (' '). This is important because data atoms have no
> explicit type declaration.
> (b) Or at least throwing an error when a string with a leading/trailing
> space is typecast (implicitly or explicitly) to an integer or float. I.e.,
> only the output of STRING(FLOAT) or STRING(INTEGER) should be allowed as
> the input of FLOAT(STRING) or INTEGER(STRING) conversions.
>
> I see that, if unresolved, this can be an annoying problem as fields in
> input files often tend to have leading/trailing spaces that are not part of
> the field separator.
>
> Prashanth
>
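
Option (b) in the quoted message amounts to a round-trip check: accept a
string as an integer only if converting the parsed value back to a string
reproduces the input exactly. A minimal sketch in plain Python (not Pig)
follows; strict_int and is_strict_int are hypothetical helper names, and
note that this rule also rejects forms such as "+2" or "02".

```python
def strict_int(text: str) -> int:
    """Parse text as an int only if str(int(text)) round-trips to text."""
    try:
        # Python's int() itself tolerates surrounding whitespace,
        # so the round-trip check below is what enforces strictness.
        value = int(text)
    except ValueError:
        raise ValueError(f"not an integer: {text!r}") from None
    if str(value) != text:
        raise ValueError(f"not in canonical integer form: {text!r}")
    return value


def is_strict_int(text: str) -> bool:
    """Convenience wrapper: True when strict_int accepts the text."""
    try:
        strict_int(text)
        return True
    except ValueError:
        return False
```

With this rule, "2" is accepted while " 2 " raises an error instead of
being silently treated as 2.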