pig-user  

Re: Spaces (' ') in data atoms

Alan Gates
Thu, 05 Jun 2008 12:01:43 -0700

I think that ' 2', when converted to an integer, should result in 2, not NULL or 0. That definitely seems like the most expected behavior. Your point that cogroup and filter treat ' 2' differently is valid. Fortunately, the addition of types will handle that, mostly. If a user declares something to be an integer and then uses it as a cogroup key, it must be compared against something else declared as an integer (the key can't be a string on one side and an int on the other, because == isn't defined for string == int). So the following will do what you want:

A = load 'myfile' as (a: int, b: chararray);
B = load 'myotherfile' as (x: int, y: float);
C = cogroup A by a, B by x;
...
G = filter F by a == 2;

if 'myfile' contains
2,this is a string

and 'myotherfile' contains
2 ,3.141592654

then these tuples will match for cogroup and they will match for the filter in G.
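
The typed case above can be modeled with a small Python sketch. It is only an illustration of the described semantics, not Pig's implementation: the helpers `load_typed` and `cogroup` are hypothetical names, and Python's `int()` happens to ignore surrounding whitespace, which stands in for the ' 2' -> 2 conversion described above.

```python
# Illustrative model only; load_typed and cogroup are hypothetical helpers,
# not Pig APIs.
from collections import defaultdict


def load_typed(lines, casts):
    """Split each comma-separated line and apply one cast per field."""
    return [
        tuple(cast(field) for cast, field in zip(casts, line.split(",")))
        for line in lines
    ]


def cogroup(left, left_key, right, right_key):
    """Group both relations by key: {key: (left_rows, right_rows)}."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[left_key]][0].append(row)
    for row in right:
        groups[row[right_key]][1].append(row)
    return dict(groups)


# 'myfile' and 'myotherfile' contents from the message above.
A = load_typed(["2,this is a string"], (int, str))
B = load_typed(["2 ,3.141592654"], (int, float))

# Both keys were converted to the int 2 at load time, so they share a group.
C = cogroup(A, 0, B, 0)

# The filter compares int to int, so the same row passes.
G = [row for row in A if row[0] == 2]
```

Because the conversion happens once, at load time, the grouping and the filter see the same integer key.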

The one place where you'll still get differing results is if you do the following:

A = load 'myfile' as (a, b);
B = load 'myotherfile' as (x, y);
C = cogroup A by a, B by x;
...
G = filter F by a == 2;

In this case, in the cogroup, a and x will be compared as byte arrays because the user did not declare a type. But in the filter a will be cast to an int because the user is comparing to an int. If G is changed to

G = filter F by a == '2';

then the comparison with '2 ' will fail the filter.
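
A minimal Python sketch of the untyped case, assuming undeclared fields are held as raw bytes and only cast when compared against an int constant. The helper `to_int_or_none` is hypothetical and does not correspond to any Pig function.

```python
def to_int_or_none(raw):
    """Cast raw bytes to int, tolerating surrounding whitespace."""
    try:
        return int(raw.decode("ascii"))
    except ValueError:
        return None


a = b"2"   # key from 'myfile'
x = b"2 "  # key from 'myotherfile', with a trailing space

# cogroup: no declared type, so the keys are compared byte for byte.
cogroup_match = (a == x)

# filter ... == 2: the field is cast to int before the comparison.
filter_int_match = (to_int_or_none(x) == 2)

# filter ... == '2': compared as a string, so the trailing space matters.
filter_str_match = (x == b"2")
```

Here `cogroup_match` and `filter_str_match` are False while `filter_int_match` is True, which is the inconsistency under discussion.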

Alan.

Prashanth Pappu wrote:
Explicit type definitions will definitely help. But I still think that in
this example -

grunt> b1 = filter b by y==2;
grunt> dump b1;
(1,  2 )

Either we should throw an error or not return any results, because the '=='
operator is implicitly stripping the spaces in ' 2 ' and determining that
' 2 ' == 2! (The result of (' 2 ' == 2) should be the same as, say,
('X2X' == 2).) And cogroup isn't removing the spaces, and hence ' 2 ' ne 2.

Overall, where we do not have explicit type declarations, the behavior of
FILTER, COGROUP etc should be consistent - either they all remove the spaces
or none of them do. The rest, definitely, is users' responsibility.

Prashanth

On Thu, Jun 5, 2008 at 5:47 AM, pi song <[EMAIL PROTECTED]> wrote:

The main theme in Pig development at the moment is developing a new type
system, which means you will be able to attach type information to your
data. Basically, the problem you mention will not happen if you specify all
the metadata. However, if you don't, this might happen again, and I think
dealing with whitespace is the user's responsibility.

Do you think this will resolve your problem?

Pi

On Thu, Jun 5, 2008 at 9:37 AM, Prashanth Pappu <[EMAIL PROTECTED]>
wrote:

I think there are a few unresolved issues due to lack of explicit type
declarations in PIG. Especially with data atoms that have leading or
trailing spaces (' '), implicit typecasts into strings and integers/floats
can lead to unexpected results.

Consider the following example -

grunt> a = load '/test' using PigStorage(',') as (x,y);
grunt> dump a;
(1,  2 )
(2,  3 )
(3, 4)
(4, 5)

grunt> b = load '/test' using PigStorage(',') as (x,y);
grunt> dump b;
(1,  2 )
(2,  3 )
(3, 4)
(4, 5)

grunt> a1 = filter a by x==2;
grunt> dump a1;
(2,  3 )

grunt> b1 = filter b by y==2;
grunt> dump b1;
(1,  2 )

So, both a and b have tuples that can be filtered with x==2 and y==2
respectively.
But, what do we get when we cogroup them?

grunt> c = cogroup a by (x) INNER, b by (y) INNER;
grunt> dump c;

[NOTHING!]

This is because COGROUP is treating x and y as strings but FILTER was
explicitly asked to treat them as integers/floats with the '==' comparator.
This can be quite confusing since COGROUP seems to be using the comparator
'eq' by default!

Hence the following result when we cogroup without the keyword 'INNER':

grunt> c2 = cogroup a by (x), b by (y);
grunt> dump c2;
(1, {(1,  2 )}, {})
(2, {(2,  3 )}, {})
(3, {(3, 4)}, {})
(4, {(4, 5)}, {(3, 4)})
(5, {}, {(4, 5)})
( 2 , {}, {(1,  2 )})
( 3 , {}, {(2,  3 )})
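
The grouping in c2 can be reproduced with a short Python sketch that keys on the raw strings. The file contents in `lines` are an assumption inferred from the dump output above (the original '/test' file is not shown), and the code is a model of string-keyed grouping, not Pig's implementation.

```python
from collections import defaultdict

# Assumed contents of '/test', inferred from the dump output.
lines = ["1, 2 ", "2, 3 ", "3,4", "4,5"]
rows = [tuple(line.split(",")) for line in lines]

# cogroup a by x, b by y, using the raw strings as keys (no stripping).
groups = defaultdict(lambda: ([], []))
for row in rows:
    groups[row[0]][0].append(row)  # relation a, keyed on x
for row in rows:
    groups[row[1]][1].append(row)  # relation b, keyed on y

for key, (a_bag, b_bag) in groups.items():
    print(repr(key), a_bag, b_bag)
```

The keys '2' and ' 2 ' land in separate groups, each with one empty bag, matching the c2 dump.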

I think that we can avoid the confusion in one of two ways:

(a) Forcing the rule that all data atoms will be stripped of leading and
trailing spaces (' '). This is important because data atoms have no explicit
type declaration.
(b) Or at least throwing an error when a string with a leading/trailing
space is typecast (implicitly or explicitly) to an integer or float. I.e.,
only the output of STRING(FLOAT) or STRING(INTEGER) should be allowed as the
input of FLOAT(STRING) or INTEGER(STRING) conversions.
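
Both options can be sketched in Python for the integer case. The names `strip_atom` and `strict_int` are hypothetical, and the round-trip check is one possible reading of rule (b), not an existing Pig behavior.

```python
def strip_atom(raw):
    """Option (a): drop leading and trailing spaces from every atom."""
    return raw.strip(" ")


def strict_int(raw):
    """Option (b): accept only strings that str(int) could have produced."""
    value = int(raw)  # raises ValueError for input such as 'X2X'
    if str(value) != raw:
        # int() tolerated padding or other decoration; reject it explicitly.
        raise ValueError(f"not a canonical integer: {raw!r}")
    return value


print(strip_atom(" 2 "))  # the atom without its padding
print(strict_int("2"))    # accepted: round-trips exactly
try:
    strict_int(" 2 ")
except ValueError as err:
    print("rejected:", err)
```

Note that the round-trip rule is strict: it also rejects forms such as '02' or '+2', which may or may not be desirable.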

I see that, if unresolved, this can be an annoying problem, as fields in
input files often tend to have leading/trailing spaces that are not part of
the field separator.

Prashanth