pig-user  

Re: Spaces (' ') in data atoms

Alan Gates
Mon, 09 Jun 2008 07:38:58 -0700

While the 'eq' etc. operators will be supported for backward compatibility, they will no longer be necessary. It will be possible to use == for any datatype, or even between two datatypes where an implicit cast is defined (e.g. int -> long). For full details of the changes please see http://wiki.apache.org/pig/PigTypesFunctionalSpec

Alan.

pi song wrote:
In fact, I plan to show warning messages if there are implicit casts in
the plan. This is already part of the design.

On 6/6/08, Prashanth Pappu <[EMAIL PROTECTED]> wrote:
Also, if we are going to introduce data types, then perhaps we should unify
the string and numerical comparison operators.

I.e., a==b will do a numerical or string comparison based on the data types
of a and b. And we won't need separate operators '==' and 'eq'. This is
similar to SQL.

On Thu, Jun 5, 2008 at 12:00 PM, Alan Gates <[EMAIL PROTECTED]> wrote:

I think that ' 2', when converted to an integer, should result in 2, not
NULL or 0.  That definitely seems like the most expected behavior.  Your
point that cogroup and filter treat ' 2' differently is valid.
Fortunately, the addition of types will handle that, mostly.  If a user
declares something to be an integer and then uses it as a cogroup key, it
must be compared against something else declared as an integer (the key
can't be a string on one side and an int on the other, because == isn't
defined for string == int).  So the following will do what you want:

A = load 'myfile' as (a: int, b: chararray);
B = load 'myotherfile' as (x: int, y: float);
C = cogroup A by a, B by x;
...
G = filter F by a == 2;

if 'myfile' contains
2,this is a string

and 'myotherfile' contains
2 ,3.141592654

then these tuples will match for the cogroup, and they will match for the
filter in G.

The one place where you'll still get differing results is if you do the
following:

A = load 'myfile' as (a, b);
B = load 'myotherfile' as (x, y);
C = cogroup A by a, B by x;
...
G = filter F by a == 2;

In this case, in the cogroup, a and x will be compared as byte arrays
because the user did not declare a type.  But in the filter, a will be
cast to an int because the user is comparing to an int.  If G is changed to

G = filter F by a == '2';

then the value '2 ' will fail the filter, because it is now compared as a
string.
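
To make the difference concrete, here is a minimal Python sketch (not Pig
code) of the comparison modes described above. The function names are
hypothetical, and it assumes that a cast to int ignores surrounding
whitespace, which is the behavior argued for earlier in this message rather
than a confirmed implementation detail.

```python
# Illustrative model only, not Pig's implementation.
# Assumption: an untyped field is held as raw bytes, and casting it to int
# ignores surrounding whitespace (the behavior proposed in this thread).

def cogroup_keys_match(a: bytes, x: bytes) -> bool:
    """Untyped cogroup keys: compared byte for byte."""
    return a == x

def filter_eq_int(a: bytes, literal: int) -> bool:
    """Filter against an int literal: the field is cast to int first."""
    try:
        return int(a.decode("utf-8")) == literal
    except ValueError:
        return False  # hypothetical choice: a failed cast never matches

def filter_eq_str(a: bytes, literal: str) -> bool:
    """Filter against a string literal: no cast, so spaces are significant."""
    return a.decode("utf-8") == literal

print(cogroup_keys_match(b"2", b"2 "))  # False
print(filter_eq_int(b"2 ", 2))          # True
print(filter_eq_str(b"2 ", "2"))        # False
```

The same raw value '2 ' therefore matches or fails depending only on which
comparison the statement happens to trigger.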

Alan.


Prashanth Pappu wrote:

Explicit type definitions will definitely help. But I still think that in
this example -

grunt> b1 = filter b by y==2;
grunt> dump b1;
(1,  2 )

we should either throw an error or not return any results, because the '=='
operator is implicitly stripping the spaces in ' 2 ' and determining that
' 2 ' == 2!  (The result of (' 2 ' == 2) should be the same as, say,
('X2X' == 2).)  And cogroup isn't removing the spaces, hence ' 2 ' ne 2.

Overall, where we do not have explicit type declarations, the behavior of
FILTER, COGROUP etc. should be consistent - either they all remove the
spaces or none of them do. The rest, definitely, is the users'
responsibility.

Prashanth

On Thu, Jun 5, 2008 at 5:47 AM, pi song <[EMAIL PROTECTED]> wrote:



The main theme in Pig development at the moment is developing a new type
system, which means you will be able to attach type information to your
data. Basically, the problem you mention will not happen if you specify all
the metadata. However, if you don't, this might happen again, and I think
dealing with whitespace is the users' responsibility.

Do you think this will resolve your problem?

Pi

On Thu, Jun 5, 2008 at 9:37 AM, Prashanth Pappu <[EMAIL PROTECTED]> wrote:



I think there are a few unresolved issues due to the lack of explicit type
declarations in PIG. Especially with data atoms that have leading or
trailing spaces (' '), implicit typecasts into strings and integers/floats
can lead to unexpected results.

Consider the following example -

grunt> a = load '/test' using PigStorage(',') as (x,y);
grunt> dump a;
(1,  2 )
(2,  3 )
(3, 4)
(4, 5)

grunt> b = load '/test' using PigStorage(',') as (x,y);
grunt> dump b;
(1,  2 )
(2,  3 )
(3, 4)
(4, 5)

grunt> a1 = filter a by x==2;
grunt> dump a1;
(2,  3 )

grunt> b1 = filter b by y==2;
grunt> dump b1;
(1,  2 )

So, both a and b have tuples that can be filtered with x==2 and y==2
respectively.
But, what do we get when we cogroup them?

grunt> c = cogroup a by (x) INNER, b by (y) INNER;
grunt> dump c;

[NOTHING!]

This is because COGROUP is treating x and y as strings but FILTER was
explicitly asked to treat them as integers/floats with the '==' comparator.

This can be quite confusing since COGROUP seems to be using the comparator
'eq' by default!

Hence, the following result when we cogroup without the 'INNER' keyword:
grunt> c2 = cogroup a by (x), b by (y);
grunt> dump c2;
(1, {(1,  2 )}, {})
(2, {(2,  3 )}, {})
(3, {(3, 4)}, {})
(4, {(4, 5)}, {(3, 4)})
(5, {}, {(4, 5)})
( 2 , {}, {(1,  2 )})
( 3 , {}, {(2,  3 )})
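
As a rough illustration of why the keys split this way, the following Python
sketch (not Pig code) groups both inputs by exact string key. The row
contents are an assumption inferred from the dump output above, and the
cogroup function here is a hypothetical stand-in, not Pig's implementation.

```python
# Illustrative model only; this is not how Pig implements COGROUP.
# Assumed raw fields for '/test', inferred from the dump output:
# the first two y values carry surrounding spaces, the last two do not.
rows = [("1", " 2 "), ("2", " 3 "), ("3", "4"), ("4", "5")]

def cogroup(left, left_col, right, right_col):
    """Group both inputs by key, comparing keys as exact strings."""
    groups = {}
    for row in left:
        groups.setdefault(row[left_col], ([], []))[0].append(row)
    for row in right:
        groups.setdefault(row[right_col], ([], []))[1].append(row)
    return groups

c2 = cogroup(rows, 0, rows, 1)  # a by x, b by y
for key, (from_a, from_b) in c2.items():
    print(repr(key), from_a, from_b)
```

In this model the keys '2' and ' 2 ' land in separate groups, each with one
empty bag, which mirrors the split visible in the dump of c2.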

I think that we can avoid the confusion in one of two ways:

(a) Forcing the rule that all data atoms will be stripped of leading and
trailing spaces (' '). This is important because data atoms have no
explicit type declaration.
(b) Or at least throwing an error when a string with a leading/trailing
space is typecast (implicitly or explicitly) to an integer or float. I.e.,
only the output of STRING(FLOAT) or STRING(INTEGER) should be allowed as
the input of FLOAT(STRING) or INTEGER(STRING) conversions.
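
The two options could be sketched as follows in Python (not Pig code). Both
function names are hypothetical, and the round-trip check in the second one
is just one possible reading of option (b), shown for integers only.

```python
# Illustrative sketch of the two proposals; names are hypothetical.

def strip_atom(atom: str) -> str:
    """Option (a): drop leading and trailing spaces from every atom."""
    return atom.strip(" ")

def strict_int(atom: str) -> int:
    """Option (b): accept only strings that str(int) could have produced."""
    value = int(atom)        # raises ValueError for input such as 'X2X'
    if str(value) != atom:   # rejects ' 2 ', '+2', '02', ...
        raise ValueError(f"not a canonical integer string: {atom!r}")
    return value

print(strip_atom(" 2 ") == "2")  # True
print(strict_int("2"))           # 2
try:
    strict_int(" 2 ")
except ValueError as err:
    print("rejected:", err)
```

Option (a) makes ' 2 ' and '2' the same key everywhere, while option (b)
keeps them distinct but refuses to treat ' 2 ' as a number at all. A float
version of the round-trip check would need more care, since several
different strings can denote the same float.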

I see that, if unresolved, this can be an annoying problem, as fields in
input files often tend to have leading/trailing spaces that are not part of
the field separator.

Prashanth