pig-user  

Re: Spaces (' ') in data atoms

Prashanth Pappu
Thu, 05 Jun 2008 14:50:00 -0700

Also, if we are going to introduce data types, then perhaps we should unify
the string and numerical comparison operators.

I.e., a==b will do a numerical or string comparison based on the data types
of a and b. And we won't need separate operators '==' and 'eq'. This is
similar to SQL.
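
To make the idea concrete, here is a minimal sketch in Python rather than Pig Latin. It is only a model: unified_eq is a hypothetical name, not a Pig function, and the choice to reject mixed string/number operands is taken from Alan's remark below that == isn't defined for string == int.

```python
NUMERIC = (int, float)


def unified_eq(a, b):
    """Hypothetical unified '==': pick the comparison from the operand types."""
    if isinstance(a, NUMERIC) and isinstance(b, NUMERIC):
        return a == b  # numerical comparison, so 2 equals 2.0
    if isinstance(a, str) and isinstance(b, str):
        return a == b  # string comparison, so spaces are significant
    # Assumption: mixed operands are an error rather than an implicit cast.
    raise TypeError(f"'==' is not defined for {type(a).__name__} == {type(b).__name__}")
```

With declared types there is then no need for a separate 'eq': two numbers compare numerically, two strings compare character by character, and a string against a number is rejected instead of being silently cast.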

On Thu, Jun 5, 2008 at 12:00 PM, Alan Gates <[EMAIL PROTECTED]> wrote:

> I think that ' 2', when converted to an integer, should result in 2, not
> NULL or 0.  That definitely seems like the most expected behavior.  Your
> point that cogroup and filter treat ' 2' differently is valid.
> Fortunately, the addition of types will handle that, mostly.  If a user
> declares something to be an integer and then uses it as cogroup key, it must
> be compared against something else declared as an integer (the key can't be
> string on one side and int on the other, because == isn't defined for string
> == int).  So the following will do what you want:
>
> A = load 'myfile' as (a: int, b: chararray);
> B = load 'myotherfile' as (x: int, y: float);
> C = cogroup A by a, B by x;
> ...
> G = filter F by a == 2;
>
> if 'myfile' contains
> 2,this is a string
>
> and 'myotherfile' contains
> 2 ,3.141592654
>
> then these tuples will match for cogroup and they will match for the filter
> in G.
>
> The one place where you'll still get differing results is if you do the
> following:
>
> A = load 'myfile' as (a, b);
> B = load 'myotherfile' as (x, y);
> C = cogroup A by a, B by x;
> ...
> G = filter F by a == 2;
>
> In this case, in the cogroup, a and x will be compared as byte arrays
> because the user did not declare a type.  But in the filter a will be cast
> to an int because the user is comparing to an int.  If G is changed to
>
> G = filter F by a == '2';
>
> then '2 ' will fail the comparison in the filter.
>
> Alan.
>
>
> Prashanth Pappu wrote:
>
>> Explicit type definitions will definitely help. But I still think that in
>> this example -
>>
>> grunt> b1 = filter b by y==2;
>> grunt> dump b1;
>> (1,  2 )
>>
>> Either we should throw an error or not return any results. Because the '=='
>> operator is implicitly stripping the spaces in ' 2 ' and determining that
>> ' 2 ' == 2!  (The result of (' 2 ' ==2) should be the same as say, ('X2X'
>> ==2)) And cogroup isn't removing the spaces and hence ' 2 ' ne 2.
>>
>> Overall, where we do not have explicit type declarations, the behavior of
>> FILTER, COGROUP etc should be consistent - either they all remove the spaces
>> or none of them do. The rest, definitely, is users' responsibility.
>>
>> Prashanth
>>
>> On Thu, Jun 5, 2008 at 5:47 AM, pi song <[EMAIL PROTECTED]> wrote:
>>
>>
>>
>>> The main theme in Pig development at the moment is developing a new type
>>> system, which means you will be able to attach type information to your
>>> data. Basically, the problem you mention will not happen if you specify all
>>> the metadata. However, if you don't, this might happen again, and I think
>>> dealing with whitespace is the user's responsibility.
>>>
>>> Do you think this will resolve your problem?
>>>
>>> Pi
>>>
>>> On Thu, Jun 5, 2008 at 9:37 AM, Prashanth Pappu <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>
>>>
>>>> I think there are a few unresolved issues due to lack of explicit type
>>>> declarations in PIG. Especially with data atoms that have leading or
>>>> trailing spaces (' '), implicit typecast into strings and integers/floats
>>>> can lead to unexpected results.
>>>>
>>>> Consider the following example -
>>>>
>>>> grunt> a = load '/test' using PigStorage(',') as (x,y);
>>>> grunt> dump a;
>>>> (1,  2 )
>>>> (2,  3 )
>>>> (3, 4)
>>>> (4, 5)
>>>>
>>>> grunt> b = load '/test' using PigStorage(',') as (x,y);
>>>> grunt> dump b;
>>>> (1,  2 )
>>>> (2,  3 )
>>>> (3, 4)
>>>> (4, 5)
>>>>
>>>> grunt> a1 = filter a by x==2;
>>>> grunt> dump a1;
>>>> (2,  3 )
>>>>
>>>> grunt> b1 = filter b by y==2;
>>>> grunt> dump b1;
>>>> (1,  2 )
>>>>
>>>> So, both a and b have tuples that can be filtered with x==2 and y==2
>>>> respectively.
>>>> But, what do we get when we cogroup them?
>>>>
>>>> grunt> c = cogroup a by (x) INNER, b by (y) INNER;
>>>> grunt> dump c;
>>>>
>>>> [NOTHING!]
>>>>
>>>> This is because COGROUP is treating x and y as strings but FILTER was
>>>> explicitly asked to treat them as integers/floats with the '=='
>>>> comparator.
>>>> This can be quite confusing since COGROUP seems to be using the comparator
>>>> 'eq' by default!
>>>>
>>>> Hence, the following result when we cogroup without the keywords 'INNER'
>>>>
>>>> grunt> c2 = cogroup a by (x), b by (y);
>>>> grunt> dump c2;
>>>> (1, {(1,  2 )}, {})
>>>> (2, {(2,  3 )}, {})
>>>> (3, {(3, 4)}, {})
>>>> (4, {(4, 5)}, {(3, 4)})
>>>> (5, {}, {(4, 5)})
>>>> ( 2 , {}, {(1,  2 )})
>>>> ( 3 , {}, {(2,  3 )})
>>>>
>>>> I think that we can avoid the confusion in one of two ways:
>>>>
>>>> (a) Forcing the rule that all data atoms will be stripped of leading and
>>>> trailing spaces (' '). This is important because data atoms have no
>>>> explicit type declaration.
>>>> (b) Or at least throwing an error when a string with a leading/trailing
>>>> space is typecast (implicitly or explicitly) to an integer or float. I.e.,
>>>> only the output of STRING(FLOAT) or STRING(INTEGER) should be allowed as
>>>> the input of FLOAT(STRING) or INTEGER(STRING) conversions.
>>>>
>>>> I see that, if unresolved, this can be an annoying problem as fields in
>>>> input files often tend to have leading/trailing spaces that are not part
>>>> of the field separator.
>>>>
>>>> Prashanth
>>>>
>>>>
>>>>
>>>
>>
>>
>
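
For reference, the mismatch in the original report quoted above can be modeled outside Pig. The following is a rough Python sketch, not Pig's actual implementation: filter_numeric and cogroup are hypothetical helpers, and the assumption that the numeric cast ignores surrounding spaces is inferred only from the filter results shown in the thread.

```python
from collections import defaultdict

# Untyped atoms kept as raw text, mirroring the '/test' rows quoted above.
ROWS = [("1", " 2 "), ("2", " 3 "), ("3", "4"), ("4", "5")]


def filter_numeric(rows, col, value):
    """Model of FILTER ... BY col == <number>: cast the atom, then compare."""
    kept = []
    for row in rows:
        try:
            # Assumption: the cast ignores surrounding spaces, as float() does.
            if float(row[col]) == value:
                kept.append(row)
        except ValueError:
            pass  # atoms that are not numeric never match
    return kept


def cogroup(left, lcol, right, rcol):
    """Model of COGROUP without INNER: keys are compared as raw text."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[lcol]][0].append(row)
    for row in right:
        groups[row[rcol]][1].append(row)
    return dict(groups)


a1 = filter_numeric(ROWS, 0, 2)  # keeps the row whose x is '2'
b1 = filter_numeric(ROWS, 1, 2)  # keeps the row whose y is ' 2 '
c2 = cogroup(ROWS, 0, ROWS, 1)   # '2' and ' 2 ' end up in different groups
```

In this model both filters find a row for the value 2, yet the group keyed '2' has an empty bag on the b side and a separate group keyed ' 2 ' holds the b row, which is the same shape as the c2 dump in the original report.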