pig-user  

Re: Spaces (' ') in data atoms

pi song
Thu, 05 Jun 2008 23:48:32 -0700

In fact, I plan to show warning messages if there are implicit casts in
the plan. This has already been done by design.

On 6/6/08, Prashanth Pappu <[EMAIL PROTECTED]> wrote:
>
> Also, if we are going to introduce data types, then perhaps we should unify
> the string and numerical comparison operators.
>
> I.e., a==b will do a numerical or string comparison based on the data types
> of a and b. And we won't need separate operators '==' and 'eq'. This is
> similar to SQL.
>
> On Thu, Jun 5, 2008 at 12:00 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
>
> > I think that ' 2', when converted to an integer, should result in 2, not
> > NULL or 0.  That definitely seems like the most expected behavior.  Your
> > point that cogroup and filter treat ' 2' differently is valid.
> > Fortunately, the addition of types will handle that, mostly.  If a user
> > declares something to be an integer and then uses it as a cogroup key, it
> > must be compared against something else declared as an integer (the key
> > can't be string on one side and int on the other, because == isn't
> > defined for string == int).  So the following will do what you want:
> >
> > A = load 'myfile' as (a: int, b: chararray);
> > B = load 'myotherfile' as (x: int, y: float);
> > C = cogroup A by a, B by x;
> > ...
> > G = filter F by a == 2;
> >
> > if 'myfile' contains
> > 2,this is a string
> >
> > and 'myotherfile' contains
> > 2 ,3.141592654
> >
> > then these tuples will match for cogroup and they will match for the
> > filter in G.
> >
> > The one place where you'll still get differing results is if you do the
> > following:
> >
> > A = load 'myfile' as (a, b);
> > B = load 'myotherfile' as (x, y);
> > C = cogroup A by a, B by x;
> > ...
> > G = filter F by a == 2;
> >
> > In this case, in the cogroup, a and x will be compared as byte arrays
> > because the user did not declare a type.  But in the filter a will be
> > cast to an int because the user is comparing to an int.  If G is changed to
> >
> > G = filter F by a == '2';
> >
> > then the '2 ' will fail the filter comparison.
> >
> > Alan.
> >
> >
> > Prashanth Pappu wrote:
> >
> >> Explicit type definitions will definitely help. But I still think that
> >> in this example -
> >>
> >> grunt> b1 = filter b by y==2;
> >> grunt> dump b1;
> >> (1,  2 )
> >>
> >> Either we should throw an error or not return any results, because the
> >> '==' operator is implicitly stripping the spaces in ' 2 ' and determining
> >> that ' 2 ' == 2!  (The result of (' 2 ' == 2) should be the same as, say,
> >> ('X2X' == 2).) And cogroup isn't removing the spaces and hence ' 2 ' ne 2.
> >>
> >> Overall, where we do not have explicit type declarations, the behavior
> >> of FILTER, COGROUP etc. should be consistent - either they all remove the
> >> spaces or none of them do. The rest, definitely, is users' responsibility.
> >>
> >> Prashanth
> >>
> >> On Thu, Jun 5, 2008 at 5:47 AM, pi song <[EMAIL PROTECTED]> wrote:
> >>
> >>
> >>
> >>> The main theme in Pig development at the moment is developing a new
> >>> type system, which means you will be able to attach type information to
> >>> your data. Basically, the problem you mention will not happen if you
> >>> specify all the metadata. However, if you don't, this might happen again,
> >>> and I think dealing with whitespace is users' responsibility.
> >>>
> >>> Do you think this will resolve your problem?
> >>>
> >>> Pi
> >>>
> >>> On Thu, Jun 5, 2008 at 9:37 AM, Prashanth Pappu <[EMAIL PROTECTED]> wrote:
> >>>
> >>>
> >>>
> >>>> I think there are a few unresolved issues due to lack of explicit type
> >>>> declarations in PIG. Especially with data atoms that have leading or
> >>>> trailing spaces (' '), implicit typecasts into strings and
> >>>> integers/floats can lead to unexpected results.
> >>>>
> >>>> Consider the following example -
> >>>>
> >>>> grunt> a = load '/test' using PigStorage(',') as (x,y);
> >>>> grunt> dump a;
> >>>> (1,  2 )
> >>>> (2,  3 )
> >>>> (3, 4)
> >>>> (4, 5)
> >>>>
> >>>> grunt> b = load '/test' using PigStorage(',') as (x,y);
> >>>> grunt> dump b;
> >>>> (1,  2 )
> >>>> (2,  3 )
> >>>> (3, 4)
> >>>> (4, 5)
> >>>>
> >>>> grunt> a1 = filter a by x==2;
> >>>> grunt> dump a1;
> >>>> (2,  3 )
> >>>>
> >>>> grunt> b1 = filter b by y==2;
> >>>> grunt> dump b1;
> >>>> (1,  2 )
> >>>>
> >>>> So, both a and b have tuples that can be filtered with x==2 and y==2
> >>>> respectively.
> >>>> But, what do we get when we cogroup them?
> >>>>
> >>>> grunt> c = cogroup a by (x) INNER, b by (y) INNER;
> >>>> grunt> dump c;
> >>>>
> >>>> [NOTHING!]
> >>>>
> >>>> This is because COGROUP is treating x and y as strings but FILTER was
> >>>> explicitly asked to treat them as integers/floats with the '=='
> >>>> comparator.
> >>>>
> >>>> This can be quite confusing since COGROUP seems to be using the
> >>>> comparator 'eq' by default!
> >>>>
> >>>> Hence, the following result when we cogroup without the keyword 'INNER'
> >>>>
> >>>> grunt> c2 = cogroup a by (x), b by (y);
> >>>> grunt> dump c2;
> >>>> (1, {(1,  2 )}, {})
> >>>> (2, {(2,  3 )}, {})
> >>>> (3, {(3, 4)}, {})
> >>>> (4, {(4, 5)}, {(3, 4)})
> >>>> (5, {}, {(4, 5)})
> >>>> ( 2 , {}, {(1,  2 )})
> >>>> ( 3 , {}, {(2,  3 )})
> >>>>
> >>>> I think that we can avoid the confusion in one of two ways:
> >>>>
> >>>> (a) Forcing the rule that all data atoms will be stripped of leading
> >>>> and trailing spaces (' '). This is important because data atoms have
> >>>> no explicit type declaration.
> >>>> (b) Or at least throwing an error when a string with a leading/trailing
> >>>> space is typecast (implicitly or explicitly) to an integer or float.
> >>>> I.e., only the output of STRING(FLOAT) or STRING(INTEGER) should be
> >>>> allowed as the input of FLOAT(STRING) or INTEGER(STRING) conversions.
> >>>>
> >>>> I see that, if unresolved, this can be an annoying problem, as fields
> >>>> in input files often tend to have leading/trailing spaces that are not
> >>>> part of the field separator.
> >>>>
> >>>> Prashanth
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
>
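
For readers following the thread, the mismatch described above can be sketched outside Pig. The snippet below is a plain Python model, not Pig's implementation: the names `lenient_int`, `strict_int`, `filter_eq` and `cogroup` are hypothetical, and the data is the '/test' example with fields kept exactly as a ',' split would leave them. It assumes the filter casts atoms to integers while ignoring surrounding spaces, and that cogroup compares the raw, uncast atoms.

```python
# Illustrative model only: plain Python, not Pig's actual implementation.
# Fields are kept exactly as a ',' split would leave them, spaces included.
ROWS = [("1", " 2 "), ("2", " 3 "), ("3", "4"), ("4", "5")]


def lenient_int(atom):
    """Hypothetical cast that ignores surrounding spaces (None if not a number)."""
    try:
        return int(atom.strip())
    except ValueError:
        return None


def strict_int(atom):
    """Hypothetical cast for option (b): reject atoms with surrounding spaces."""
    if atom != atom.strip():
        raise ValueError(f"atom {atom!r} has leading/trailing spaces")
    return int(atom)


def filter_eq(rows, col, value):
    """Keep rows whose column equals `value` after the lenient numeric cast."""
    return [row for row in rows if lenient_int(row[col]) == value]


def cogroup(left, left_col, right, right_col):
    """Group both inputs by the raw, uncast key, as a string comparison would."""
    groups = {}
    for row in left:
        groups.setdefault(row[left_col], ([], []))[0].append(row)
    for row in right:
        groups.setdefault(row[right_col], ([], []))[1].append(row)
    return groups


if __name__ == "__main__":
    print(filter_eq(ROWS, 0, 2))   # filter a by x==2
    print(filter_eq(ROWS, 1, 2))   # filter b by y==2
    groups = cogroup(ROWS, 0, ROWS, 1)
    print(groups["2"])             # key from a.x
    print(groups[" 2 "])           # key from b.y, spaces preserved
```

In this model the keys '2' and ' 2 ' fall into separate groups, which mirrors the separate (2, ...) and ( 2 , ...) rows in the c2 dump. `strict_int` shows one way option (b) could behave: it raises an error for ' 2 ' instead of silently returning 2.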