Alan Gates
Mon, 09 Jun 2008 07:38:58 -0700
Alan.

pi song wrote:
In fact I plan to show warning messages if there are implicit castings in the plan. This has already been done by design.

On 6/6/08, Prashanth Pappu <[EMAIL PROTECTED]> wrote:
Also, if we are going to introduce data types, then perhaps we should unify the string and numerical comparison operators. I.e., a==b will do a numerical or string comparison based on the data types of a and b. And we won't need separate operators '==' and 'eq'. This is similar to SQL.

On Thu, Jun 5, 2008 at 12:00 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
I think that ' 2', when converted to an integer, should result in 2, not NULL or 0. That definitely seems like the most expected behavior.

Your point that cogroup and filter treat ' 2' differently is valid. Fortunately, the addition of types will handle that, mostly. If a user declares something to be an integer and then uses it as a cogroup key, it must be compared against something else declared as an integer (the key can't be a string on one side and an int on the other, because == isn't defined for string == int). So the following will do what you want:

A = load 'myfile' as (a: int, b: chararray);
B = load 'myotherfile' as (x: int, y: float);
C = cogroup A by a, B by x;
...
G = filter F by a == 2;

If 'myfile' contains
  2,this is a string
and 'myotherfile' contains
  2 ,3.141592654
then these tuples will match for the cogroup and they will match for the filter in G.

The one place where you'll still get differing results is if you do the following:

A = load 'myfile' as (a, b);
B = load 'myotherfile' as (x, y);
C = cogroup A by a, B by x;
...
G = filter F by a == 2;

In this case, in the cogroup, a and x will be compared as byte arrays because the user did not declare a type. But in the filter, a will be cast to an int because the user is comparing it to an int. If G is changed to

G = filter F by a == '2';

then the comparison with '2 ' will fail the filter.

Alan.

Prashanth Pappu wrote:
Explicit type definitions will definitely help.
But I still think that in this example -

grunt> b1 = filter b by y==2;
grunt> dump b1;
(1, 2 )

Either we should throw an error or not return any results. Because the '==' operator is implicitly stripping the spaces in ' 2 ' and determining that ' 2 ' == 2! (The result of (' 2 ' ==2) should be the same as, say, ('X2X' ==2).) And cogroup isn't removing the spaces, and hence ' 2 ' ne 2.

Overall, where we do not have explicit type declarations, the behavior of FILTER, COGROUP etc. should be consistent - either they all remove the spaces or none of them do. The rest, definitely, is the users' responsibility.

Prashanth

On Thu, Jun 5, 2008 at 5:47 AM, pi song <[EMAIL PROTECTED]> wrote:
The main theme in Pig development at the moment is developing a new type system, which means you will be able to attach typing information to your data. Basically, the problem you mention will not happen if you specify all the metadata. However, if you don't, this might happen again, and I think dealing with whitespace is the users' responsibility.

Do you think this will resolve your problem?

Pi

On Thu, Jun 5, 2008 at 9:37 AM, Prashanth Pappu <[EMAIL PROTECTED]> wrote:
I think there are a few unresolved issues due to the lack of explicit type declarations in PIG. Especially with data atoms that have leading or trailing spaces (' '), implicit typecasts into strings and integers/floats can lead to unexpected results.

Consider the following example -

grunt> a = load '/test' using PigStorage(',') as (x,y);
grunt> dump a;
(1, 2 )
(2, 3 )
(3, 4)
(4, 5)
grunt> b = load '/test' using PigStorage(',') as (x,y);
grunt> dump b;
(1, 2 )
(2, 3 )
(3, 4)
(4, 5)
grunt> a1 = filter a by x==2;
grunt> dump a1;
(2, 3 )
grunt> b1 = filter b by y==2;
grunt> dump b1;
(1, 2 )

So, both a and b have tuples that can be filtered with x==2 and y==2 respectively. But what do we get when we cogroup them?

grunt> c = cogroup a by (x) INNER, b by (y) INNER;
grunt> dump c;
[NOTHING!]
This is because COGROUP is treating x and y as strings, but FILTER was explicitly asked to treat them as integers/floats with the '==' comparator. This can be quite confusing, since COGROUP seems to be using the comparator 'eq' by default! Hence the following result when we cogroup without the keyword 'INNER' -

grunt> c2 = cogroup a by (x), b by (y);
grunt> dump c2;
(1, {(1, 2 )}, {})
(2, {(2, 3 )}, {})
(3, {(3, 4)}, {})
(4, {(4, 5)}, {(3, 4)})
(5, {}, {(4, 5)})
( 2 , {}, {(1, 2 )})
( 3 , {}, {(2, 3 )})

I think that we can avoid the confusion in one of two ways:

(a) Forcing the rule that all data atoms will be stripped of leading and trailing spaces (' '). This is important because data atoms have no explicit type declaration.

(b) Or at least throwing an error when a string with a leading/trailing space is typecast (implicitly or explicitly) to an integer or float. I.e., only the output of STRING(FLOAT) or STRING(INTEGER) should be allowed as the input of FLOAT(STRING) or INTEGER(STRING) conversions.

I see that, if unresolved, this can be an annoying problem, as fields in input files often tend to have leading/trailing spaces that are not part of the field separator.

Prashanth
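
For readers following the thread, the mismatch between FILTER and COGROUP can be modelled outside Pig. The Python sketch below is only a toy model under stated assumptions, not Pig's implementation: the helper names (to_int_lenient, to_int_strict, filter_eq, cogroup_inner) are hypothetical, and the data is reduced to the two space-padded rows from the example. It shows a numeric '==' matching after a lenient cast while a key comparison on the raw strings finds nothing, and what the stricter cast from option (b) might look like.

```python
# Toy model of the behaviour discussed in the thread; NOT Pig's implementation.
# Untyped fields are kept as raw strings, roughly as PigStorage(',') would load them.
# Only the two space-padded rows from the example are used here.
rows = [("1", " 2 "), ("2", " 3 ")]


def to_int_lenient(atom):
    """Assumed cast behind '==' against a number: surrounding spaces are ignored."""
    try:
        return int(atom.strip())
    except ValueError:
        return None  # not numeric at all, e.g. 'X2X'


def to_int_strict(atom):
    """Option (b): refuse to cast atoms with leading/trailing spaces."""
    if atom != atom.strip():
        raise ValueError(f"refusing to cast {atom!r}: leading/trailing space")
    return int(atom)


def filter_eq(rel, col, value):
    """Model of FILTER rel BY col == <number>: compare after a lenient cast."""
    return [t for t in rel if to_int_lenient(t[col]) == value]


def cogroup_inner(left, lcol, right, rcol):
    """Model of an INNER cogroup on untyped keys: keys compared as raw strings."""
    keys = {t[lcol] for t in left} & {t[rcol] for t in right}
    return {
        k: ([t for t in left if t[lcol] == k], [t for t in right if t[rcol] == k])
        for k in keys
    }


a1 = filter_eq(rows, 0, 2)           # x == 2 matches ("2", " 3 ")
b1 = filter_eq(rows, 1, 2)           # y == 2 matches ("1", " 2 ")
c = cogroup_inner(rows, 0, rows, 1)  # no shared key: "2" != " 2 " as strings
```

Under this model, applying option (a) (stripping every atom at load time) would make both operations agree, while option (b) would turn the silent match in b1 into an error.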