[ https://issues.apache.org/jira/browse/PIG-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gandul Azul updated PIG-947:
----------------------------

    Description:
PigStorage parser for bags is not working correctly when a tuple in a bag is preceded by a space. For example, the following is parsed correctly:

{(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

while this is not: (Note the space before the second tuple)

{(-5.243084,3.142401,0.000138,2.071200,0), (-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

It seems that when the parser encounters the space, it treats the rest of the line as a String. With a schema, this results in a typecast of string to databag, which raises an exception.

|WARN builtin.PigStorage: Unable to interpret value [...@2c9b42e6 in field being converted to type bag, caught ParseException <Encountered " <STRING> " "" at
|line 1, column 43.
|Was expecting:
|    "(" ...
|    > field discarded

Below is the parser debug output for the parsing of the above error sequence:
"2.071200,0), (" from above...

****** FOUND A <DOUBLENUMBER> MATCH (2.071200) ******
Call: AtomDatum
Consumed token: <<DOUBLENUMBER>: "2.071200" at line 1 column 31>
Return: AtomDatum
Return: Datum
Matched the empty string as <STRING> token.
Current character : , (44) at line 1 column 39
No more string literal token matches are possible.
Currently matched the first 1 characters as a "," token.
****** FOUND A "," MATCH (,) ******
Consumed token: <"," at line 1 column 39>
Call: Datum
Matched the empty string as <STRING> token.
Current character : 0 (48) at line 1 column 40
No string literal matches possible.
Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
Current character : 0 (48) at line 1 column 40
Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER>, <LONGINTEGER>, <FLOATNUMBER> }
Current character : ) (41) at line 1 column 41
Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
Putting back 1 characters into the input stream.
****** FOUND A <SIGNEDINTEGER> MATCH (0) ******
Call: AtomDatum
Consumed token: <<SIGNEDINTEGER>: "0" at line 1 column 40>
Return: AtomDatum
Return: Datum
Matched the empty string as <STRING> token.
Current character : ) (41) at line 1 column 41
No more string literal token matches are possible.
Currently matched the first 1 characters as a ")" token.
****** FOUND A ")" MATCH ()) ******
Return: Tuple
Consumed token: <")" at line 1 column 41>
Matched the empty string as <STRING> token.
Current character : , (44) at line 1 column 42
No more string literal token matches are possible.
Currently matched the first 1 characters as a "," token.
****** FOUND A "," MATCH (,) ******
Consumed token: <"," at line 1 column 42>
Matched the empty string as <STRING> token.
Current character :   (32) at line 1 column 43
No string literal matches possible.
Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
Current character :   (32) at line 1 column 43
Currently matched the first 1 characters as a <STRING> token.
Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
Current character : ( (40) at line 1 column 44
Currently matched the first 1 characters as a <STRING> token.
Putting back 1 characters into the input stream.
****** FOUND A <STRING> MATCH ( ) ******
Return: Bag
Return: Datum
Return: Parse

  was:
PigStorage parser for bags is not working correctly when a tuple in a bag is preceded by a space.
For example, the following is parsed correctly:

{(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

while this is not: (Note the space before the second tuple)

{(-5.243084,3.142401,0.000138,2.071200,0), (-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

It seems that when the parser encounters the space, it treats the rest of the line as a String. With a schema, this results in a typecast of string to databag, which raises an exception. As a result of this inconsistency, a bag written out with PigStorage cannot be loaded back with PigStorage.

|WARN builtin.PigStorage: Unable to interpret value [...@2c9b42e6 in field being converted to type bag, caught ParseException <Encountered " <STRING> " "" at
|line 1, column 43.
|Was expecting:
|    "(" ...
|    > field discarded

Below is the parser debug output for the parsing of the above error sequence:
"2.071200,0), (" from above...

****** FOUND A <DOUBLENUMBER> MATCH (2.071200) ******
Call: AtomDatum
Consumed token: <<DOUBLENUMBER>: "2.071200" at line 1 column 31>
Return: AtomDatum
Return: Datum
Matched the empty string as <STRING> token.
Current character : , (44) at line 1 column 39
No more string literal token matches are possible.
Currently matched the first 1 characters as a "," token.
****** FOUND A "," MATCH (,) ******
Consumed token: <"," at line 1 column 39>
Call: Datum
Matched the empty string as <STRING> token.
Current character : 0 (48) at line 1 column 40
No string literal matches possible.
Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
Current character : 0 (48) at line 1 column 40
Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER>, <LONGINTEGER>, <FLOATNUMBER> }
Current character : ) (41) at line 1 column 41
Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
Putting back 1 characters into the input stream.
****** FOUND A <SIGNEDINTEGER> MATCH (0) ******
Call: AtomDatum
Consumed token: <<SIGNEDINTEGER>: "0" at line 1 column 40>
Return: AtomDatum
Return: Datum
Matched the empty string as <STRING> token.
Current character : ) (41) at line 1 column 41
No more string literal token matches are possible.
Currently matched the first 1 characters as a ")" token.
****** FOUND A ")" MATCH ()) ******
Return: Tuple
Consumed token: <")" at line 1 column 41>
Matched the empty string as <STRING> token.
Current character : , (44) at line 1 column 42
No more string literal token matches are possible.
Currently matched the first 1 characters as a "," token.
****** FOUND A "," MATCH (,) ******
Consumed token: <"," at line 1 column 42>
Matched the empty string as <STRING> token.
Current character :   (32) at line 1 column 43
No string literal matches possible.
Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
Current character :   (32) at line 1 column 43
Currently matched the first 1 characters as a <STRING> token.
Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
Current character : ( (40) at line 1 column 44
Currently matched the first 1 characters as a <STRING> token.
Putting back 1 characters into the input stream.
****** FOUND A <STRING> MATCH ( ) ******
Return: Bag
Return: Datum
Return: Parse


> Parsing Bags by PigStorage is not handled correctly if whitespace before start of tuple.
> ----------------------------------------------------------------------------------------
>
> Key: PIG-947
> URL: https://issues.apache.org/jira/browse/PIG-947
> Project: Pig
> Issue Type: Bug
> Components: data
> Environment: Pig on Hadoop 18
> Reporter: Gandul Azul
>
> PigStorage parser for bags is not working correctly when a tuple in a bag is preceded by a space. For example, the following is parsed correctly:
> {(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}
> while this is not: (Note the space before the second tuple)
> {(-5.243084,3.142401,0.000138,2.071200,0), (-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}
> It seems that when the parser encounters the space, it treats the rest of the line as a String. With a schema, this results in a typecast of string to databag, which raises an exception.
> |WARN builtin.PigStorage: Unable to interpret value [...@2c9b42e6 in field being converted to type bag, caught ParseException <Encountered " <STRING> " "" at
> |line 1, column 43.
> |Was expecting:
> |    "(" ...
> |    > field discarded
> Below is the parser debug output for the parsing of the above error sequence:
> "2.071200,0), (" from above...
> ****** FOUND A <DOUBLENUMBER> MATCH (2.071200) ******
> Call: AtomDatum
> Consumed token: <<DOUBLENUMBER>: "2.071200" at line 1 column 31>
> Return: AtomDatum
> Return: Datum
> Matched the empty string as <STRING> token.
> Current character : , (44) at line 1 column 39
> No more string literal token matches are possible.
> Currently matched the first 1 characters as a "," token.
> ****** FOUND A "," MATCH (,) ******
> Consumed token: <"," at line 1 column 39>
> Call: Datum
> Matched the empty string as <STRING> token.
> Current character : 0 (48) at line 1 column 40
> No string literal matches possible.
> Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
> Current character : 0 (48) at line 1 column 40
> Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
> Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER>, <LONGINTEGER>, <FLOATNUMBER> }
> Current character : ) (41) at line 1 column 41
> Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
> Putting back 1 characters into the input stream.
> ****** FOUND A <SIGNEDINTEGER> MATCH (0) ******
> Call: AtomDatum
> Consumed token: <<SIGNEDINTEGER>: "0" at line 1 column 40>
> Return: AtomDatum
> Return: Datum
> Matched the empty string as <STRING> token.
> Current character : ) (41) at line 1 column 41
> No more string literal token matches are possible.
> Currently matched the first 1 characters as a ")" token.
> ****** FOUND A ")" MATCH ()) ******
> Return: Tuple
> Consumed token: <")" at line 1 column 41>
> Matched the empty string as <STRING> token.
> Current character : , (44) at line 1 column 42
> No more string literal token matches are possible.
> Currently matched the first 1 characters as a "," token.
> ****** FOUND A "," MATCH (,) ******
> Consumed token: <"," at line 1 column 42>
> Matched the empty string as <STRING> token.
> Current character :   (32) at line 1 column 43
> No string literal matches possible.
> Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
> Current character :   (32) at line 1 column 43
> Currently matched the first 1 characters as a <STRING> token.
> Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
> Current character : ( (40) at line 1 column 44
> Currently matched the first 1 characters as a <STRING> token.
> Putting back 1 characters into the input stream.
> ****** FOUND A <STRING> MATCH ( ) ******
> Return: Bag
> Return: Datum
> Return: Parse

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
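
For anyone wanting to reproduce the expected behaviour outside Pig, here is a minimal Python sketch of a bag parser that skips whitespace before every token, so that a space before a tuple does not change the result. It is an illustration only and is not Pig's parser: the function name `parse_bag` and its helpers are hypothetical, fields are kept as plain strings rather than typed values, and nested bags or quoted strings are not handled.

```python
def parse_bag(text):
    """Parse '{(a,b),(c,d)}' into a list of tuples of strings.

    Hypothetical helper for illustration; whitespace before any
    token is skipped instead of being treated as part of a value.
    """
    pos = 0

    def skip_ws():
        # Advance past spaces and tabs at the current position.
        nonlocal pos
        while pos < len(text) and text[pos] in " \t":
            pos += 1

    def peek():
        # Return the next non-whitespace character, or "" at end of input.
        skip_ws()
        return text[pos] if pos < len(text) else ""

    def expect(ch):
        # Consume exactly one expected delimiter, ignoring leading whitespace.
        nonlocal pos
        if peek() != ch:
            raise ValueError(f"expected {ch!r} at column {pos + 1}")
        pos += 1

    def parse_atom():
        # Read characters up to the next delimiter and trim the result.
        nonlocal pos
        skip_ws()
        start = pos
        while pos < len(text) and text[pos] not in ",(){}":
            pos += 1
        return text[start:pos].strip()

    def parse_tuple():
        expect("(")
        fields = [parse_atom()]
        while peek() == ",":
            expect(",")
            fields.append(parse_atom())
        expect(")")
        return tuple(fields)

    expect("{")
    tuples = []
    if peek() != "}":
        tuples.append(parse_tuple())
        while peek() == ",":
            expect(",")
            tuples.append(parse_tuple())
    expect("}")
    if peek() != "":
        raise ValueError(f"unexpected trailing input at column {pos + 1}")
    return tuples


compact = "{(1.5,2),(3.5,4)}"
spaced = "{(1.5,2), (3.5,4)}"  # note the space before the second tuple
print(parse_bag(compact) == parse_bag(spaced))
```

The design point is that whitespace is consumed in one place (`skip_ws`, called from `peek` and `parse_atom`) before the parser decides what kind of token comes next, which is the step that appears to be missing at column 43 in the trace above.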