[ 
https://issues.apache.org/jira/browse/PIG-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gandul Azul updated PIG-947:
----------------------------

    Description: 
The PigStorage parser for bags does not work correctly when a tuple in a bag
is preceded by a space. For example, the following is parsed correctly:

{(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

while this is not (note the space before the second tuple):
{(-5.243084,3.142401,0.000138,2.071200,0), (-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

It seems that the parser, when it encounters the space, treats the rest of
the line as a string. With a schema, this results in a cast from string to
databag, which results in an exception.
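
The behaviour described above can be reproduced with a toy parser. The
following Python sketch is not Pig's JavaCC grammar; it only assumes that the
lexer does not skip whitespace, so a space surfaces as a STRING token where
"(" is expected. The names tokenize, parse_bag and skip_whitespace are
hypothetical and exist only for this illustration.

```python
import re

# Toy token kinds, loosely named after those in the parser's messages.
# This is an illustrative assumption, not Pig's real grammar.
TOKEN_RE = re.compile(r"""
    (?P<PUNCT>[{}(),])
  | (?P<NUMBER>[-+]?\d+(?:\.\d+)?)
  | (?P<STRING>[^{}(),]+)
""", re.VERBOSE)


def tokenize(text, skip_whitespace):
    """Yield (kind, value, column) triples; columns are 1-based."""
    pos = 0
    while pos < len(text):
        if skip_whitespace and text[pos].isspace():
            pos += 1
            continue
        match = TOKEN_RE.match(text, pos)  # always matches at least one char
        yield match.lastgroup, match.group(), pos + 1
        pos = match.end()


def parse_bag(text, skip_whitespace=False):
    """Parse '{(1,2),(3,4)}' into a list of tuples of floats."""
    tokens = list(tokenize(text, skip_whitespace))
    tokens.append(("EOF", "", len(text) + 1))
    pos = 0

    def take(kind, value=None):
        nonlocal pos
        k, v, col = tokens[pos]
        if k != kind or (value is not None and v != value):
            want = value if value is not None else kind
            raise ValueError(
                f"encountered {k} {v!r} at column {col}, was expecting {want!r}"
            )
        pos += 1
        return v

    def at(value):
        return tokens[pos][:2] == ("PUNCT", value)

    def parse_tuple():
        take("PUNCT", "(")
        fields = [float(take("NUMBER"))]
        while at(","):
            take("PUNCT", ",")
            fields.append(float(take("NUMBER")))
        take("PUNCT", ")")
        return tuple(fields)

    take("PUNCT", "{")
    bag = [parse_tuple()]
    while at(","):
        take("PUNCT", ",")
        bag.append(parse_tuple())  # fails here if a space precedes "("
    take("PUNCT", "}")
    take("EOF")
    return bag
```

With skip_whitespace=False, the input containing ", (" after the first tuple
raises an error at column 43, the position of the space; with
skip_whitespace=True the same input parses to the same bag as the version
without the space.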

|WARN builtin.PigStorage: Unable to interpret value [...@2c9b42e6 in field 
being converted to type bag, caught ParseException <Encountered " <STRING> "  
"" at |line 1, column 43.
|Was expecting:
|    "(" ...
|    > field discarded


Below is the parser debug output for the failing sequence
"2.071200,0), (" from the input above:

****** FOUND A <DOUBLENUMBER> MATCH (2.071200) ******

          Call:   AtomDatum
            Consumed token: <<DOUBLENUMBER>: "2.071200" at line 1 column 31>
          Return: AtomDatum
        Return: Datum
   Matched the empty string as <STRING> token.
Current character : , (44) at line 1 column 39
   No more string literal token matches are possible.
   Currently matched the first 1 characters as a "," token.
****** FOUND A "," MATCH (,) ******

        Consumed token: <"," at line 1 column 39>
        Call:   Datum
   Matched the empty string as <STRING> token.
Current character : 0 (48) at line 1 column 40
   No string literal matches possible.
   Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
Current character : 0 (48) at line 1 column 40
   Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
   Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, 
<DOUBLENUMBER>, <LONGINTEGER>, 
     <FLOATNUMBER> }
Current character : ) (41) at line 1 column 41
   Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
   Putting back 1 characters into the input stream.
****** FOUND A <SIGNEDINTEGER> MATCH (0) ******

          Call:   AtomDatum
            Consumed token: <<SIGNEDINTEGER>: "0" at line 1 column 40>
          Return: AtomDatum
        Return: Datum
   Matched the empty string as <STRING> token.
Current character : ) (41) at line 1 column 41
   No more string literal token matches are possible.
   Currently matched the first 1 characters as a ")" token.
****** FOUND A ")" MATCH ()) ******

      Return: Tuple
      Consumed token: <")" at line 1 column 41>
   Matched the empty string as <STRING> token.
Current character : , (44) at line 1 column 42
   No more string literal token matches are possible.
   Currently matched the first 1 characters as a "," token.
****** FOUND A "," MATCH (,) ******

      Consumed token: <"," at line 1 column 42>
   Matched the empty string as <STRING> token.
Current character :   (32) at line 1 column 43
   No string literal matches possible.
   Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
Current character :   (32) at line 1 column 43
   Currently matched the first 1 characters as a <STRING> token.
   Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, 
<DOUBLENUMBER> }
Current character : ( (40) at line 1 column 44
   Currently matched the first 1 characters as a <STRING> token.
   Putting back 1 characters into the input stream.
****** FOUND A <STRING> MATCH ( ) ******

    Return: Bag
  Return: Datum
Return: Parse
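
The issue itself does not propose a workaround. As a possible stopgap, and
purely as an assumption, spaces between a closing ")" and the next "(" could
be removed from a bag field before it is handed to the loader. The helper
name normalize_bag_field below is hypothetical, and the sketch only covers
the ", (" case shown in the trace; it would also alter string data that
happens to contain the same character sequence.

```python
import re

# Matches "),", one or more spaces, then "(": the gap between two tuples.
_TUPLE_GAP = re.compile(r"\), +\(")


def normalize_bag_field(field):
    """Remove spaces between consecutive tuples in a serialized bag."""
    return _TUPLE_GAP.sub("),(", field)
```

For example, normalize_bag_field("{(1,2), (3,4)}") returns "{(1,2),(3,4)}",
and a field without such spaces is returned unchanged.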



  was:
The PigStorage parser for bags does not work correctly when a tuple in a bag
is preceded by a space. For example, the following is parsed correctly:

{(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

while this is not (note the space before the second tuple):
{(-5.243084,3.142401,0.000138,2.071200,0), (-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

It seems that the parser, when it encounters the space, treats the rest of
the line as a string. With a schema, this results in a cast from string to
databag, which results in an exception. Because of this inconsistency, a bag
written out with PigStorage cannot be loaded back using PigStorage.

|WARN builtin.PigStorage: Unable to interpret value [...@2c9b42e6 in field 
being converted to type bag, caught ParseException <Encountered " <STRING> "  
"" at |line 1, column 43.
|Was expecting:
|    "(" ...
|    > field discarded


Below is the parser debug output for the failing sequence
"2.071200,0), (" from the input above:

****** FOUND A <DOUBLENUMBER> MATCH (2.071200) ******

          Call:   AtomDatum
            Consumed token: <<DOUBLENUMBER>: "2.071200" at line 1 column 31>
          Return: AtomDatum
        Return: Datum
   Matched the empty string as <STRING> token.
Current character : , (44) at line 1 column 39
   No more string literal token matches are possible.
   Currently matched the first 1 characters as a "," token.
****** FOUND A "," MATCH (,) ******

        Consumed token: <"," at line 1 column 39>
        Call:   Datum
   Matched the empty string as <STRING> token.
Current character : 0 (48) at line 1 column 40
   No string literal matches possible.
   Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
Current character : 0 (48) at line 1 column 40
   Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
   Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, 
<DOUBLENUMBER>, <LONGINTEGER>, 
     <FLOATNUMBER> }
Current character : ) (41) at line 1 column 41
   Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
   Putting back 1 characters into the input stream.
****** FOUND A <SIGNEDINTEGER> MATCH (0) ******

          Call:   AtomDatum
            Consumed token: <<SIGNEDINTEGER>: "0" at line 1 column 40>
          Return: AtomDatum
        Return: Datum
   Matched the empty string as <STRING> token.
Current character : ) (41) at line 1 column 41
   No more string literal token matches are possible.
   Currently matched the first 1 characters as a ")" token.
****** FOUND A ")" MATCH ()) ******

      Return: Tuple
      Consumed token: <")" at line 1 column 41>
   Matched the empty string as <STRING> token.
Current character : , (44) at line 1 column 42
   No more string literal token matches are possible.
   Currently matched the first 1 characters as a "," token.
****** FOUND A "," MATCH (,) ******

      Consumed token: <"," at line 1 column 42>
   Matched the empty string as <STRING> token.
Current character :   (32) at line 1 column 43
   No string literal matches possible.
   Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
Current character :   (32) at line 1 column 43
   Currently matched the first 1 characters as a <STRING> token.
   Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, 
<DOUBLENUMBER> }
Current character : ( (40) at line 1 column 44
   Currently matched the first 1 characters as a <STRING> token.
   Putting back 1 characters into the input stream.
****** FOUND A <STRING> MATCH ( ) ******

    Return: Bag
  Return: Datum
Return: Parse




> Parsing Bags by PigStorage is not handled correctly if whitespace before 
> start of tuple.
> ----------------------------------------------------------------------------------------
>
>                 Key: PIG-947
>                 URL: https://issues.apache.org/jira/browse/PIG-947
>             Project: Pig
>          Issue Type: Bug
>          Components: data
>         Environment: Pig on Hadoop 18
>            Reporter: Gandul Azul
