[ 
https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862415#action_12862415
 ] 

Pradeep Kamath commented on PIG-1339:
-------------------------------------

I looked into this further; here are my observations.

Grunt actually parses the Unicode characters as ASCII characters (my suspicion
is that this is due to JLine's ConsoleReader treating its input as ASCII). So
although Grunt is able to process the script, it interprets the column name as
whatever the equivalent ASCII representation comes out as, which means it is
not really handling it correctly (in the above case, the column name becomes
something like BDFGH).

The reason ASCII characters work is that the column name is matched by the
IDENTIFIER token, which only accepts ASCII characters. This production cannot
easily be extended to handle non-ASCII characters, nor can a new token
(COLNAME?) be introduced to allow non-ASCII characters along the lines of
QUOTEDSTRING. In QUOTEDSTRING, non-ASCII characters are allowed only within
enclosing single quotes. Here we need the same behavior, but only within a
schema specification - "as (colname....)". Without that restriction, most
input would match the new token. Unfortunately, this context exists in the
parser rather than in the TokenManager, and it is the TokenManager that has
the concept of lexical states
(http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq-ie.htm#tth_sEc3.11).
Switching lexical states from within the parser is considered unsafe because
the TokenManager is looking ahead of where the parser is
(http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq-ie.htm#tth_sEc3.12). So at
this point I am not clear on what the approach to this issue should be.
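
To make the two observations above concrete, here is a minimal standalone
sketch in plain Java. It is not Pig or JLine code. The class name, the method
names, and the ASCII regular expression are assumptions chosen for
illustration; the regex only approximates what an ASCII-only IDENTIFIER token
accepts. The low-byte truncation is one mechanism that would produce a result
of the "BDFGH" shape; whether JLine actually does this is not confirmed here.

```java
import java.util.regex.Pattern;

public class IdentifierDemo {
    // Assumed stand-in for an ASCII-only IDENTIFIER token:
    // a letter followed by letters, digits, or underscores.
    private static final Pattern ASCII_IDENTIFIER =
            Pattern.compile("[A-Za-z][A-Za-z0-9_]*");

    // True if the whole string matches the ASCII-only pattern.
    static boolean isAsciiIdentifier(String s) {
        return ASCII_IDENTIFIER.matcher(s).matches();
    }

    // Hypothetical Unicode-aware check: first code point must be a letter,
    // the rest letters, digits, or underscores (per java.lang.Character).
    static boolean isUnicodeIdentifier(String s) {
        if (s.isEmpty() || !Character.isLetter(s.codePointAt(0))) {
            return false;
        }
        return s.codePoints()
                .allMatch(cp -> Character.isLetterOrDigit(cp) || cp == '_');
    }

    // Keeps only the low 8 bits of each char, mimicking a reader that
    // treats every character as a single byte (an assumption).
    static String lowByteOnly(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append((char) (c & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The column name from the report, written with Unicode escapes so
        // the source file encoding does not matter.
        String name = "\u3042\u3044\u3046\u3048\u304A";
        System.out.println(isAsciiIdentifier(name));    // rejected by the ASCII pattern
        System.out.println(isUnicodeIdentifier(name));  // accepted by the Unicode-aware check
        System.out.println(lowByteOnly(name));          // an ASCII string of the same length
    }
}
```

The truncated name is itself a valid ASCII identifier, which would explain why
Grunt appears to accept the script while silently using a different column
name. Note that a check like isUnicodeIdentifier only shows what a wider token
would accept; it does not address the lexical-state problem described above.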

If someone with better JavaCC knowledge knows how to do parser-context-based
tokenizing, please give this issue a try and update it with the results.

> International characters in column names not supported
> ------------------------------------------------------
>
>                 Key: PIG-1339
>                 URL: https://issues.apache.org/jira/browse/PIG-1339
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0, 0.7.0, 0.8.0
>            Reporter: Viraj Bhat
>
> There is a particular use case in which someone specifies a column name in 
> international characters.
> {code}
> inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
> describe inputdata;
> dump inputdata;
> {code}
> ======================================================================================================
> Pig Stack Trace
> ---------------
> ERROR 1000: Error during parsing. Lexical error at line 1, column 64.  Encountered: "\u3042" (12354), after : ""
> org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, column 64.  Encountered: "\u3042" (12354), after : ""
>         at org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
>         at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
>         at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
>         at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
>         at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
>         at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
>         at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
>         at org.apache.pig.Main.main(Main.java:391)
> ======================================================================================================
> Thanks Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
