[
https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862415#action_12862415
]
Pradeep Kamath commented on PIG-1339:
-------------------------------------
I looked into this more and here are my observations:
Grunt actually parses the Unicode characters as ASCII characters (my suspicion
is that this is due to JLine's ConsoleReader treating its input as ASCII). So
although Grunt is able to process the script, it interprets the column name as
whatever the equivalent ASCII representation comes out as. It is therefore not
really handling the name correctly (in the above case, the column name becomes
something like BDFGH).
The reason ASCII characters work is that the column name is matched by the
IDENTIFIER token, which only accepts ASCII characters. This production cannot
easily be extended to handle non-ASCII characters, nor can a new token
(COLNAME?) simply be added to allow non-ASCII characters along the lines of
QUOTEDSTRING. In QUOTEDSTRING, non-ASCII characters are allowed only within
the context of enclosing single quotes. Here we need the same functionality,
but only in the context of a schema specification, "as (colname....)".
Without that restriction, most input would match the new token.
Unfortunately this context exists in the parser rather than in the
TokenManager, and it is the TokenManager that has the concept of lexical states
(http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq-ie.htm#tth_sEc3.11).
Switching lexical states from within the parser is considered unsafe, since
the TokenManager may be looking ahead of where the parser currently is
(http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq-ie.htm#tth_sEc3.12). So at
this point I am not clear what the approach to this issue should be.
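To make the two observations concrete, here is a small standalone Java sketch.
It is not Pig or JLine code. The regular expression is only an assumed
stand-in for an ASCII-only IDENTIFIER rule, and truncateToLowByte is a
hypothetical model of the suspected console behaviour (keeping only the low
byte of each character), which has not been confirmed against JLine.

```java
import java.util.regex.Pattern;

public class AsciiIdentifierDemo {
    // Assumption: an ASCII-only identifier rule, similar in spirit to IDENTIFIER.
    static final Pattern ASCII_IDENTIFIER =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static boolean isAsciiIdentifier(String s) {
        return ASCII_IDENTIFIER.matcher(s).matches();
    }

    // Hypothetical model of the mangling: keep only the low 8 bits of each
    // UTF-16 code unit, as an ASCII-oriented reader might do.
    static String truncateToLowByte(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append((char) (s.charAt(i) & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String col = "\u3042\u3044\u3046\u3048\u304A"; // あいうえお
        System.out.println(isAsciiIdentifier(col));     // false
        String mangled = truncateToLowByte(col);
        System.out.println(mangled);                    // BDFHJ
        System.out.println(isAsciiIdentifier(mangled)); // true
    }
}
```

Under this assumption the mangled name comes out as BDFHJ, which is an
ordinary ASCII identifier. That would be consistent with Grunt accepting the
script while silently using the wrong column name.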
If someone with better JavaCC knowledge knows how to do parser-context-based
tokenizing, please give this issue a try and update it with the results.
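As a starting point for discussion, the idea of context-dependent tokenizing
can be sketched with a small hand-written lexer in plain Java. This is not
JavaCC and not a tested fix for Pig. The class, the state names, and the
COLNAME rule are all hypothetical, and nested schemas, types, and quoted
strings are ignored. The point is only that the lexer itself enters a SCHEMA
state after seeing "as" followed by "(", so the parser never has to switch
states behind the lexer's back.

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaStateLexer {
    enum State { DEFAULT, SCHEMA }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.DEFAULT;
        boolean sawAs = false; // true if the previous token was the keyword "as"
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if ("(),;=".indexOf(c) >= 0) {
                // The lexer changes its own state on "as (" and on ")".
                if (c == '(' && sawAs) state = State.SCHEMA;
                if (c == ')') state = State.DEFAULT;
                tokens.add(String.valueOf(c));
                sawAs = false;
                i++;
                continue;
            }
            int start = i;
            while (i < input.length() && isNamePart(input.charAt(i), state)) i++;
            if (i == start) {
                throw new IllegalArgumentException(
                        "Lexical error at column " + (start + 1));
            }
            String word = input.substring(start, i);
            tokens.add((state == State.SCHEMA ? "COLNAME:" : "IDENTIFIER:") + word);
            sawAs = state == State.DEFAULT && word.equalsIgnoreCase("as");
        }
        return tokens;
    }

    // DEFAULT accepts ASCII name characters only; SCHEMA also accepts any
    // Unicode letter or digit.
    static boolean isNamePart(char c, State state) {
        boolean ascii = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '_';
        return ascii || (state == State.SCHEMA && Character.isLetterOrDigit(c));
    }

    public static void main(String[] args) {
        // Non-ASCII names are accepted only inside "as ( ... )".
        System.out.println(tokenize("data as (\u3042\u3044, b)"));
    }
}
```

With this sketch, the same non-ASCII name outside of "as ( ... )" still
produces a lexical error, which is the restriction asked for above. Whether an
equivalent set of lexical-state transitions can be expressed safely in Pig's
actual grammar is exactly the open question.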
> International characters in column names not supported
> ------------------------------------------------------
>
> Key: PIG-1339
> URL: https://issues.apache.org/jira/browse/PIG-1339
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.6.0, 0.7.0, 0.8.0
> Reporter: Viraj Bhat
>
> There is a use case in which a column name is specified in international
> (non-ASCII) characters.
> {code}
> inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
> describe inputdata;
> dump inputdata;
> {code}
> ======================================================================================================
> Pig Stack Trace
> ---------------
> ERROR 1000: Error during parsing. Lexical error at line 1, column 64. Encountered: "\u3042" (12354), after : ""
> org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, column 64. Encountered: "\u3042" (12354), after : ""
> at org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
> at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
> at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
> at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
> at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
> at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
> at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
> at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
> at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
> at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
> at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
> at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
> at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
> at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
> at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
> at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
> at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
> at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
> at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:391)
> ======================================================================================================
> Thanks Viraj