[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703007#action_12703007
 ] 

Daniel Dai commented on PIG-774:
--------------------------------

Currently, JLine does not handle backspace correctly for multibyte characters. 
When we hit backspace on an OS that uses UTF-8 encoding, only part of the 
character is deleted. If the OS encoding is native, the situation is even worse: 
JLine throws an exception when a multibyte character is entered. This problem is 
inherent in JLine, and all applications that use JLine share it. I will try to 
fix it in JLine; however, fixing it is outside the scope of Pig. So, for now, we 
will have to live with these problems:

# Multibyte character input is not supported if the OS encoding is native
# Backspace handling is incorrect if the line contains multibyte characters and 
the OS encoding is UTF-8

Interestingly, JLine works fine under Cygwin. The above problems occur only on 
Unix.
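
The partial-deletion symptom can be reproduced outside JLine. The sketch below is illustrative only: it is not JLine code, the variable names are hypothetical, and it assumes the symptom comes from removing a single byte rather than a whole character. It compares that byte-oriented backspace with a character-oriented one on a UTF-8 line.

```python
# Illustrative only: this is not JLine code, and the names below are hypothetical.
line = "ab歌"                      # '歌' occupies three bytes in UTF-8
raw = line.encode("utf-8")         # b'ab\xe6\xad\x8c'

# A byte-oriented backspace removes one byte and strands a partial sequence.
after_byte_backspace = raw[:-1]
try:
    after_byte_backspace.decode("utf-8")
    is_valid = True
except UnicodeDecodeError:
    is_valid = False               # the remaining bytes are not valid UTF-8

# A character-oriented backspace removes the whole code point.
after_char_backspace = line[:-1].encode("utf-8")

print(len(raw), is_valid, after_char_backspace)
```

Under this assumption, a fix would need to track character boundaries rather than byte counts when editing the line buffer.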

> Pig does not handle Chinese characters (in both the parameter substitution 
> using -param_file or embedded in the Pig script) correctly
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-774
>                 URL: https://issues.apache.org/jira/browse/PIG-774
>             Project: Pig
>          Issue Type: Bug
>          Components: grunt, impl
>    Affects Versions: 0.0.0
>            Reporter: Viraj Bhat
>            Priority: Critical
>             Fix For: 0.0.0
>
>         Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, 
> utf8_parser-1.patch
>
>
> I created a very small test case in which I did the following.
> 1) Created a UTF-8 file which contained a query string in Chinese and wrote 
> it to HDFS. I used this dfs file as an input for the tests.
> 2) Created a parameter file which also contained the same query string as in 
> Step 1.
> 3) Created a Pig script which takes in the parameterized query string and a 
> hard-coded Chinese string.
> ================================================================
> Pig script: chinese_data.pig
> ================================================================
> {code}
> rmf chineseoutput;
> I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
> J = filter I by $0 == '$querystring';
> --J = filter I by $0 == ' 歌手    香港情牽女人心演唱會';
> store J into 'chineseoutput';
> dump J;
> {code}
> =================================================================
> Parameter file: nextgen_paramfile
> =================================================================
> queryid=20090311
> querystring='   歌手    香港情牽女人心演唱會'
> =================================================================
> Input file: /user/viraj/chinese.txt
> =================================================================
> shell$ hadoop fs -cat /user/viraj/chinese.txt
>         歌手    香港情牽女人心演唱會
> =================================================================
> I ran the above set of inputs in the following ways:
> Run 1:
> =================================================================
> {code}
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
> {code}
> =================================================================
> 2009-04-22 01:31:35,703 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2009-04-22 01:31:40,700 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2009-04-22 01:31:50,720 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2009-04-22 01:31:50,720 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
> =================================================================
> Run 2: removed the parameter substitution from the Pig script and instead 
> used the following statement.
> =================================================================
> {code}
> J = filter I by $0 == ' 歌手    香港情牽女人心演唱會';
> {code}
> =================================================================
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig
> =================================================================
> 2009-04-22 01:35:22,402 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2009-04-22 01:35:27,399 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2009-04-22 01:35:32,415 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2009-04-22 01:35:32,415 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
> =================================================================
> In both cases:
> =================================================================
> {code}
> shell $ hadoop fs -ls /user/viraj/chineseoutput
> Found 2 items
> drwxr-xr-x   - viraj supergroup          0 2009-04-22 01:37 /user/viraj/chineseoutput/_logs
> -rw-r--r--   3 viraj supergroup          0 2009-04-22 01:37 /user/viraj/chineseoutput/part-00000
> {code}
> =================================================================
> Additionally tried the dry-run option to figure out if the parameter 
> substitution was occurring properly.
> =================================================================
> {code}
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
> org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig
> {code}
> =================================================================
> {code}
> shell$ file chinese_data.pig.substituted 
> chinese_data.pig.substituted: ASCII text
> shell$ cat chinese_data.pig.substituted 
> {code}
> {code}
> rmf chineseoutput;
> I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
> J = filter I by $0 == ' ??????  ??????????????????????????????';
> store J into 'chineseoutput';
> {code}
> =================================================================
> This issue has to do with the parser not handling UTF-8 data. 
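
The substituted script shows six question marks where the two-character 歌手 stood, which is consistent with each UTF-8 byte being handled as a separate character. The sketch below reproduces that pattern under this assumption; `ascii_only_read` is a hypothetical helper written for illustration and is not Pig's parser code.

```python
def ascii_only_read(raw: bytes) -> str:
    """Hypothetical reader that treats every byte as one ASCII character."""
    # Bytes outside the ASCII range cannot be represented, so each becomes '?'.
    return "".join(chr(b) if b < 0x80 else "?" for b in raw)

query = "歌手"
raw = query.encode("utf-8")          # two characters, three bytes each

mangled = ascii_only_read(raw)       # one '?' per non-ASCII byte
restored = raw.decode("utf-8")       # decoding as UTF-8 keeps the text intact

print(len(raw), mangled, restored == query)
```

If this is indeed the cause, reading the script and the parameter file with an explicit UTF-8 decoder, rather than byte by byte, would preserve the query string.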

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
