[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Bhat updated PIG-774:
---------------------------

Description:
I created a very small test case in which I did the following.
1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this DFS file as the input for the tests.
2) Created a parameter file which also contained the same query string as in Step 1.
3) Created a Pig script that takes the parameterized query string and also contains the same Chinese string hard-coded.
=================================================================
Pig script: chinese_data.pig
=================================================================
{code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
J = filter I by $0 == '$querystring';
--J = filter I by $0 == ' 歌手 香港情牽女人心演唱會';
store J into 'chineseoutput';
dump J;
{code}
=================================================================
Parameter file: nextgen_paramfile
=================================================================
queryid=20090311
querystring=' 歌手 香港情牽女人心演唱會'
=================================================================
Input file: /user/viraj/chinese.txt
=================================================================
shell$ hadoop fs -cat /user/viraj/chinese.txt
歌手 香港情牽女人心演唱會
=================================================================
I ran the above set of inputs in the following ways:

Run 1:
=================================================================
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
{code}
=================================================================
2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
=================================================================
Run 2: removed the parameter substitution from the Pig script and used the following statement instead.
=================================================================
{code}
J = filter I by $0 == ' 歌手 香港情牽女人心演唱會';
{code}
=================================================================
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig
=================================================================
2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
=================================================================
In both cases:
=================================================================
{code}
shell$ hadoop fs -ls /user/viraj/chineseoutput
Found 2 items
drwxr-xr-x - viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/_logs
-rw-r--r-- 3 viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/part-00000
{code}
=================================================================
Additionally tried the dry-run option to figure out if the parameter substitution was occurring properly.
=================================================================
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig
{code}
=================================================================
{code}
shell$ file chinese_data.pig.substituted
chinese_data.pig.substituted: ASCII text
shell$ cat chinese_data.pig.substituted
{code}
{code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
J = filter I by $0 == ' ?????? ??????????????????????????????';
store J into 'chineseoutput';
{code}
=================================================================
This issue has to do with the parser not handling UTF-8 data.

> Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-774
>                 URL: https://issues.apache.org/jira/browse/PIG-774
>             Project: Pig
>          Issue Type: Bug
>          Components: grunt, impl
>    Affects Versions: 0.0.0
>            Reporter: Viraj Bhat
>            Priority: Critical
>             Fix For: 0.0.0
>
>         Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile
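The substituted script in the report is identified as ASCII text, and the two-character word 歌手 appears there as six question marks. That is consistent with each UTF-8 byte being treated as a separate character and then written through an ASCII encoder, although the report does not confirm that this is what Pig does internally. The following Python sketch only reproduces the symptom under that assumption; `mangle_like_ascii_writer` is a hypothetical helper, not part of Pig.

```python
def mangle_like_ascii_writer(text: str) -> str:
    """Hypothetical helper: mimic a read/write path that ignores UTF-8."""
    # Bytes as they would be stored in a UTF-8 parameter file or script.
    raw = text.encode("utf-8")
    # Assumption: the reader maps every byte to one character (Latin-1 style).
    per_byte = raw.decode("latin-1")
    # Assumption: the writer emits ASCII and replaces unmappable characters.
    return per_byte.encode("ascii", errors="replace").decode("ascii")


word = "歌手"
mangled = mangle_like_ascii_writer(word)

print(len(word.encode("utf-8")))  # 6 (three UTF-8 bytes per character)
print(mangled)                    # ??????
print(mangled == word)            # False
```

Because the mangled literal no longer equals the original string, an equality filter against correctly decoded input would match no rows, which would fit the empty `part-00000` file shown above.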