[
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702620#action_12702620
]
Viraj Bhat commented on PIG-774:
--------------------------------
One workaround for this issue is using the FilterFunc, which reads its filter
list from a file written on HDFS.
Care has to be taken in the FilterFunc UDF, to invoke the BufferedReader which
read UTF8 data.
{code}
public class FILTERFROMFILE extends FilterFunc{
private String FilterListFileName = "";
private void init() throws IOException {
Properties props = ConfigurationUtil.toProperties(PigInputFormat.sJob);
InputStream is = FileLocalizer.openDFSFile(FilterListFileName, props);
BufferedReader reader = new BufferedReader(new
InputStreamReader(is,"UTF8"));
}
public Boolean exec(Tuple input) throws IOException {
init();
//do the matching here
}
}
{code}
Pig code to instantiate the filter function UDF
{code}
register pigudf/myfilterfunc.jar;
define MATCHQUERY FILTERFROMFILE('/user/viraj/chinesedata');
rmf chineseoutput;
I = load '/user/viraj/testchinese' using PigStorage('\u0001') as
(teststring:chararray);
J = filter I by MATCHQUERY(teststring);
store J into 'chineseoutput';
{code}
> Pig does not handle Chinese characters (in both the parameter subsitution
> using -param_file or embedded in the Pig script) correctly
> ------------------------------------------------------------------------------------------------------------------------------------
>
> Key: PIG-774
> URL: https://issues.apache.org/jira/browse/PIG-774
> Project: Pig
> Issue Type: Bug
> Components: grunt, impl
> Affects Versions: 0.0.0
> Reporter: Viraj Bhat
> Priority: Critical
> Fix For: 0.0.0
>
> Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile
>
>
> I created a very small test case in which I did the following.
> 1) Created a UTF-8 file which contained a query string in Chinese and wrote
> it to HDFS. I used this dfs file as an input for the tests.
> 2) Created a parameter file which also contained the same query string as in
> Step 1.
> 3) Created a Pig script which takes in the parametrized query string and hard
> coded Chinese character.
> ================================================================
> Pig script: chinese_data.pig
> ================================================================
> {code}
> rmf chineseoutput;
> I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
> J = filter I by $0 == '$querystring';
> --J = filter I by $0 == ' 歌手 香港情牽女人心演唱會';
> store J into 'chineseoutput';
> dump J;
> {code}
> =================================================================
> Parameter file: nextgen_paramfile
> =================================================================
> queryid=20090311
> querystring=' 歌手 香港情牽女人心演唱會'
> =================================================================
> Input file: /user/viraj/chinese.txt
> =================================================================
> shell$ hadoop fs -cat /user/viraj/chinese.txt
> 歌手 香港情牽女人心演唱會
> =================================================================
> I ran the above set of inputs in the following ways:
> Run 1:
> =================================================================
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server=''
> org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
> =================================================================
> 2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient -
> Use GenericOptionsParser for parsing the
> arguments. Applications should implement Tool for the same.
> 2009-04-22 01:31:40,700 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> -
> 0% complete
> 2009-04-22 01:31:50,720 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> -
> 100% complete
> 2009-04-22 01:31:50,720 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> -
> Success!
> =================================================================
> Run 2: removed the parameter substitution in the Pig script instead used the
> following statement.
> =================================================================
> J = filter I by $0 == ' 歌手 香港情牽女人心演唱會';
> =================================================================
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server=''
> org.apache.pig.Main chinese_data_withoutparam.pig
> =================================================================
> 2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient -
> Use GenericOptionsParser for parsing the
> arguments. Applications should implement Tool for the same.
> 2009-04-22 01:35:27,399 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> -
> 0% complete
> 2009-04-22 01:35:32,415 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> -
> 100% complete
> 2009-04-22 01:35:32,415 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> -
> Success!
> =================================================================
> In both cases:
> =================================================================
> ucdev6 01:39:22 ~/pig-svn/trunk $ hadoop fs -ls /user/viraj/chineseoutput
> Found 2 items
> drwxr-xr-x - viraj supergroup 0 2009-04-22 01:37
> /user/viraj/chineseoutput/_logs
> -rw-r--r-- 3 viraj supergroup 0 2009-04-22 01:37
> /user/viraj/chineseoutput/part-00000
> =================================================================
> Additionally tried the dry-run option to figure out if the parameter
> substitution was occurring properly.
> =================================================================
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server=''
> org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig
> =================================================================
> shell$ file chinese_data.pig.substituted
> chinese_data.pig.substituted: ASCII text
> shell$ cat chinese_data.pig.substituted
> {code}
> rmf chineseoutput;
> I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
> J = filter I by $0 == ' ?????? ??????????????????????????????';
> store J into 'chineseoutput';
> {code}
> =================================================================
> This issue has to do with the parser not handling UTF-8 data.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.