[ 
https://issues.apache.org/jira/browse/PIG-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713313#action_12713313
 ] 

Milind Bhandarkar commented on PIG-738:
---------------------------------------

This is not an issue in Pig parsing. The Definition of a quoted string in the 
pig parser is:

{code}
TOKEN : { <QUOTEDSTRING :  "'"
(   (~["'","\\","\n","\r"])
  | ("\\"
      ( ["n","t","b","r","f","\\","'"] )
    )
  | ("\\u"
        ["0"-"9","A"-"F","a"-"f"]
        ["0"-"9","A"-"F","a"-"f"]
        ["0"-"9","A"-"F","a"-"f"]
        ["0"-"9","A"-"F","a"-"f"]
    )
)*
"'"> }
{code}

It matches: 
{code}
'www\\.yahoo\\.com/sports'
{code}.

So, its the stages after parsing that the regexp is getting messed up. (My 
suspicion is FuncSpec serialization.)

> Regexp passed from pigscript fails in UDF  
> -------------------------------------------
>
>                 Key: PIG-738
>                 URL: https://issues.apache.org/jira/browse/PIG-738
>             Project: Pig
>          Issue Type: Bug
>          Components: grunt
>    Affects Versions: 0.3.0
>            Reporter: Viraj Bhat
>             Fix For: 0.3.0
>
>         Attachments: myregexp.jar, RegexGroupCount.java, regexp.pig, 
> regexpinput.txt
>
>
> Consider a pig script which parses and counts regular expressions from a text 
> file. 
> The regular expression supplied in the Pig script needs to escape the "."  
> (dot) character.
> {code}
> register myregexp.jar;
> -- pattern not picked up
> define minelogs ci_pig_udfs.RegexGroupCount('www\\.yahoo\\.com/sports');
> A = load '/user/viraj/regexpinput.txt'  using PigStorage() as (source : 
> chararray);
> B = foreach A generate minelogs(source) as sportslogs;
> dump B;
> {code}
> Snippet of UDF RegexGroupCount.java
> {code}
> public class RegexGroupCount extends EvalFunc<Integer> {
>     private final Pattern pattern_;
>     public RegexGroupCount(String patternStr) {
>        System.out.println("My pattern supplied is "+patternStr);
>        System.out.println("Equality test 
> "+patternStr.equals("www\\.yahoo\\.com/sports"));
>        pattern_ = Pattern.compile(patternStr, 
> Pattern.DOTALL|Pattern.CASE_INSENSITIVE);
>    }
>   public Integer exec(Tuple input)  throws IOException {
>    }
> }
> {code}
> Running the above script on the following dataset :
> ====================================================================================================
> dshfdskfwww.yahoo.com/sportsjoadfjdslpdshfdskfwww.yahoo.com/sportsjoadfjdsl
> kas;dka;sd
> jsjsjwww.yahoo.com/sports
> jsdLSJDcom/sports
> wwwJyahooMcom/sports
> ====================================================================================================
> Results in the following:
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> Userfunc: (Name: UserFunc viraj-Sat Mar 28 02:06:31 PDT 2009-14 function: 
> ci_pig_udfs.RegexGroupCount('www\\.yahoo\\.com/sports') Operator Key: 
> viraj-Sat Mar 28 02:06:31 PDT 2009-14)
> Userfunc fs: int
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> 2009-03-28 02:06:43,923 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 100% complete
> 2009-03-28 02:06:43,923 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Success!
> (0)
> (0)
> (0)
> (0)
> (0)
> ====================================================================================================
> In essence there seems to be no way of passing this type of constructor 
> argument through the Pig script. The only workaround seems to be hard coding 
> the values in the UDF!!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to