Ankit Modi updated PIG-965:

    Attachment: poregex2.patch

Attached are patches for two implementations.

The first (poregex.patch) applies the optimizations described in the JIRA 
issue.
The second (poregex2.patch) applies optimization 1 and uses 
dk.brics.automaton to run simple regular expressions. For anything else it 
falls back to java.util.regex.

In the first patch, the getSimpleString method decides whether to use 
optimization 2 or java.util.regex.
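
As a rough illustration of the kind of decision described here (this is not 
the code from the patch), the sketch below classifies a pattern as a plain 
literal, a prefix, a suffix, or a substring match and falls back to 
java.util.regex otherwise. The class SimpleMatchSketch and its Kind enum are 
hypothetical names. It assumes the wildcard is written as ".*" in Java regex 
syntax, and it ignores the fact that "." does not match line terminators by 
default, which a real implementation would have to handle.

```java
import java.util.regex.Pattern;

// Illustrative only: class and member names are hypothetical, not from the patch.
final class SimpleMatchSketch {
    enum Kind { EQUALS, PREFIX, SUFFIX, CONTAINS, REGEX }

    // Characters with special meaning in java.util.regex.
    private static final String META = "\\[](){}.*+?^$|";

    final Kind kind;
    private final String literal;   // used by the string-comparison kinds
    private final Pattern compiled; // used only when kind == REGEX

    SimpleMatchSketch(String pattern) {
        boolean leading = pattern.startsWith(".*");
        // Require enough characters so a lone ".*" is not counted twice.
        boolean trailing = pattern.endsWith(".*")
                && pattern.length() >= (leading ? 4 : 2);
        String core = pattern.substring(leading ? 2 : 0,
                trailing ? pattern.length() - 2 : pattern.length());
        if (!isLiteral(core)) {
            // Anything beyond literal text plus outer ".*" goes to java.util.regex.
            kind = Kind.REGEX;
            literal = null;
            compiled = Pattern.compile(pattern);
        } else {
            compiled = null;
            literal = core;
            if (leading) {
                kind = trailing ? Kind.CONTAINS : Kind.SUFFIX;
            } else {
                kind = trailing ? Kind.PREFIX : Kind.EQUALS;
            }
        }
    }

    // True if s contains no regex metacharacters.
    private static boolean isLiteral(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (META.indexOf(s.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    boolean matches(String input) {
        switch (kind) {
            case EQUALS:   return input.equals(literal);
            case PREFIX:   return input.startsWith(literal);
            case SUFFIX:   return input.endsWith(literal);
            case CONTAINS: return input.contains(literal);
            default:       return compiled.matcher(input).matches();
        }
    }
}
```

With this sketch, new SimpleMatchSketch("abc.*") is classified as PREFIX and 
is evaluated with String.startsWith, while a pattern such as "a[bc]+" is 
compiled once and handed to java.util.regex.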

In the second patch, determineBestRegexMethod decides whether to use 
dk.brics.automaton. (The changes to build.xml in this patch are temporary.)

Both patches use RegexInit as the initial implementation. It calls the 
decision function mentioned above and then sets the implementation to the 
one that function selects.
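
The delegation described above could look roughly like the following sketch. 
It is not taken from either patch: RegexImpl, CachedRegex, PrefixMatch, and 
RegexOperatorSketch are hypothetical names, and the choose method is a 
simplified stand-in for the real decision functions. It assumes the rhs is a 
constant string, as in the common case described in the issue. CachedRegex 
also shows optimization 1, recompiling only when the pattern string changes.

```java
import java.util.regex.Pattern;

// Illustrative only: all names below are hypothetical, not from the patches.
interface RegexImpl {
    boolean match(String lhs, String rhs);
}

// Fallback: java.util.regex, recompiling only when the pattern string changes.
final class CachedRegex implements RegexImpl {
    private String lastPattern;
    private Pattern compiled;
    int compileCount = 0; // exposed only so the reuse can be observed

    public boolean match(String lhs, String rhs) {
        if (!rhs.equals(lastPattern)) {
            compiled = Pattern.compile(rhs);
            lastPattern = rhs;
            compileCount++;
        }
        return compiled.matcher(lhs).matches();
    }
}

// Fast path for patterns of the form "<literal>.*".
// Assumes the rhs stays constant after the first call, so rhs is ignored here.
final class PrefixMatch implements RegexImpl {
    private final String prefix;

    PrefixMatch(String prefix) {
        this.prefix = prefix;
    }

    public boolean match(String lhs, String rhs) {
        return lhs.startsWith(prefix);
    }
}

final class RegexOperatorSketch {
    // Starts as the initializer; replaced on the first call to match.
    private RegexImpl impl = new Init();

    boolean match(String lhs, String rhs) {
        return impl.match(lhs, rhs);
    }

    String implName() {
        return impl.getClass().getSimpleName();
    }

    // Plays the role the message assigns to RegexInit: decide once, then delegate.
    private final class Init implements RegexImpl {
        public boolean match(String lhs, String rhs) {
            impl = choose(rhs);
            return impl.match(lhs, rhs);
        }
    }

    // Simplified stand-in for the decision function.
    private static RegexImpl choose(String rhs) {
        if (rhs.endsWith(".*")) {
            String prefix = rhs.substring(0, rhs.length() - 2);
            if (isPlain(prefix)) {
                return new PrefixMatch(prefix);
            }
        }
        return new CachedRegex();
    }

    // Conservative check: only letters and digits count as plain text.
    private static boolean isPlain(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetterOrDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

The point of the design is that the decision is paid once per operator 
instance, and every later call goes straight to the selected implementation.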

In the second patch, I wrote the decision function by looking at the 
operators and grammar that dk.brics.automaton supports. I tested which 
classes are and are not supported in dk.brics.automaton and based the 
decision on the results.

I could not find any page that specifically documents the differences 
between the regex languages of java.util.regex and dk.brics.automaton.

> PERFORMANCE: optimize common case in matches (PORegex)
> ------------------------------------------------------
>                 Key: PIG-965
>                 URL: https://issues.apache.org/jira/browse/PIG-965
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Thejas M Nair
>            Assignee: Ankit Modi
>         Attachments: poregex.patch, poregex2.patch
> Some frequently seen use cases of the 'matches' comparison operator have 
> the following properties -
> 1. The rhs is a constant string, e.g. "c1 matches 'abc%'".
> 2. Regexes that look for a matching prefix, suffix, etc. are very common, 
> e.g. 'abc%', '%abc', '%abc%'.
> To optimize for these common cases, PORegex.java can be changed to -
> 1. Compile the pattern (rhs of matches) and re-use it if the pattern string 
> has not changed.
> 2. Use string comparisons for simple common regexes (in 2 above).
> The implementation of the Hive like clause uses similar optimizations.

