[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784596#action_12784596
 ] 

Ankit Modi commented on PIG-965:
--------------------------------

I implemented two patches: one with optimizations 1 and 2 mentioned above, 
and another with optimizations 1 and 2 plus dk.brics.automaton.
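
For readers without the patch at hand, here is a minimal sketch of what 
optimizations 1 and 2 amount to: reuse the prepared pattern while the pattern 
string is unchanged, and use plain string comparisons for simple prefix, 
suffix and substring patterns. This is not the code from the patch. The class 
name CachedMatcher and its methods are hypothetical, and the sketch assumes 
single-line input: in java.util.regex, '.' does not match line terminators by 
default, so the string shortcut can differ from the regex on multi-line input.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the PORegex code from the patch.
final class CachedMatcher {
    private String lastPattern;  // pattern string seen on the previous call
    private Pattern compiled;    // compiled regex, used only on the slow path
    private String literal;      // non-null when string comparison suffices
    private boolean anyPrefix;   // pattern started with ".*"
    private boolean anySuffix;   // pattern ended with ".*"

    boolean matches(String input, String pattern) {
        // Optimization 1: redo the preparation only when the pattern changes.
        if (!pattern.equals(lastPattern)) {
            prepare(pattern);
        }
        // Optimization 2: simple patterns become plain string operations.
        // Assumption: input has no line terminators (see the note above).
        if (literal != null) {
            if (anyPrefix && anySuffix) return input.contains(literal);
            if (anyPrefix) return input.endsWith(literal);
            if (anySuffix) return input.startsWith(literal);
            return input.equals(literal);
        }
        return compiled.matcher(input).matches();
    }

    private void prepare(String pattern) {
        lastPattern = pattern;
        String body = pattern;
        anyPrefix = body.startsWith(".*");
        if (anyPrefix) body = body.substring(2);
        anySuffix = body.endsWith(".*");
        if (anySuffix) body = body.substring(0, body.length() - 2);
        if (isPlainLiteral(body)) {
            literal = body;
            compiled = null;
        } else {
            // Anything with regex metacharacters goes to java.util.regex.
            literal = null;
            compiled = Pattern.compile(pattern);
        }
    }

    // Conservative check: only letters, digits, spaces and underscores count
    // as literal text, so every regex metacharacter forces the slow path.
    private static boolean isPlainLiteral(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isLetterOrDigit(c) && c != ' ' && c != '_') {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, a pattern such as .*ABCD.* is answered by String.contains, 
while A.B.C.D still goes through a compiled java.util.regex.Pattern that is 
built once and reused across rows.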

dk.brics.automaton does not support all features of java.util.regex, so the 
second patch takes that into account and falls back to java.util.regex 
whenever the regex can only be handled by java.util.regex.
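
A fallback check of this kind could look roughly like the sketch below. It is 
not the check used in the patch. The names EngineChooser and choose are 
hypothetical, and the sketch only detects backreferences and "(?" group 
constructs (lookaround, inline flags and similar). Other syntax differences 
between the two libraries are not handled here.

```java
// Hypothetical helper for illustration; not the check used in the patch.
final class EngineChooser {
    enum Engine { AUTOMATON, JAVA_UTIL_REGEX }

    // Picks an engine by scanning the regex text once. The scan is
    // conservative: a false positive (for example "(?" inside a character
    // class) only means falling back to java.util.regex, which is still
    // correct, just slower.
    static Engine choose(String regex) {
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                char next = regex.charAt(i + 1);
                // \1 .. \9 is a numbered backreference, \k<name> a named one.
                if ((next >= '1' && next <= '9') || next == 'k') {
                    return Engine.JAVA_UTIL_REGEX;
                }
                i++; // skip the escaped character, e.g. the second '\' of "\\"
            } else if (c == '(' && regex.startsWith("(?", i)) {
                // Lookaround, inline flags, non-capturing and named groups.
                return Engine.JAVA_UTIL_REGEX;
            }
        }
        return Engine.AUTOMATON;
    }
}
```

Under these assumptions, a regex with a backreference such as \1 is routed to 
java.util.regex, while a regex like A.B.C.D is routed to the automaton.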

Here are the numbers:

||Regex||svn_trunk||Optimization 1 and 2||dk.brics.automaton||Comments||
| .\*ABCD.\* | 92.74 | 50.92 | 49.32 | Here only optimization 2 is used |
| .\*[A-F]{2,3}.\* | 152.3 | 133.48 | 105.93 | dk.brics.automaton is used |
| A.B.C.D | 54.492 | 44.46 | 44.66 | dk.brics.automaton is used |
| .\*([A-F]{4})\w\*\1.\* | 129.29 | 112.89 | 109.43 | java.util.regex used in all cases |
| .\*\[A-F\]\{4\}\w\*[N-Z]\{3\}.\* | 129.63 | 108.11 | 54.42 | dk.brics.automaton used |


These results were obtained in local mode on 1 billion lines of data in the 
following format:
f1: chararray(100) of random characters from [A-Z]
f2: int, a random integer

dk.brics.automaton performs well on complex regexes: in the last row it 
roughly halves the time relative to optimizations 1 and 2 alone (54.42 vs. 
108.11).


> PERFORMANCE: optimize common case in matches (PORegex)
> ------------------------------------------------------
>
>                 Key: PIG-965
>                 URL: https://issues.apache.org/jira/browse/PIG-965
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Thejas M Nair
>            Assignee: Ankit Modi
>
> Some frequently seen use cases of the 'matches' comparison operator have 
> the following properties:
> 1. The rhs is a constant string, e.g. "c1 matches 'abc%'".
> 2. Regexes that look for a matching prefix, suffix, etc. are very common, 
> e.g. 'abc%', '%abc', '%abc%'.
> To optimize for these common cases, PORegex.java can be changed to:
> 1. Compile the pattern (rhs of matches) and reuse it if the pattern string 
> has not changed.
> 2. Use string comparisons for simple common regexes (as in 2 above).
> The implementation of the Hive LIKE clause uses similar optimizations.
