[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ankit Modi updated PIG-965:
---------------------------

    Attachment: poregex2.patch
                poregex.patch

These are patches for two implementations.

The first (poregex.patch) applies the optimizations described in the JIRA issue quoted below.

The second (poregex2.patch) applies optimization 1 and uses dk.brics.automaton to run simple regular expressions. Otherwise it falls back to java.util.regex.

In the first patch, the choice between optimization 2 and java.util.regex is made by the getSimpleString method.
In the second patch, the choice to use dk.brics.automaton is made by determineBestRegexMethod. (The changes to build.xml in this patch are temporary.)

Both patches use RegexInit as the initial implementation: it calls the decision function mentioned above and then sets the implementation to the one that function selected.

In the second patch, the decision function was written by looking at the operators dk.brics.automaton supports and at its grammar. I tried out which classes are and are not supported in dk.brics.automaton and based the decision on that. I could not find any page describing the differences between the regex languages of java.util.regex and dk.brics.automaton.

> PERFORMANCE: optimize common case in matches (PORegex)
> ------------------------------------------------------
>
>                 Key: PIG-965
>                 URL: https://issues.apache.org/jira/browse/PIG-965
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Thejas M Nair
>            Assignee: Ankit Modi
>         Attachments: poregex.patch, poregex2.patch
>
>
> Some frequently seen use cases of the 'matches' comparison operator have the following properties -
> 1. The rhs is a constant string, e.g. "c1 matches 'abc%'"
> 2. Regexes that look for a matching prefix, suffix, etc. are very common, e.g. 'abc%', '%abc', '%abc%'
> To optimize for these common cases, PORegex.java can be changed to -
> 1. Compile the pattern (rhs of matches) and re-use it if the pattern string has not changed.
> 2. Use string comparisons for simple common regexes (in 2 above).
> The implementation of the Hive LIKE clause uses similar optimizations.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
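
The two optimizations listed in the issue (re-using the compiled pattern while the pattern string is unchanged, and replacing simple prefix/suffix/substring regexes with plain string comparisons) can be sketched roughly as follows. This is only an illustration and not the code in poregex.patch: the class MatchesSketch and its helpers are hypothetical names, the wildcard is assumed to be the Java regex `.*` rather than the '%' used in the examples, and inputs are assumed to contain no line terminators (which `.` does not match by default).

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the two optimizations; not the code from poregex.patch.
class MatchesSketch {
    private enum Kind { EQUALS, PREFIX, SUFFIX, CONTAINS, REGEX }

    // Characters that give a pattern regex meaning; a "plain" string has none of them.
    private static final String META = "\\[](){}.*+?^$|";

    private String lastRegex;   // pattern string seen on the previous call
    private Kind kind;          // strategy chosen for lastRegex
    private String literal;     // literal part, for the string-comparison strategies
    private Pattern compiled;   // compiled pattern, for the REGEX fallback only

    private static boolean isPlain(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (META.indexOf(s.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    // Decide once per distinct pattern string which strategy to use.
    private void prepare(String regex) {
        String core = regex;
        boolean lead = false;
        boolean trail = false;
        if (core.startsWith(".*")) {
            lead = true;
            core = core.substring(2);
        }
        if (core.endsWith(".*")) {
            trail = true;
            core = core.substring(0, core.length() - 2);
        }
        lastRegex = regex;
        compiled = null;
        if (!isPlain(core)) {
            // Anything beyond literal text plus leading/trailing ".*" goes to java.util.regex.
            kind = Kind.REGEX;
            compiled = Pattern.compile(regex);
        } else {
            literal = core;
            if (lead) {
                kind = trail ? Kind.CONTAINS : Kind.SUFFIX;
            } else {
                kind = trail ? Kind.PREFIX : Kind.EQUALS;
            }
        }
    }

    // Full-string match, like the 'matches' operator; assumes input has no line terminators.
    public boolean matches(String input, String regex) {
        if (!regex.equals(lastRegex)) {
            prepare(regex);   // only when the pattern string has changed
        }
        switch (kind) {
            case EQUALS:   return input.equals(literal);
            case PREFIX:   return input.startsWith(literal);
            case SUFFIX:   return input.endsWith(literal);
            case CONTAINS: return input.contains(literal);
            default:       return compiled.matcher(input).matches();
        }
    }
}
```

With a constant right-hand side, prepare runs once and every later call is a single string comparison or a match against the cached Pattern. For example, new MatchesSketch().matches("abcdef", "abc.*") takes the startsWith path, while a pattern such as "a[0-9]" falls back to java.util.regex.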