[ 
https://issues.apache.org/jira/browse/NUTCH-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067972#comment-13067972
 ] 

Andrzej Bialecki  commented on NUTCH-1014:
------------------------------------------

java.util.regex has the advantage of being part of the JRE. However, it is 
quite slow for more complex regexes; see e.g. this benchmark: 
http://www.tusker.org/regex/regex_benchmark.html . In my experience with larger 
crawls this is especially important when using regexes for URL filtering and 
normalization: an innocent-looking regex can melt the CPU when processing a 
64 kB junk URL, and consequently stall the crawl. In such cases it's good to 
have an option to fall back to a subset of regex features and use a DFA-based 
library such as Brics. ORO is generally faster than j.u.regex (but it isn't 
maintained anymore). Brics lacks support for many operators, but it's fast. 
Perhaps ICU4J would be a good alternative: it's fully JDK-compatible and 
offers good performance.
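
To illustrate the junk-URL problem, here is a minimal sketch of one cheap 
safeguard that works with plain java.util.regex: reject over-long URLs before 
any regex is evaluated. This is not the DFA fallback mentioned above, and it is 
not existing Nutch code; the class name GuardedUrlFilter, the MAX_URL_LENGTH 
cap of 2048 and the sample pattern are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: skips regex evaluation for
// junk-length URLs so a backtracking pattern never sees them.
final class GuardedUrlFilter {
    // Assumed cap; tune it to the crawl. Longer URLs are treated as junk.
    static final int MAX_URL_LENGTH = 2048;

    private final Pattern pattern;

    GuardedUrlFilter(String regex) {
        // Compile once and reuse; Pattern instances are thread-safe.
        this.pattern = Pattern.compile(regex);
    }

    /** Returns true if the URL is short enough and the pattern is found in it. */
    boolean accept(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
            return false; // never hand junk-length input to the regex engine
        }
        return pattern.matcher(url).find();
    }
}
```

With a filter built as new GuardedUrlFilter("^https?://[^/]+/"), a normal URL 
is matched as usual, while a 64 kB URL is rejected by the length check alone. 
The guard bounds the input size but does not make a pathological pattern safe 
on shorter inputs, so it complements rather than replaces a faster engine.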

> Migrate from Apache ORO to java.util.regex
> ------------------------------------------
>
>                 Key: NUTCH-1014
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1014
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>
> A separate issue tracking migration of all components from Apache ORO to 
> java.util.regex. Components involved are:
> - RegexURLNormalizer
> - OutlinkExtractor
> - JSParseFilter
> - MoreIndexingFilter
> - BasicURLNormalizer

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
