[
https://issues.apache.org/jira/browse/NUTCH-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067972#comment-13067972
]
Andrzej Bialecki commented on NUTCH-1014:
------------------------------------------
java.util.regex has the advantage of being part of the JRE. However, it is
quite slow for more complex regexes. See e.g. this benchmark:
http://www.tusker.org/regex/regex_benchmark.html . In my experience with larger
crawls this is especially important when using regexes for URL filtering and
normalization: an innocent-looking regex can melt the CPU when processing a
64 kB junk URL, and consequently stall the crawl. In such cases it's good to
have an option to fall back to a subset of regex features and use a DFA-based
library such as Brics. ORO is generally faster than j.u.regex (but it is no
longer maintained). Brics lacks support for many operators, but it's fast.
Perhaps ICU4J would be a good alternative: it's fully JDK-compatible and
offers good performance.
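
To illustrate the junk-URL concern, here is a minimal sketch of one way to
bound the cost of java.util.regex when it is used for URL filtering. It is
not Nutch code: the class name GuardedUrlFilter, the length limit, and the
time budget are all hypothetical. The sketch rejects overlong URLs before
matching and wraps the input in a CharSequence that aborts the match once a
deadline has passed.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: bounds the work done per URL.
final class GuardedUrlFilter {
    private final Pattern pattern;
    private final int maxUrlLength;   // assumed limit, tune per crawl
    private final long timeoutNanos;  // assumed per-URL time budget

    GuardedUrlFilter(String regex, int maxUrlLength, long timeoutMillis) {
        this.pattern = Pattern.compile(regex);
        this.maxUrlLength = maxUrlLength;
        this.timeoutNanos = timeoutMillis * 1_000_000L;
    }

    // Returns true only if the URL is short enough and the regex finds a
    // match within the time budget.
    boolean accept(String url) {
        if (url == null || url.length() > maxUrlLength) {
            return false; // cheap guard: never feed junk-length URLs to the regex
        }
        long deadline = System.nanoTime() + timeoutNanos;
        try {
            return pattern.matcher(new DeadlineCharSequence(url, deadline)).find();
        } catch (RegexTimeoutException e) {
            return false; // treat a runaway match as a rejection
        }
    }

    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException() {
            super("regex match exceeded its time budget");
        }
    }

    // The matcher reads its input through charAt, so checking the clock
    // there lets us interrupt a match that backtracks for too long.
    private static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException();
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }
}
```

A filter built as new GuardedUrlFilter("^https?://", 2048, 100) would accept
ordinary http(s) URLs and reject a 64 kB one without running the regex at
all. A guard like this only limits the damage; it does not make j.u.regex
faster, so it complements rather than replaces a DFA-based fallback.
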
> Migrate from Apache ORO to java.util.regex
> ------------------------------------------
>
> Key: NUTCH-1014
> URL: https://issues.apache.org/jira/browse/NUTCH-1014
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Fix For: 1.4, 2.0
>
>
> A separate issue tracking migration of all components from Apache ORO to
> java.util.regex. Components involved are:
> - RegexURLNormalizer
> - OutlinkExtractor
> - JSParseFilter
> - MoreIndexingFilter
> - BasicURLNormalizer