[
https://issues.apache.org/jira/browse/NUTCH-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058029#comment-13058029
]
Markus Jelsma commented on NUTCH-1013:
--------------------------------------
Yes, ORO is superior in terms of raw speed: on average it is ~17% faster. This
was measured with a CrawlDB of roughly 2.2 million URLs. The generator was not
limited with -topN.
Java regex averages 310 seconds of run time, whereas ORO averages 263 seconds.
This was on a dedicated machine without Hadoop.
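For anyone who wants a rough feel for the java.util.regex side in isolation, a minimal timing sketch could look like the one below. It is not the measurement described above, which timed full generate runs. The jsessionid rule, the class name and the synthetic URLs are assumptions made up for illustration, and a single un-warmed pass like this is only indicative.

```java
import java.util.regex.Pattern;

class RegexTimingSketch {
    // Hypothetical normalization rule: drop a ";jsessionid=..." path parameter.
    static final Pattern SESSION_ID =
        Pattern.compile(";jsessionid=[^?#]*", Pattern.CASE_INSENSITIVE);

    // Applies the rule to every URL and returns the total output length,
    // so the work cannot be optimized away as unused.
    static long normalizeAll(String[] urls) {
        long totalLength = 0;
        for (String url : urls) {
            totalLength += SESSION_ID.matcher(url).replaceAll("").length();
        }
        return totalLength;
    }

    public static void main(String[] args) {
        // Synthetic input; a real test would read URLs from the CrawlDB.
        String[] urls = new String[100_000];
        for (int i = 0; i < urls.length; i++) {
            urls[i] = "http://example.com/page" + i + ";jsessionid=ABC" + i + "?q=1";
        }
        long start = System.nanoTime();
        long total = normalizeAll(urls);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("normalized " + urls.length + " URLs ("
            + total + " chars) in " + elapsedMs + " ms");
    }
}
```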
More interesting, in my opinion, is the reduced memory consumption of
util.regex. ORO uses almost three times as much heap space as util.regex. The
same generate cycles show about 12.4% for ORO, whereas util.regex never went
higher than 4.8%.
Is the performance penalty considered to be blocking?
> Migrate RegexURLNormalizer from Apache ORO to java.util.regex
> -------------------------------------------------------------
>
> Key: NUTCH-1013
> URL: https://issues.apache.org/jira/browse/NUTCH-1013
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1013-1.4.patch
>
>
> Apache ORO uses old Perl 5-style regular expressions. Features such as the
> powerful lookbehind are not available. The project has also been retired.
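
As an illustration of the lookbehind feature mentioned in the description, here is a minimal sketch using java.util.regex. The rule itself (collapsing repeated slashes while leaving the "//" after the scheme alone) and the class and method names are hypothetical, not taken from the attached patch or from Nutch's shipped normalizer rules.

```java
import java.util.regex.Pattern;

class LookbehindSketch {
    // (?<!:) is a negative lookbehind: a run of two or more slashes only
    // matches when the character just before it is not ':'.
    private static final Pattern DUPLICATE_SLASHES = Pattern.compile("(?<!:)/{2,}");

    // Replaces each matching run of slashes with a single slash.
    static String collapseSlashes(String url) {
        return DUPLICATE_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        // Prints http://example.com/a/b
        System.out.println(collapseSlashes("http://example.com//a///b"));
    }
}
```

The lookbehind keeps the rule to a single pattern. Without it, the scheme separator would have to be captured and put back in the replacement string.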
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira