Julien Nioche created NUTCH-1666:
------------------------------------
Summary: Optimisation for BasicURLNormalizer
Key: NUTCH-1666
URL: https://issues.apache.org/jira/browse/NUTCH-1666
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.7
Reporter: Julien Nioche
Priority: Minor
Fix For: 1.8
Attachments: NUTCH-1666.patch
The regular expressions in the BasicURLNormalizer are quite costly. The attached
patch skips that processing when a URL does not contain a sequence of
interest (two slashes with zero, one or two dots in between).
This reduces the time spent on post-processing after parsing quite a bit.
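
To illustrate the idea, here is a minimal sketch of such a guard. It is not the
attached patch: the class name PathGuardSketch, the methods needsNormalization
and normalizePath, and the caller-supplied "expensive" function are all
hypothetical. The sketch also assumes the check runs on the path part of the
URL only, since the "//" after the scheme would otherwise always match.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical sketch, not the code from NUTCH-1666.patch.
class PathGuardSketch {

    // Matches two slashes with zero, one or two dots in between:
    // "//", "/./" or "/../".
    private static final Pattern SEQUENCE_OF_INTEREST =
            Pattern.compile("/\\.?\\.?/");

    // Cheap pre-check: true only if the path contains a sequence of interest.
    static boolean needsNormalization(String path) {
        return SEQUENCE_OF_INTEREST.matcher(path).find();
    }

    // Runs the costly step only when the pre-check says it could change
    // the path. "expensive" stands in for the existing regex substitutions.
    static String normalizePath(String path, UnaryOperator<String> expensive) {
        if (!needsNormalization(path)) {
            return path; // nothing to rewrite, skip the costly processing
        }
        return expensive.apply(path);
    }

    public static void main(String[] args) {
        String[] samples = {"/a/b/c.html", "/a//b", "/a/./b", "/a/../b"};
        for (String s : samples) {
            System.out.println(s + " -> " + needsNormalization(s));
        }
    }
}
```

The guard is a single short pattern, so paths without any of the three
sequences return immediately and never reach the heavier substitutions.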