[
https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503072#comment-14503072
]
Julien Nioche commented on NUTCH-1990:
--------------------------------------
Thanks [~wastl-nagel]!
I have extracted 3332418 URLs from a random segment of CommonCrawl
(CC-MAIN-20150226074059-00000-ip-10-28-5-156.ec2.internal.warc.gz) by parsing
it with JSoup. The URLs are meant to be absolute but contain a lot of garbage,
so the sample is as close to real life as can be.
I tested the impact of your patch by injecting these URLs. We are getting the
same number of URLs post-normalisation and it seems to take the same amount of
time:
{code}
Injector: Total number of urls rejected by filters: 886704
Injector: Total number of urls after normalization: 2445715
Injector: Total new urls injected: 2445715
Injector: finished at 2015-04-20 16:31:30, elapsed: 00:00:59
{code}
Note that the figures above were obtained by removing the patterns for the
regex-based normalisation as well as commenting out
{code}
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
{code}
in regex-urlfilter.txt, as these operations take most of the time. The
processing time when leaving these files in their default form is 08:23
(versus 00:59 above), which confirms that even if the code modified by your
patch were a bit slower (which is not the case), it would be irrelevant
compared to the overall time spent normalizing and filtering.
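For readers unfamiliar with the loop-breaking rule quoted above, here is a minimal sketch of what the pattern matches, using plain java.util.regex. The class and method names are hypothetical, and the use of find() is an assumption about how the filter applies the rule, not a copy of the Nutch code. The backreferences (\1) force the engine to backtrack over candidate segments, which is one plausible reason the rule is costly on long, messy URLs.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
public class RepeatedSegmentRule {

    // Same expression as the regex-urlfilter.txt rule, without the leading '-'
    // (which, in that file, marks the rule as an exclusion).
    private static final Pattern LOOP =
        Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Returns true when a slash-delimited segment repeats three times
    // with exactly one other segment between each repetition.
    static boolean isLoop(String url) {
        return LOOP.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(isLoop("http://example.com/a/b/a/c/a/")); // repeated "/a"
        System.out.println(isLoop("http://example.com/a/b/c/"));     // no repetition
    }
}
```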
See the related discussion in Storm-Crawler
[https://github.com/DigitalPebble/storm-crawler/issues/120].
Later on we might want to have some basic normalization code in
Crawler-Commons, in which case Nutch could leverage it but for now I think this
patch should be committed.
The list of URLs used for these tests can be downloaded from
[https://drive.google.com/open?id=0B4ebzXTbUoiAY0hXNjUtdnJGN3M&authuser=0],
just in case someone wants to reproduce the steps.
> Use URI.normalise() in BasicURLNormalizer
> -----------------------------------------
>
> Key: NUTCH-1990
> URL: https://issues.apache.org/jira/browse/NUTCH-1990
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.9
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Attachments: NUTCH-1990-trial1.patch
>
>
> One of the things that
> [BasicURLNormalizer|https://github.com/apache/nutch/blob/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java]
> does is to remove unnecessary dot segments in the path.
> Instead of implementing the logic ourselves with some antiquated regex
> library, we should simply use
> [http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()]
> which does the same and is probably more efficient.
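As a rough illustration of the behaviour the description relies on, the sketch below calls java.net.URI.normalize() on a URL containing dot segments. The class name, method name, and sample URL are made up for the example, and this is not the code from the attached patch. Note that the URI constructor throws URISyntaxException on malformed input, so a normalizer fed with garbage URLs would need to handle that case.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class, not the patched BasicURLNormalizer.
public class DotSegmentDemo {

    // Parses the URL and lets java.net.URI collapse "." and ".." path segments.
    static String normalizePath(String url) throws URISyntaxException {
        return new URI(url).normalize().toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // "/a/./b/../c/index.html" collapses to "/a/c/index.html"
        System.out.println(normalizePath("http://example.com/a/./b/../c/index.html"));
    }
}
```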