[ 
https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503072#comment-14503072
 ] 

Julien Nioche commented on NUTCH-1990:
--------------------------------------

Thanks [~wastl-nagel]! 

I have extracted 3332418 URLs from a random segment of CommonCrawl 
(CC-MAIN-20150226074059-00000-ip-10-28-5-156.ec2.internal.warc.gz), parsed 
with JSoup. The URLs are meant to be absolute but contain a lot of garbage, so 
the list is as close to real life as can be.

I tested the impact of your patch by injecting these URLs. We get the same 
number of URLs post-normalisation as before, and it seems to take the same 
amount of time:

{code}
Injector: Total number of urls rejected by filters: 886704
Injector: Total number of urls after normalization: 2445715
Injector: Total new urls injected: 2445715
Injector: finished at 2015-04-20 16:31:30, elapsed: 00:00:59
{code}

Note that the figures above were obtained by removing the patterns for the 
regex-based normalisation as well as commenting out

{code}
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
{code}

in regex-urlfilter.txt, as these operations take most of the time. The 
processing time when leaving these files in their default form is 08:23 
(versus 00:59 above), which confirms that even if the code modified by your 
patch were a bit slower (which is not the case), it would be irrelevant 
compared to the overall time spent normalizing and filtering.
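
To illustrate what the commented-out filter rule above does, here is a minimal sketch in plain Java. The class and method names are hypothetical, the example URLs are made up, and it assumes the rule is applied with "find" semantics (the pattern may match anywhere in the URL); it is not the Nutch filter code itself.

```java
import java.util.regex.Pattern;

final class RepeatedSegmentDemo {
    // Pattern part of the rule, without the leading "-" that marks it as an
    // exclusion rule in regex-urlfilter.txt. \1 is a backreference to the
    // first captured path segment.
    static final Pattern LOOP =
        Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Assumption: a URL is rejected when the pattern is found anywhere in it.
    static boolean isRejected(String url) {
        return LOOP.matcher(url).find();
    }

    public static void main(String[] args) {
        // "/a" appears three times, each separated by one other segment.
        System.out.println(isRejected("http://example.com/a/b/a/c/a/")); // true
        // No segment repeats, so the rule does not apply.
        System.out.println(isRejected("http://example.com/a/b/c/d/e/")); // false
    }
}
```

The combination of a leading `.*` and backreferences means the regex engine may have to try many starting points per URL, which is consistent with this rule being expensive on a multi-million URL list.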

See the related discussion in Storm-Crawler 
[https://github.com/DigitalPebble/storm-crawler/issues/120].

Later on we might want to have some basic normalization code in 
Crawler-Commons, in which case Nutch could leverage it, but for now I think 
this patch should be committed.

The list of URLs used for these tests can be downloaded from 
[https://drive.google.com/open?id=0B4ebzXTbUoiAY0hXNjUtdnJGN3M&authuser=0], 
just in case someone wants to reproduce the steps. 







> Use URI.normalise() in BasicURLNormalizer
> -----------------------------------------
>
>                 Key: NUTCH-1990
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1990
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-1990-trial1.patch
>
>
> One of the things that 
> [BasicURLNormalizer|https://github.com/apache/nutch/blob/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java]
>  does is to remove unnecessary dot segments in the path.
> Instead of implementing the logic ourselves with some antiquated regex 
> library, we should simply use 
> [http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()] 
> which does the same and is probably more efficient.
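
As a rough illustration of the dot-segment removal mentioned in the description, the sketch below calls `java.net.URI.normalize()` directly. The class name, helper method, and example URL are hypothetical and are not taken from the patch.

```java
import java.net.URI;

final class DotSegmentDemo {
    // Resolves "." and ".." path segments using java.net.URI.
    // URI.create throws IllegalArgumentException on input that is not a
    // syntactically valid URI, so real-world callers need to handle that.
    static String normalizePath(String url) {
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        // "./" is dropped and "b/../" collapses.
        System.out.println(
            normalizePath("http://example.com/a/./b/../c/index.html"));
        // prints http://example.com/a/c/index.html
    }
}
```

Because `java.net.URI` parses strictly, a URL list with a lot of garbage will contain entries that fail to parse at all, so a caller would typically catch the exception and fall back to some other handling.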



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
