[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]

Sami Siren updated NUTCH-395:
-----------------------------

    Attachment: NUTCH-395-trunk-metadata-only.patch

Here's a first stab at a patch against svn trunk of Nutch that only optimizes 
the use of metadata, splitting it into two functionally distinct pieces: one 
for plain metadata and one for spellchecking over the keys of metadata.
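
As a rough illustration of the split described above, here is a minimal sketch 
of a plain metadata store plus a key-correcting wrapper around it. All names 
here (MetadataStore, PlainMetadata, KeyCorrectingMetadata, the list of known 
keys) are made up for the example and are not taken from the patch; the key 
correction is also simplified to a normalization lookup.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal metadata contract (hypothetical; not the interface used in the patch). */
interface MetadataStore {
    void set(String name, String value);
    String get(String name);
}

/** Plain metadata: a direct map lookup with no key correction. */
class PlainMetadata implements MetadataStore {
    private final Map<String, String> values = new HashMap<String, String>();

    public void set(String name, String value) { values.put(name, value); }

    public String get(String name) { return values.get(name); }
}

/** Wrapper that maps misspelled keys to a canonical form before delegating. */
class KeyCorrectingMetadata implements MetadataStore {
    // Assumed list of well-known header names; a real list would be longer.
    private static final String[] KNOWN_KEYS = {
        "Content-Type", "Content-Length", "Last-Modified"
    };
    private static final Map<String, String> CANONICAL = new HashMap<String, String>();
    static {
        for (String key : KNOWN_KEYS) {
            CANONICAL.put(normalize(key), key);
        }
    }

    private final MetadataStore delegate;

    KeyCorrectingMetadata(MetadataStore delegate) { this.delegate = delegate; }

    public void set(String name, String value) { delegate.set(canonical(name), value); }

    public String get(String name) { return delegate.get(canonical(name)); }

    /** Lower-cases the key and drops every character that is not a letter. */
    static String normalize(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isLetter(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    /** Returns the known spelling of a key, or the key unchanged if it is unknown. */
    static String canonical(String name) {
        String known = CANONICAL.get(normalize(name));
        return known != null ? known : name;
    }
}
```

Code that only stores and reads exact keys uses PlainMetadata directly and 
pays nothing for correction; code that handles http headers can wrap the same 
store in KeyCorrectingMetadata.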

There's probably still room for optimization on both the metadata and IO 
side.

The same local filesystem fetching benchmark was run as earlier, this time on 
the trunk version. Even though the benchmark was run with file:// urls, the 
change should benefit other protocols as well, specifically because it seems 
to cut down the time needed for the reduce phase quite aggressively.

I would also recommend adding some kind of baseline benchmark for crawling 
operations to Nutch so that we don't kill the performance (again and again) at 
some point.

from svn trunk
----------------------
real    10m43.527s
user    10m11.210s
sys     0m21.837s

fetch breakdown:
5 min 19 sec    effective fetching
7 sec           sort
4 min 30 sec    reduce > reduce


patched version
----------------------
real    4m53.742s
user    4m21.340s
sys     0m19.045s

fetch breakdown:
3 min 36 sec    effective fetching
8 sec           sort
27 sec          reduce > reduce
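
Comparing the two runs above: wall-clock time drops by a factor of roughly 
2.2 and the reduce phase by a factor of 10. A small sketch of that arithmetic 
(the class and method names are made up for illustration):

```java
/** Helper for comparing `time` output between two runs (hypothetical names). */
class BenchCompare {
    /** Converts a value such as "10m43.527s" to seconds. */
    static double toSeconds(String t) {
        int m = t.indexOf('m');
        double minutes = Double.parseDouble(t.substring(0, m));
        double seconds = Double.parseDouble(t.substring(m + 1, t.length() - 1));
        return minutes * 60 + seconds;
    }

    /** Ratio of the time before the patch to the time after it. */
    static double speedup(double beforeSeconds, double afterSeconds) {
        return beforeSeconds / afterSeconds;
    }

    public static void main(String[] args) {
        // Wall-clock ("real") times for svn trunk and the patched version.
        double real = speedup(toSeconds("10m43.527s"), toSeconds("4m53.742s"));
        // Reduce phase: 4 min 30 sec before, 27 sec after.
        double reduce = speedup(4 * 60 + 30, 27);
        System.out.printf("real: %.2fx, reduce: %.1fx%n", real, reduce);
    }
}
```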



> Increase fetching speed
> -----------------------
>
>                 Key: NUTCH-395
>                 URL: http://issues.apache.org/jira/browse/NUTCH-395
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0, 0.8.1
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>         Attachments: nutch-0.8-performance.txt, 
> NUTCH-395-trunk-metadata-only.patch
>
>
> There has been some discussion on the nutch mailing lists about the fetcher 
> being slow; this patch tries to address that. The patch is just a quick hack 
> and needs some cleaning up. It also currently applies to the 0.8 branch and 
> not trunk, and it has not been tested at large scale. What does it change?
> Metadata - the original metadata uses spellchecking, the new version does 
> not (a decorator is provided that can do it, and it should perhaps be used 
> where http headers are handled, but in most cases the functionality is not 
> required).
> Reading/writing various data structures - the patch tries to do IO more 
> efficiently; see the patch for details.
> Initial benchmark:
> A small benchmark was done to measure the performance of changes with a 
> script that basically does the following:
> -inject a list of urls into a fresh crawldb
> -create fetchlist (10k urls pointing to local filesystem)
> -fetch
> -updatedb
> original code from 0.8-branch:
> real    10m51.907s
> user    10m9.914s
> sys     0m21.285s
> after applying the patch
> real    4m15.313s
> user    3m42.598s
> sys     0m18.485s
