[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]
Sami Siren updated NUTCH-395:
-----------------------------
Attachment: NUTCH-395-trunk-metadata-only-2.patch
An additional change to Content cuts down the time spent in effective
fetching. I am now seeing speeds of around 45 pages/sec over http as well.
real 4m24.126s
user 3m53.835s
sys 0m18.681s
3 min 10 sec effective fetching
6 sec sorting
27 sec reduce > reduce
> Increase fetching speed
> -----------------------
>
> Key: NUTCH-395
> URL: http://issues.apache.org/jira/browse/NUTCH-395
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 0.9.0, 0.8.1
> Reporter: Sami Siren
> Assigned To: Sami Siren
> Attachments: nutch-0.8-performance.txt,
> NUTCH-395-trunk-metadata-only-2.patch, NUTCH-395-trunk-metadata-only.patch
>
>
> There has been some discussion on the Nutch mailing lists about the fetcher
> being slow; this patch tries to address that. The patch is just a quick hack
> and needs some cleaning up. It also currently applies to the 0.8 branch and
> not trunk, and it has not been tested at large scale. What does it change?
> Metadata - the original metadata uses spellchecking, the new version does not
> (a decorator is provided that can do it, and it should perhaps be used where
> http headers are handled, but in most cases the functionality is not
> required).
> Reading/writing various data structures - the patch tries to do I/O more
> efficiently; see the patch for details.
> Initial benchmark:
> A small benchmark was done to measure the performance of changes with a
> script that basically does the following:
> -inject a list of urls into a fresh crawldb
> -create fetchlist (10k urls pointing to local filesystem)
> -fetch
> -updatedb
> original code from 0.8-branch:
> real 10m51.907s
> user 10m9.914s
> sys 0m21.285s
> after applying the patch:
> real 4m15.313s
> user 3m42.598s
> sys 0m18.485s
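
For anyone who wants to put a number on the difference between the two runs
quoted above, here is a minimal sketch that converts the "real" times into
seconds and computes the speedup. The helper name parse_time is made up for
this illustration and is not part of Nutch or of the patch; the two durations
are copied from the issue description.

```python
import re


def parse_time(text):
    """Convert a `time`-style duration such as '10m51.907s' to seconds.

    Hypothetical helper for this illustration only; not part of Nutch.
    """
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Wall-clock ("real") times quoted in the issue description.
baseline = parse_time("10m51.907s")  # original code from the 0.8 branch
patched = parse_time("4m15.313s")    # after applying the patch

speedup = baseline / patched
saved = baseline - patched
print(f"speedup: {speedup:.2f}x, saved: {saved:.1f}s")
```

With the figures from the description this comes out to roughly a 2.5x
reduction in wall-clock time for the whole benchmark run.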