Increase fetching speed
-----------------------

                 Key: NUTCH-395
                 URL: http://issues.apache.org/jira/browse/NUTCH-395
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 0.8.1
            Reporter: Sami Siren
         Assigned To: Sami Siren


There has been some discussion on the Nutch mailing lists about the fetcher 
being slow; this patch tries to address that. The patch is just a quick hack 
and needs some cleaning up. It currently applies to the 0.8 branch, not trunk, 
and it has not been tested at large scale. What does it change?

Metadata - the original Metadata spellchecks names; the new version does not. 
(A decorator is provided that can do it, and it should perhaps be used where 
HTTP headers are handled, but in most cases the functionality is not required.)
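
The decorator idea can be sketched roughly as follows. This is only an 
illustration and not the code from the patch: the names MetadataStore, 
SimpleMetadata and NormalizingMetadata are hypothetical, and the "spellcheck" 
is reduced to case and punctuation normalization against a small list of 
known header names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical minimal metadata contract used only for this sketch. */
interface MetadataStore {
    void set(String name, String value);
    String get(String name);
}

/** Plain store: exact-match names, no normalization cost on the hot path. */
class SimpleMetadata implements MetadataStore {
    private final Map<String, String> values = new HashMap<>();

    @Override
    public void set(String name, String value) {
        values.put(name, value);
    }

    @Override
    public String get(String name) {
        return values.get(name);
    }
}

/** Decorator: canonicalizes names, then delegates to the wrapped store. */
class NormalizingMetadata implements MetadataStore {
    // Assumed list of known names; a real implementation would carry more.
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        String[] known = {"Content-Type", "Content-Length", "Last-Modified"};
        for (String name : known) {
            CANONICAL.put(key(name), name);
        }
    }

    private final MetadataStore delegate;

    NormalizingMetadata(MetadataStore delegate) {
        this.delegate = delegate;
    }

    /** Lower-cases the name and drops everything except a-z. */
    private static String key(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
    }

    /** Returns the canonical spelling if known, else the name unchanged. */
    static String normalize(String name) {
        String canonical = CANONICAL.get(key(name));
        return canonical != null ? canonical : name;
    }

    @Override
    public void set(String name, String value) {
        delegate.set(normalize(name), value);
    }

    @Override
    public String get(String name) {
        return delegate.get(normalize(name));
    }
}
```

With this split, code that handles HTTP headers would wrap its store as 
new NormalizingMetadata(new SimpleMetadata()), while everything else uses 
SimpleMetadata directly and pays nothing for normalization.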

Reading/writing various data structures - the patch tries to do IO more 
efficiently; see the patch for details.

Initial benchmark:

A small benchmark was done to measure the performance effect of the changes, 
using a script that basically does the following:
-inject a list of urls into a fresh crawldb
-create fetchlist (10k urls pointing to local filesystem)
-fetch
-updatedb

original code from 0.8-branch:
real    10m51.907s
user    10m9.914s
sys     0m21.285s

after applying the patch:
real    4m15.313s
user    3m42.598s
sys     0m18.485s
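
To put the two runs side by side, the timings above can be converted to 
seconds and compared. The sketch below does only that arithmetic; the class 
and method names are made up for illustration. On these numbers the 
wall-clock (real) time drops by a factor of roughly 2.5.

```java
import java.util.Locale;

class TimingSummary {
    /** Parses a time(1)-style value such as "10m51.907s" into seconds. */
    static double toSeconds(String t) {
        int m = t.indexOf('m');
        double minutes = Double.parseDouble(t.substring(0, m));
        // Skip the 'm' and drop the trailing 's'.
        double seconds = Double.parseDouble(t.substring(m + 1, t.length() - 1));
        return minutes * 60 + seconds;
    }

    /** Ratio of the time before the patch to the time after it. */
    static double speedup(String before, String after) {
        return toSeconds(before) / toSeconds(after);
    }

    public static void main(String[] args) {
        // Values copied from the benchmark output above.
        System.out.printf(Locale.ROOT, "real speedup: %.2fx%n",
                speedup("10m51.907s", "4m15.313s"));
        System.out.printf(Locale.ROOT, "user speedup: %.2fx%n",
                speedup("10m9.914s", "3m42.598s"));
    }
}
```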



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
