[Nutch-dev] [jira] Updated: (NUTCH-395) Increase fetching speed
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]

Sami Siren updated NUTCH-395:

    Attachment: NUTCH-395-trunk-metadata-only-2.patch

An additional change to Content cuts down the time needed for effective fetching. I am now seeing speeds of around 45 pages/sec on http as well.

real    4m24.126s
user    3m53.835s
sys     0m18.681s

3 min 10 sec   effective fetching
6 sec          sorting
27 sec         reduce

Increase fetching speed
---

Key: NUTCH-395
URL: http://issues.apache.org/jira/browse/NUTCH-395
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0, 0.8.1
Reporter: Sami Siren
Assigned To: Sami Siren
Attachments: nutch-0.8-performance.txt, NUTCH-395-trunk-metadata-only-2.patch, NUTCH-395-trunk-metadata-only.patch

There has been some discussion on the Nutch mailing lists about the fetcher being slow; this patch tries to address that. The patch is just a quick hack and needs some cleaning up. It currently applies to the 0.8 branch and not to trunk, and it has not been tested at large scale.

What does it change?

Metadata - the original metadata uses spellchecking, the new version does not (a decorator is provided that can do it, and it should perhaps be used where http headers are handled, but in most cases the functionality is not required).

Reading/writing various data structures - the patch tries to do IO more efficiently; see the patch for details.

Initial benchmark:

A small benchmark was done to measure the performance of the changes, with a script that basically does the following:
- inject a list of urls into a fresh crawldb
- create a fetchlist (10k urls pointing to the local filesystem)
- fetch
- updatedb

original code from 0.8-branch:
real    10m51.907s
user    10m9.914s
sys     0m21.285s

after applying the patch:
real    4m15.313s
user    3m42.598s
sys     0m18.485s

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira

___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
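
The issue description for NUTCH-395 mentions a plain metadata store plus a decorator that adds spellchecking of keys only where it is needed. A minimal Java sketch of that idea follows. It is not the code from the patch: the names `SimpleMetadata`, `PlainMetadata` and `SpellCheckingMetadata`, the list of canonical header names and the edit-distance threshold of 2 are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal metadata contract (not the Nutch API). */
interface SimpleMetadata {
    void set(String name, String value);
    String get(String name);
}

/** Plain store: exact-match keys, no per-call string normalization. */
class PlainMetadata implements SimpleMetadata {
    private final Map<String, String> values = new HashMap<>();
    public void set(String name, String value) { values.put(name, value); }
    public String get(String name) { return values.get(name); }
}

/** Decorator: maps near-miss names onto canonical ones, then delegates. */
class SpellCheckingMetadata implements SimpleMetadata {
    // Assumed sample of canonical names; a real list would be longer.
    private static final String[] CANONICAL = {
        "Content-Type", "Content-Length", "Last-Modified"
    };
    private static final Map<String, String> BY_NORMALIZED = new HashMap<>();
    static {
        for (String c : CANONICAL) BY_NORMALIZED.put(normalize(c), c);
    }

    private final SimpleMetadata delegate;

    SpellCheckingMetadata(SimpleMetadata delegate) { this.delegate = delegate; }

    public void set(String name, String value) { delegate.set(canonical(name), value); }
    public String get(String name) { return delegate.get(canonical(name)); }

    /** Lowercases and drops everything except letters and digits. */
    static String normalize(String name) {
        StringBuilder sb = new StringBuilder();
        for (char ch : name.toCharArray()) {
            if (Character.isLetterOrDigit(ch)) sb.append(Character.toLowerCase(ch));
        }
        return sb.toString();
    }

    /** Returns the canonical name if one is close enough, else the input unchanged. */
    static String canonical(String name) {
        String n = normalize(name);
        String exact = BY_NORMALIZED.get(n);
        if (exact != null) return exact;
        for (Map.Entry<String, String> e : BY_NORMALIZED.entrySet()) {
            if (editDistance(n, e.getKey()) <= 2) return e.getValue();
        }
        return name;
    }

    /** Classic dynamic-programming Levenshtein distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }
}
```

In a layout like this, the normalization and edit-distance work is paid only by callers that wrap the store, for example code handling http headers, while everything else uses the plain map directly.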
[Nutch-dev] [jira] Updated: (NUTCH-395) Increase fetching speed
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]

Sami Siren updated NUTCH-395:

    Affects Version/s: 0.9.0
[Nutch-dev] [jira] Updated: (NUTCH-395) Increase fetching speed
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]

Sami Siren updated NUTCH-395:

    Attachment: NUTCH-395-trunk-metadata-only.patch

Here's a first stab at an svn trunk version of Nutch that just optimizes the use of metadata and splits it into two functionally distinct pieces: one for plain metadata and one for spellchecking over the keys of metadata. There's probably still room for optimization on both the metadata and the IO side.

The same local filesystem fetching benchmark was run as earlier, this time on the trunk version. Even though the benchmark was run with file:// urls, the change should affect other protocols as well, specifically because it seems to cut down the time needed for the reduce phase quite aggressively.

I would also recommend adding some kind of base benchmark for crawling operations to Nutch so we don't kill the performance (again and again) at some point.

from svn trunk
--
real    10m43.527s
user    10m11.210s
sys     0m21.837s

fetch breakdown:
5 min 19 sec   effective fetching
7 sec          sort
4 min 30 sec   reduce

patched version
--
real    4m53.742s
user    4m21.340s
sys     0m19.045s

fetch breakdown:
3 min 36 sec   effective fetching
8 sec          sort
27 sec         reduce
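
To make the trunk figures above easier to compare, here is a small Java sketch that converts `time`-style values such as "10m43.527s" into seconds and computes ratios. The class name `TimingMath` and its methods are made up for this illustration; only the timing strings come from the message.

```java
/** Hypothetical helper for arithmetic on `time`-style figures like "10m43.527s". */
class TimingMath {
    /** Parses "<minutes>m<seconds>s" into a number of seconds. */
    static double parseSeconds(String t) {
        int m = t.indexOf('m');
        int s = t.lastIndexOf('s');
        double minutes = Double.parseDouble(t.substring(0, m));
        double seconds = Double.parseDouble(t.substring(m + 1, s));
        return minutes * 60.0 + seconds;
    }

    /** Ratio of the "before" duration to the "after" duration. */
    static double speedup(String before, String after) {
        return parseSeconds(before) / parseSeconds(after);
    }

    public static void main(String[] args) {
        // Wall clock ("real") figures quoted for svn trunk vs. the patched version.
        System.out.printf("wall clock speedup: %.2fx%n",
                speedup("10m43.527s", "4m53.742s"));
        // Reduce phase: 4 min 30 sec before, 27 sec after.
        double reduceBefore = 4 * 60 + 30;
        double reduceAfter = 27;
        System.out.printf("reduce speedup: %.1fx%n", reduceBefore / reduceAfter);
    }
}
```

On these numbers the reduce phase shrinks from 270 seconds to 27 seconds, a factor of 10, while the total wall clock time improves by roughly a factor of 2.2.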