[Nutch-dev] [jira] Updated: (NUTCH-395) Increase fetching speed

2006-11-12 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]

Sami Siren updated NUTCH-395:
-

Attachment: NUTCH-395-trunk-metadata-only-2.patch

An additional change to Content cuts the time spent in effective fetching. I am 
now seeing speeds of around 45 pages/sec over http as well.

real    4m24.126s
user    3m53.835s
sys     0m18.681s

3 min 10 sec   effective fetching
6 sec          sorting
27 sec         reduce

 Increase fetching speed
 ---

 Key: NUTCH-395
 URL: http://issues.apache.org/jira/browse/NUTCH-395
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0, 0.8.1
Reporter: Sami Siren
 Assigned To: Sami Siren
 Attachments: nutch-0.8-performance.txt, 
 NUTCH-395-trunk-metadata-only-2.patch, NUTCH-395-trunk-metadata-only.patch


 There has been some discussion on the nutch mailing lists about the fetcher 
 being slow; this patch tries to address that. The patch is just a quick hack 
 and needs some cleaning up. It currently applies to the 0.8 branch and not to 
 trunk, and it has not been tested at large scale.
 What does it change?
 Metadata - the original metadata uses spellchecking, the new version does not. 
 A decorator is provided that can do it, and it should perhaps be used where 
 http headers are handled, but in most cases the functionality is not required.
 Reading/writing various data structures - the patch tries to do IO more 
 efficiently; see the patch for details.
 Initial benchmark:
 A small benchmark was done to measure the performance effect of the changes, 
 using a script that basically does the following:
 - inject a list of urls into a fresh crawldb
 - create a fetchlist (10k urls pointing to the local filesystem)
 - fetch
 - updatedb
 original code from 0.8-branch:
 real    10m51.907s
 user    10m9.914s
 sys     0m21.285s
 after applying the patch:
 real    4m15.313s
 user    3m42.598s
 sys     0m18.485s
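
 To put the two timing blocks above side by side, here is a minimal Java 
 sketch that converts the shell `time` values to seconds and prints the ratio. 
 The class and method names (TimingSpeedup, parseSeconds, speedup) are made up 
 for this illustration and are not part of the patch or of Nutch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimingSpeedup {
    // Matches values such as "10m51.907s" as printed by the shell `time` builtin.
    private static final Pattern TIME = Pattern.compile("(\\d+)m(\\d+(?:\\.\\d+)?)s");

    /** Converts a "<minutes>m<seconds>s" value to seconds. */
    static double parseSeconds(String value) {
        Matcher m = TIME.matcher(value.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected time format: " + value);
        }
        return Integer.parseInt(m.group(1)) * 60 + Double.parseDouble(m.group(2));
    }

    /** Ratio of the time before the patch to the time after it. */
    static double speedup(String before, String after) {
        return parseSeconds(before) / parseSeconds(after);
    }

    public static void main(String[] args) {
        // Figures copied from the 0.8-branch benchmark above.
        System.out.printf("wall clock: %.2fx%n", speedup("10m51.907s", "4m15.313s"));
        System.out.printf("user CPU:   %.2fx%n", speedup("10m9.914s", "3m42.598s"));
    }
}
```

 With the figures above the wall-clock ratio comes out at roughly 2.5x. This 
 is a single run on 10k local urls, so it is only a rough indication.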

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers


[Nutch-dev] [jira] Updated: (NUTCH-395) Increase fetching speed

2006-11-11 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]

Sami Siren updated NUTCH-395:
-

Affects Version/s: 0.9.0



[Nutch-dev] [jira] Updated: (NUTCH-395) Increase fetching speed

2006-11-11 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]

Sami Siren updated NUTCH-395:
-

Attachment: NUTCH-395-trunk-metadata-only.patch

Here's a first stab at an svn trunk version of nutch that just optimizes the use 
of metadata and splits it into two functionally distinct pieces: one for plain 
metadata and one for spellchecking over the keys of metadata.
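
To illustrate the idea of the split, here is a minimal Java sketch of a plain 
key/value store plus a decorator that spellchecks keys before delegating. It is 
not the code in the attached patch: the names (Meta, PlainMeta, 
SpellCheckedMeta), the short list of known header names and the edit-distance 
threshold of 1 are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class MetadataSketch {

    /** Minimal metadata contract (hypothetical, not the Nutch API). */
    interface Meta {
        void set(String name, String value);
        String get(String name);
    }

    /** Plain store: keys are used exactly as given, one map access per call. */
    static final class PlainMeta implements Meta {
        private final Map<String, String> values = new HashMap<>();
        public void set(String name, String value) { values.put(name, value); }
        public String get(String name) { return values.get(name); }
    }

    /** Decorator: maps misspelled or oddly cased keys to a known name first. */
    static final class SpellCheckedMeta implements Meta {
        // Assumed list of canonical names, for illustration only.
        private static final String[] KNOWN =
            {"Content-Type", "Content-Length", "Last-Modified"};
        private final Meta delegate;

        SpellCheckedMeta(Meta delegate) { this.delegate = delegate; }

        public void set(String name, String value) { delegate.set(canonical(name), value); }
        public String get(String name) { return delegate.get(canonical(name)); }

        /** Returns a known name within edit distance 1 of the key, else the key unchanged. */
        static String canonical(String name) {
            String n = normalize(name);
            for (String known : KNOWN) {
                if (editDistance(n, normalize(known)) <= 1) return known;
            }
            return name;
        }

        /** Drops non-letters and lowercases, so "content_type" equals "Content-Type". */
        private static String normalize(String s) {
            return s.replaceAll("[^A-Za-z]", "").toLowerCase(Locale.ROOT);
        }

        /** Classic Levenshtein distance, computed row by row. */
        private static int editDistance(String a, String b) {
            int[] prev = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                int[] cur = new int[b.length() + 1];
                cur[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                }
                prev = cur;
            }
            return prev[b.length()];
        }
    }

    public static void main(String[] args) {
        Meta plain = new PlainMeta();
        plain.set("Content-Type", "text/html");
        System.out.println(plain.get("content_type"));   // null, exact match only

        Meta checked = new SpellCheckedMeta(new PlainMeta());
        checked.set("Content-Type", "text/html");
        System.out.println(checked.get("content_type")); // text/html
        System.out.println(checked.get("Contnt-Type"));  // text/html
    }
}
```

The point of the sketch is that the per-key normalization and distance 
computation only run when the decorator is used, so code paths that never see 
sloppy http header names can use the plain store and skip that cost.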

There's probably still room for optimization on both the metadata and the IO 
side.

The same local filesystem fetching benchmark was run as earlier, this time on 
the trunk version. Even though the benchmark was run with file:// urls, the 
change should affect other protocols as well, specifically because it seems to 
cut down the time needed for the reduce phase quite aggressively.

I would also recommend adding some kind of base benchmark for crawling 
operations to nutch so we don't kill the performance (again and again) at some 
later point.
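
One possible shape for such a base benchmark, as a minimal Java sketch: time 
each crawl stage separately and compare it against a recorded baseline. 
Everything here is hypothetical (the StageTimer class, the regressed check, the 
factor of 2.0), and the stage bodies are empty placeholders where the real 
inject, generate, fetch and updatedb steps would be invoked.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StageTimer {
    // Elapsed nanoseconds per stage, kept in the order the stages were run.
    private final Map<String, Long> nanos = new LinkedHashMap<>();

    /** Runs the work and adds its elapsed time to the named stage. */
    void time(String stage, Runnable work) {
        long start = System.nanoTime();
        work.run();
        nanos.merge(stage, System.nanoTime() - start, Long::sum);
    }

    Map<String, Long> results() {
        return nanos;
    }

    /** True when a stage took more than `factor` times its recorded baseline. */
    static boolean regressed(long baselineNanos, long measuredNanos, double factor) {
        return measuredNanos > baselineNanos * factor;
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        // Placeholders: a real benchmark would call the crawl steps here.
        timer.time("inject", () -> { });
        timer.time("generate", () -> { });
        timer.time("fetch", () -> { });
        timer.time("updatedb", () -> { });

        for (Map.Entry<String, Long> e : timer.results().entrySet()) {
            System.out.printf("%-10s %d ms%n", e.getKey(), e.getValue() / 1_000_000);
        }
    }
}
```

Keeping the per-stage numbers separate makes it easier to see where a 
regression comes from, for example the reduce phase in the figures below.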

from svn trunk
--
real    10m43.527s
user    10m11.210s
sys     0m21.837s

fetch breakdown:
5 min 19 sec   effective fetching
7 sec          sort
4 min 30 sec   reduce


patched version
--
real    4m53.742s
user    4m21.340s
sys     0m19.045s

fetch breakdown:
3 min 36 sec   effective fetching
8 sec          sort
27 sec         reduce


