Nutch - Fetcher - HTTP - Performance Testing & Tuning
-----------------------------------------------------
Key: NUTCH-109
URL: http://issues.apache.org/jira/browse/NUTCH-109
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.7, 0.6, 0.7.1, 0.8-dev
Environment: Nutch: Windows XP, J2SE 1.4.2_09
Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53
Reporter: Fuad Efendi
1. Opening a TCP connection is expensive, not only for Nutch and the target
web servers, but also for the intermediary network equipment.
2. The web server allocates a client thread and expects that Nutch really uses
HTTP/1.1, or at least that Nutch sends "Connection: close" before the JVM
calls "Socket.close()" ...
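To illustrate point 2, here is a minimal sketch of the request text a polite
client could send. It is not taken from Nutch or from the attached plugin; the
class and method names are hypothetical. It relies only on the HTTP/1.1 rule
that connections are persistent by default, so the "Connection: close" header
is added only to the last request before the socket is closed.

```java
// Hypothetical helper for illustration only; not part of Nutch or the plugin.
final class RequestSketch {

    /**
     * Builds a minimal HTTP/1.1 GET request.
     * keepAlive == true  : no Connection header, the connection stays open
     *                      (HTTP/1.1 connections are persistent by default).
     * keepAlive == false : adds "Connection: close" so the server knows the
     *                      client will close the socket after this response.
     */
    static String buildGet(String host, String path, boolean keepAlive) {
        StringBuilder sb = new StringBuilder();
        sb.append("GET ").append(path).append(" HTTP/1.1\r\n");
        sb.append("Host: ").append(host).append("\r\n");
        if (!keepAlive) {
            sb.append("Connection: close\r\n");
        }
        sb.append("\r\n"); // empty line ends the header section
        return sb.toString();
    }

    public static void main(String[] args) {
        // Several pages over one connection; only the last request asks to close.
        System.out.print(buildGet("example.org", "/a.html", true));
        System.out.print(buildGet("example.org", "/b.html", false));
    }
}
```

With this pattern the server can release its client thread as soon as the last
response is written, instead of waiting for a keep-alive timeout.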
I need to run properly objective tests, which will probably take 2-3 days. The
new plugin crawled and parsed 23,000 pages in 1,321 seconds; the existing
http-plugin seems to need a few days...
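For scale, the figures above work out to roughly 17 pages per second. The
sketch below only redoes that arithmetic and projects it onto the 120,000-page
test site, under the assumption (mine, not a measured result) that the rate
stays constant. The class and method names are hypothetical.

```java
// Back-of-the-envelope arithmetic only; not a benchmark and not part of Nutch.
final class ThroughputSketch {

    /** Average fetch rate in pages per second. */
    static double pagesPerSecond(long pages, long seconds) {
        return (double) pages / seconds;
    }

    /** Projected duration in seconds, assuming the rate stays constant. */
    static double secondsFor(long pages, double pagesPerSecond) {
        return pages / pagesPerSecond;
    }

    public static void main(String[] args) {
        double rate = pagesPerSecond(23_000, 1_321);  // figures reported above
        double projected = secondsFor(120_000, rate); // size of the test site
        System.out.printf("%.1f pages/s, about %.1f hours for 120,000 pages%n",
                rate, projected / 3600.0);
    }
}
```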
I am using a separate network segment with Windows XP (Nutch) and Suse Linux
(Apache HTTPD + 120,000 pages).
Please find attached a new plugin based on
http://www.innovation.ch/java/HTTPClient/
Please note:
The HttpFactory class contains a cache of HTTPConnection objects. Each object
can be used by every thread, because each object is fully thread-safe, so we
can send multiple GET requests through a single instance:
    private static int CLIENTS_PER_HOST =
        NutchConf.get().getInt("http.clients.per.host", 3);
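One way such a per-host cache could look is sketched below. This is not the
attached HttpFactory: the class HostConnectionCache, its constructor, and the
round-robin selection are assumptions made for illustration, and the
connection type is left generic so that the sketch does not depend on the
HTTPClient library. It only assumes, as stated above, that a connection object
may safely be shared between threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch; C stands for any thread-safe connection type.
final class HostConnectionCache<C> {
    private final int clientsPerHost;            // e.g. http.clients.per.host
    private final Function<String, C> connector; // opens one connection to a host
    private final Map<String, List<C>> byHost = new ConcurrentHashMap<>();
    private final Map<String, AtomicInteger> cursors = new ConcurrentHashMap<>();

    HostConnectionCache(int clientsPerHost, Function<String, C> connector) {
        if (clientsPerHost < 1) {
            throw new IllegalArgumentException("clientsPerHost must be >= 1");
        }
        this.clientsPerHost = clientsPerHost;
        this.connector = connector;
    }

    /** Returns one of the cached connections for the host, round-robin. */
    C get(String host) {
        // Create the pool for a host once, on first use.
        List<C> pool = byHost.computeIfAbsent(host, h -> {
            List<C> created = new ArrayList<>(clientsPerHost);
            for (int i = 0; i < clientsPerHost; i++) {
                created.add(connector.apply(h));
            }
            return created;
        });
        AtomicInteger cursor = cursors.computeIfAbsent(host, h -> new AtomicInteger());
        // floorMod keeps the index valid even if the counter overflows.
        int index = Math.floorMod(cursor.getAndIncrement(), clientsPerHost);
        return pool.get(index);
    }
}
```

With clientsPerHost set to 3, every fetcher thread that asks for the same host
is handed one of the same three connections, so no new TCP connection is
opened per request.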
I'll add more comments after finishing tests...