[ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
Fuad Efendi updated NUTCH-109:
------------------------------
Summary: Nutch - Fetcher - Performance Test - new
Protocol-HTTPClient-Innovation (was: Nutch - Fetcher - HTTP - Performance
Testing & Tuning)
I ran performance tests against a default Apache HTTPD web server
installation serving 120,000 crawled pages (I used Teleport Ultra to mirror
the HTML pages from www.apache.org, which took probably 10 hours).
Everything ran on a separate LAN: Windows XP (client with Nutch 0.7.1) and
Suse Linux 9.3 (server with Apache).
I measured a crawl of 21,000 pages (Depth=6, Threads=20), since crawling all
120,000 pages seems to take a few days:
Protocol-HTTPClient-Innovation:
1,321,470 milliseconds
Protocol-HTTP:
26,946,076 milliseconds
Protocol-HttpClient:
27,062,854 milliseconds
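
To put the three timings side by side, here is a minimal sketch that turns them into throughput and speedup figures. It assumes all three runs fetched the same 21,000 pages; the class and method names are made up for illustration and are not part of Nutch or the plugin. Under that assumption the new plugin comes out roughly 20 times faster, at about 16 pages per second versus less than one page per second for the existing plugins.

```java
public class FetchSpeedup {

    /** Pages fetched per second for a run of `pages` pages that took `millis` ms. */
    static double pagesPerSecond(long pages, long millis) {
        return pages * 1000.0 / millis;
    }

    /** How many times faster the candidate run is than the baseline run. */
    static double speedup(long baselineMillis, long candidateMillis) {
        return (double) baselineMillis / candidateMillis;
    }

    public static void main(String[] args) {
        long pages = 21000L;              // assumption: same page count in every run
        long innovation = 1321470L;       // Protocol-HTTPClient-Innovation
        long protocolHttp = 26946076L;    // Protocol-HTTP
        long httpClient = 27062854L;      // Protocol-HttpClient

        System.out.printf("Innovation: %.2f pages/s%n", pagesPerSecond(pages, innovation));
        System.out.printf("Protocol-HTTP: %.2f pages/s%n", pagesPerSecond(pages, protocolHttp));
        System.out.printf("Protocol-HttpClient: %.2f pages/s%n", pagesPerSecond(pages, httpClient));
        System.out.printf("Speedup vs Protocol-HTTP: %.1fx%n", speedup(protocolHttp, innovation));
        System.out.printf("Speedup vs Protocol-HttpClient: %.1fx%n", speedup(httpClient, innovation));
    }
}
```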
P.S.
Please note that the Protocol-HTTPClient-Innovation plugin is only a basic
version, v.0.1.0.
HttpFactory is still growing and contains a cache (3 TCP connections per host).
http://www.innovation.ch/java/HTTPClient/ is very old but _production_ level.
The style of the source code may seem dated, and you may need to rename "enum"
to "enumeration" in the downloaded source files in order to compile it, since
"enum" is a reserved word as of Java 5 :)))
A very popular load-generating tool, The Grinder, is based on HTTPClient
(Innovation):
http://grinder.sourceforge.net/
http://www.innovation.ch/java/HTTPClient/
> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
> Key: NUTCH-109
> URL: http://issues.apache.org/jira/browse/NUTCH-109
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Versions: 0.7, 0.8-dev, 0.6, 0.7.1
> Environment: Nutch: Windows XP, J2SE 1.4.2_09
> Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53
> Reporter: Fuad Efendi
> Attachments: protocol-httpclient-innovation-0.1.0.zip
>
> 1. A TCP connection costs a lot, not only for Nutch and the end-point web
> servers, but also for intermediary network equipment.
> 2. The web server creates a client thread and expects that Nutch really uses
> HTTP/1.1, or at least that Nutch sends "Connection: close" before calling
> "Socket.close()" in the JVM ...
> I need to perform very objective tests, which will probably take 2-3 days;
> the new plugin crawled/parsed 23,000 pages in 1,321 seconds; it seems that
> the existing http plugin needs a few days...
> I am using a separate network segment with Windows XP (Nutch) and Suse Linux
> (Apache HTTPD + 120,000 pages).
> Please find attached a new plugin based on
> http://www.innovation.ch/java/HTTPClient/
> Please note:
> Class HttpFactory contains a cache of HTTPConnection objects; each object is
> used by every thread; each object is absolutely thread-safe, so we can send
> multiple GET requests using a single instance:
> private static int CLIENTS_PER_HOST =
> NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...
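
The per-host cache described in the quoted text could look roughly like the sketch below. This is not the code from the attached plugin: the class name PerHostConnectionCache, its methods, and the round-robin selection are all assumptions made for illustration, and the connection type is left generic so the example does not depend on the HTTPClient library. It only shows the idea of capping the number of connection objects per host (3 by default in the plugin, via "http.clients.per.host") and sharing them between fetcher threads, which is safe only if the connection objects themselves are thread-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: a bounded, per-host cache of shared connection objects. */
public class PerHostConnectionCache<C> {

    /** Connections for one host plus a counter used for round-robin selection. */
    private static final class Slot<C> {
        final List<C> connections = new ArrayList<>();
        final AtomicInteger next = new AtomicInteger();
    }

    private final int clientsPerHost;
    private final Function<String, C> connector;
    private final Map<String, Slot<C>> slots = new ConcurrentHashMap<>();

    /**
     * @param clientsPerHost maximum number of connection objects kept per host
     * @param connector      creates one connection object for a given host
     */
    public PerHostConnectionCache(int clientsPerHost, Function<String, C> connector) {
        if (clientsPerHost < 1) {
            throw new IllegalArgumentException("clientsPerHost must be >= 1");
        }
        this.clientsPerHost = clientsPerHost;
        this.connector = connector;
    }

    /** Returns one of the cached connections for the host, creating them on first use. */
    public C get(String host) {
        // computeIfAbsent runs the factory at most once per host, so no host
        // ever gets more than clientsPerHost connection objects.
        Slot<C> slot = slots.computeIfAbsent(host, h -> {
            Slot<C> s = new Slot<>();
            for (int i = 0; i < clientsPerHost; i++) {
                s.connections.add(connector.apply(h));
            }
            return s;
        });
        // Round-robin over the cached connections; floorMod keeps the index
        // non-negative even after the counter overflows.
        int index = Math.floorMod(slot.next.getAndIncrement(), clientsPerHost);
        return slot.connections.get(index);
    }

    /** Number of hosts that currently have cached connections. */
    public int hostCount() {
        return slots.size();
    }
}
```

A fetcher thread would call get(host) before each request and issue its GET through the returned object; the connections are never closed here, so a real implementation would also need eviction and shutdown handling.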