[ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
Fuad Efendi updated NUTCH-109:
------------------------------

    Summary: Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation  (was: Nutch - Fetcher - HTTP - Performance Testing & Tuning)

I ran performance tests against a default Apache HTTPD web server installation serving 120,000 pages. (I used Teleport Ultra to crawl the HTML pages from www.apache.org, which took roughly 10 hours.)
Everything ran on a separate LAN: Windows XP (client with Nutch 0.7.1) and Suse Linux 9.3 (server with Apache).

I measured a crawl of 21,000 pages (Depth=6, Threads=20); crawling all 120,000 pages seems to take a few days:

Protocol-HTTPClient-Innovation: 1,321,470 milliseconds
Protocol-HTTP: 26,946,076 milliseconds
Protocol-HttpClient: 27,062,854 milliseconds

P.S.
Please note that the Protocol-HTTPClient-Innovation plugin is only a basic version, v.0.1.0; HttpFactory is still growing and contains a cache (3 TCP connections per host).

http://www.innovation.ch/java/HTTPClient/ is very old but _production_ level... The style of the source code may seem dated, and you may need to change "enum" to "enumeration" in the downloaded source files in order to compile it :)))

A very popular load-generating tool is based on HTTPClient (Innovation):
http://grinder.sourceforge.net/
http://www.innovation.ch/java/HTTPClient/

> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
>          Key: NUTCH-109
>          URL: http://issues.apache.org/jira/browse/NUTCH-109
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7, 0.8-dev, 0.6, 0.7.1
>  Environment: Nutch: Windows XP, J2SE 1.4.2_09
>               Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53
>     Reporter: Fuad Efendi
>  Attachments: protocol-httpclient-innovation-0.1.0.zip
>
> 1. A TCP connection costs a lot, not only for Nutch and the end-point web
> servers, but also for intermediary network equipment.
> 2. The web server creates a client thread and hopes that Nutch really uses
> HTTP/1.1, or at least that Nutch sends "Connection: close" before calling
> "Socket.close()" in the JVM...
> I need to run very objective tests, probably over 2-3 days; the new plugin
> crawled/parsed 23,000 pages in 1,321 seconds; it seems that the existing
> http-plugin needs a few days...
> I am using a separate network segment with Windows XP (Nutch) and Suse Linux
> (Apache HTTPD + 120,000 pages).
> Please find attached the new plugin, based on
> http://www.innovation.ch/java/HTTPClient/
> Please note:
> Class HttpFactory contains a cache of HTTPConnection objects; each object runs
> in each thread; each object is absolutely thread-safe, so we can send multiple
> GET requests using a single instance:
>     private static int CLIENTS_PER_HOST =
>         NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...
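
The per-host connection cache described in the issue could be sketched roughly as follows. This is only an illustration of the idea (a fixed number of reusable connections per host, handed out round-robin), not the code in the attached plugin. The class name HostConnectionCache, its methods, and the generic type C (standing in for HTTPConnection) are hypothetical, and the sketch assumes Java 8 or later rather than the J2SE 1.4.2 environment listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/**
 * Hypothetical per-host cache of reusable connections.
 * C stands in for a thread-safe connection type such as HTTPConnection.
 */
final class HostConnectionCache<C> {

    /** Connections opened for one host, plus a round-robin counter. */
    private static final class Slot<C> {
        final List<C> connections = new ArrayList<>();
        final AtomicInteger next = new AtomicInteger();
    }

    private final int clientsPerHost;
    private final Function<String, C> connector;
    private final Map<String, Slot<C>> slots = new ConcurrentHashMap<>();

    /**
     * @param clientsPerHost how many connections to keep per host
     *                       (the issue reads this from "http.clients.per.host", default 3)
     * @param connector      opens one connection to the given host (assumed to be supplied by the caller)
     */
    HostConnectionCache(int clientsPerHost, Function<String, C> connector) {
        if (clientsPerHost < 1) {
            throw new IllegalArgumentException("clientsPerHost must be >= 1");
        }
        this.clientsPerHost = clientsPerHost;
        this.connector = connector;
    }

    /** Returns one of the cached connections for the host, opening them on first use. */
    C get(String host) {
        String key = host.toLowerCase(Locale.ROOT);
        // computeIfAbsent runs the initializer at most once per key.
        Slot<C> slot = slots.computeIfAbsent(key, h -> {
            Slot<C> s = new Slot<>();
            for (int i = 0; i < clientsPerHost; i++) {
                s.connections.add(connector.apply(h));
            }
            return s;
        });
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(slot.next.getAndIncrement(), clientsPerHost);
        return slot.connections.get(index);
    }

    /** Number of hosts that currently have cached connections. */
    int hostCount() {
        return slots.size();
    }
}
```

In a fetcher, every thread would call get(host) instead of opening its own socket, so with 20 threads and a limit of 3 the server sees at most 3 TCP connections from the crawler. This relies on the connection type being safe to share between threads, as the issue states for HTTPConnection.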