[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12332079 ]
Andrzej Bialecki commented on NUTCH-109:
-----------------------------------------

Fuad, please read again carefully what Otis said: such behaviour by a crawler IS generally considered rude / impolite, even if the target machine survives this. Whether you use a single TCP connection or multiple connections makes almost no difference - you are abusing someone's public service, and prevent other users from using it. You made your tests with a bunch of static pages - fine, but in real life there is some logic and DBs behind them, and by flooding the target servers you monopolize those resources, too, degrading the service for all others.

If you really want to flood your target servers with requests, it's up to you - you can re-configure Nutch to do it - and you should be prepared to suffer from this when the target servers ban your crawler's IP. But the Nutch project should not advocate such irresponsible behaviour. Consequently, we should never use such settings as default.

Aside from the above, as I said before the Innovation code is covered by LGPL, so it cannot be imported to Nutch repository.

> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
>          Key: NUTCH-109
>          URL: http://issues.apache.org/jira/browse/NUTCH-109
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7, 0.6, 0.7.1, 0.8-dev
>  Environment: Nutch: Windows XP, J2SE 1.4.2_09
>               Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53
>     Reporter: Fuad Efendi
>  Attachments: protocol-httpclient-innovation-0.1.0.zip, test_results.txt
>
> 1. TCP connection costs a lot, not only for Nutch and end-point web servers,
> but also for intermediary network equipment
> 2. Web Server creates Client thread and hopes that Nutch really uses
> HTTP/1.1, or at least Nutch sends "Connection: close" before closing in JVM
> "Socket.close()" ...
> I need to perform very objective tests, probably 2-3 days; new plugin
> crawled/parsed 23,000 pages for 1,321 seconds; it seems that existing
> http-plugin needs few days...
> I am using separate network segment with Windows XP (Nutch), and Suse Linux
> (Apache HTTPD + 120,000 pages)
> Please find attached new plugin based on
> http://www.innovation.ch/java/HTTPClient/
> Please note:
> Class HttpFactory contains cache of HTTPConnection objects; each object run
> each thread; each object is absolutely thread-safe, so we can send multiple
> GET requests using single instance:
> private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
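
For illustration only, the per-host politeness argued for in the comment above could be sketched roughly as follows. This is not Nutch code and not the attached plugin: the class `HostPoliteness`, its `reserve` method, and the delay values in the usage note are hypothetical, chosen just to show the idea of spacing out requests to a single host regardless of how many connections or threads are in use.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (an assumption for this example, not part of Nutch):
 * enforces a minimum interval between successive requests to the same host.
 */
class HostPoliteness {
    private final long minDelayMs;
    // For each host, the earliest time (in ms) at which the next request may start.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    HostPoliteness(long minDelayMs) {
        this.minDelayMs = minDelayMs;
    }

    /**
     * Reserves the next fetch slot for the given host and returns how many
     * milliseconds the caller should wait before sending its request.
     * The current time is passed in explicitly so the logic is easy to test.
     */
    synchronized long reserve(String host, long nowMs) {
        long earliest = Math.max(nowMs, nextAllowed.getOrDefault(host, nowMs));
        nextAllowed.put(host, earliest + minDelayMs);
        return earliest - nowMs;
    }
}
```

With an assumed delay of 5000 ms, `reserve("a.example", 0)` returns 0, and a second call `reserve("a.example", 1000)` returns 4000, while a request to a different host at the same moment is not delayed. A fetcher thread would sleep for the returned value before issuing its GET, so reusing one TCP connection or opening several does not change the request rate the target server sees.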