[ 
http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12332079 ] 

Andrzej Bialecki  commented on NUTCH-109:
-----------------------------------------

Fuad, please read again carefully what Otis said: such behaviour by a crawler 
IS generally considered rude / impolite, even if the target machine survives 
it. Whether you use a single TCP connection or multiple connections makes 
almost no difference - you are abusing someone's public service and preventing 
other users from using it. You ran your tests against a bunch of static pages - 
fine, but in real life there is application logic and databases behind them, 
and by flooding the target servers you monopolize those resources too, 
degrading the service for everyone else.

If you really want to flood your target servers with requests, it's up to you - 
you can re-configure Nutch to do it - but you should be prepared for the 
consequences when the target servers ban your crawler's IP. The Nutch project 
should not advocate such irresponsible behaviour. Consequently, we should never 
use such settings as the default.
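
To illustrate the kind of per-host politeness discussed above, here is a 
minimal sketch of a gate that spaces out requests to the same host. It is not 
Nutch code: the class name PoliteHostGate, the reserve() method and the delay 
value are hypothetical, chosen only for illustration, and the caller is assumed 
to pass in the current time and to sleep for the returned number of 
milliseconds before sending the request.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Nutch): enforces a minimum delay
 * between consecutive requests to the same host.
 */
final class PoliteHostGate {
    private final long minDelayMs;
    // For each host, the earliest time (ms) at which the next request may start.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PoliteHostGate(long minDelayMs) {
        this.minDelayMs = minDelayMs;
    }

    /**
     * Reserves the next request slot for the given host.
     *
     * @param host  target host name
     * @param nowMs current time in milliseconds (supplied by the caller)
     * @return how many milliseconds the caller should wait before sending
     */
    synchronized long reserve(String host, long nowMs) {
        long earliest = nextAllowed.getOrDefault(host, nowMs);
        long start = Math.max(earliest, nowMs);
        // The following request to this host must wait at least minDelayMs more.
        nextAllowed.put(host, start + minDelayMs);
        return start - nowMs;
    }
}
```

With a gate like this, several fetcher threads can share one instance: 
requests to different hosts proceed immediately, while requests to the same 
host are queued one delay apart, regardless of how many connections are open.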

Aside from the above, as I said before, the Innovation code is covered by the 
LGPL, so it cannot be imported into the Nutch repository.

> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
>          Key: NUTCH-109
>          URL: http://issues.apache.org/jira/browse/NUTCH-109
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7, 0.6, 0.7.1, 0.8-dev
>  Environment: Nutch: Windows XP, J2SE 1.4.2_09
> Web Server: Suse Linux, Apache HTTPD, apache2-worker,  v. 2.0.53
>     Reporter: Fuad Efendi
>  Attachments: protocol-httpclient-innovation-0.1.0.zip, test_results.txt
>
> 1. TCP connection costs a lot, not only for Nutch and end-point web servers, 
> but also for intermediary network equipment 
> 2. Web Server creates Client thread and hopes that Nutch really uses 
> HTTP/1.1, or at least Nutch sends "Connection: close" before closing in JVM 
> "Socket.close()" ...
> I need to perform very objective tests, probably 2-3 days; the new plugin 
> crawled/parsed 23,000 pages in 1,321 seconds; it seems that the existing 
> http-plugin would need a few days...
> I am using separate network segment with Windows XP (Nutch), and Suse Linux 
> (Apache HTTPD + 120,000 pages)
> Please find attached new plugin based on 
> http://www.innovation.ch/java/HTTPClient/
> Please note: 
> Class HttpFactory contains cache of HTTPConnection objects; each object run 
> each thread; each object is absolutely thread-safe, so we can send multiple 
> GET requests using single instance:
>    private static int CLIENTS_PER_HOST = 
> NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...



