Nutch - Fetcher - HTTP - Performance Testing & Tuning
-----------------------------------------------------

         Key: NUTCH-109
         URL: http://issues.apache.org/jira/browse/NUTCH-109
     Project: Nutch
        Type: Improvement
  Components: fetcher  
    Versions: 0.7, 0.6, 0.7.1, 0.8-dev    
 Environment: Nutch: Windows XP, J2SE 1.4.2_09
Web Server: Suse Linux, Apache HTTPD, apache2-worker,  v. 2.0.53
    Reporter: Fuad Efendi


1. Opening a TCP connection is expensive, not only for Nutch and the 
target web servers, but also for intermediary network equipment.

2. The web server allocates a client thread and expects that Nutch really 
uses HTTP/1.1 (persistent connections), or at least that Nutch sends 
"Connection: close" before the JVM calls "Socket.close()" ...
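
To illustrate point 2, here is a minimal sketch (not taken from the 
attached plugin) of what the request text could look like on a reused 
connection. In HTTP/1.1 a connection is persistent by default, and the 
client announces that it is about to close by adding "Connection: close" 
to its last request. The class name RequestText and the host and paths 
in the usage note are hypothetical; the sketch is written in current 
Java rather than J2SE 1.4.

```java
// Hypothetical helper, for illustration only.
final class RequestText {

    /**
     * Builds a minimal HTTP/1.1 GET request.
     * When lastRequest is true, "Connection: close" is added so the
     * server knows the client will close the socket after the response.
     */
    static String get(String host, String path, boolean lastRequest) {
        StringBuilder sb = new StringBuilder();
        sb.append("GET ").append(path).append(" HTTP/1.1\r\n");
        sb.append("Host: ").append(host).append("\r\n");
        if (lastRequest) {
            sb.append("Connection: close\r\n");
        }
        sb.append("\r\n"); // empty line ends the header block
        return sb.toString();
    }
}
```

For example, RequestText.get("www.example.org", "/a.html", false) 
followed by RequestText.get("www.example.org", "/b.html", true) could be 
written to the same socket, so both pages cost one TCP connection.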

I still need to run objective tests, which will probably take 2-3 days. So 
far the new plugin crawled/parsed 23,000 pages in 1,321 seconds (roughly 
17 pages per second); it seems that the existing http-plugin would need 
a few days...

I am using a separate network segment with Windows XP (Nutch) and Suse 
Linux (Apache HTTPD + 120,000 pages).

Please find attached the new plugin, based on 
http://www.innovation.ch/java/HTTPClient/

Please note: 

The HttpFactory class contains a cache of HTTPConnection objects; each 
object is shared across threads, and each object is absolutely thread-safe, 
so we can send multiple GET requests using a single instance:
   private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);
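
The caching idea can be sketched roughly as follows. This is not the 
code of the attached plugin; it is a minimal illustration, assuming that 
at most "clients per host" connection objects are created for each host 
and then handed out in round-robin order. The class HostConnectionCache, 
its methods, and the factory argument are hypothetical, and the 
connection type is left generic so that the sketch does not depend on 
the HTTPClient library. It is written in current Java rather than J2SE 1.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical per-host cache of reusable (thread-safe) connection objects.
final class HostConnectionCache<C> {

    // Connections created so far for one host, plus a round-robin cursor.
    private static final class Slot<C> {
        final List<C> connections = new ArrayList<>();
        int next = 0;
    }

    private final int clientsPerHost;
    private final Function<String, C> factory;
    private final Map<String, Slot<C>> slots = new ConcurrentHashMap<>();

    HostConnectionCache(int clientsPerHost, Function<String, C> factory) {
        if (clientsPerHost < 1) {
            throw new IllegalArgumentException("clientsPerHost must be >= 1");
        }
        this.clientsPerHost = clientsPerHost;
        this.factory = factory;
    }

    /** Returns a connection for the host, creating at most clientsPerHost of them. */
    C get(String host) {
        Slot<C> slot = slots.computeIfAbsent(host, h -> new Slot<>());
        synchronized (slot) {
            if (slot.connections.size() < clientsPerHost) {
                C created = factory.apply(host);
                slot.connections.add(created);
                return created;
            }
            // All connections exist: reuse them in round-robin order.
            int index = Math.floorMod(slot.next++, clientsPerHost);
            return slot.connections.get(index);
        }
    }

    /** Number of connections currently cached for the host. */
    int size(String host) {
        Slot<C> slot = slots.get(host);
        if (slot == null) {
            return 0;
        }
        synchronized (slot) {
            return slot.connections.size();
        }
    }
}
```

With a limit of 3, the first three calls to get() for a host create new 
connections and the fourth call returns the first one again. Sharing an 
instance between fetcher threads is only safe if the connection object 
itself is thread-safe, as described above for HTTPConnection.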

I'll add more comments after finishing tests...



