[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331904 ]

Fuad Efendi commented on NUTCH-109:
-----------------------------------

I can't use email right now, so I'm putting my comments here:
===
>Have you seen Kelvin Tan's patch?
>You should take a look, it's in JIRA, and addresses some of the
>HTTP/1.1 issues that you are concerned about.

And my reply:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg01037.html

===
>>   private static InetAddress blockAddr(URL url) throws
>> ProtocolException {...}

>Where is this method?

[plugin-httpclient] & [protocol-http], Http.java

===
>> 1. we are establishing TCP transport, 100-300 milliseconds X 2-3
>> times (TCP HandShake? some IP packets...)
>> 2. Apache HTTPD Server creates Client thread to handle our requests,
>> 1 second (more or less, try Internet Explorer, first page takes few
>> second to download, then browsing works very fast - we have personal
>> Thread on the Server).

>This is often due to the initial hostname address lookup, when the
>domain name server doesn't have the host name IP address already
>cached.

No. The DNS lookup happens only once per JVM lifecycle; the handshakes in
1 and 2 happen many times.

===
>> We have network equipment limitations too, we can't reach more than
>> 65000 threads over single LAN card, and JVM is good (but better is to
>> have multiple JVM/processes, 100 threads each...) 

>65000 threads?  What are you trying to fetch?  The whole web?

It was an example for people trying to use more threads for better performance;
they can't use more than 65000. Also, nobody has tested the JVM; Sun's JVM 1.3.1
performed badly with more than 100 threads (at least on Windows).
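
To illustrate the point about capping the thread count, here is a minimal sketch of a fetcher that runs its jobs on a fixed-size pool instead of one thread per URL. It is not Nutch code: the class name BoundedFetchPool, the method runAll, and the placeholder job body are all assumptions made for illustration, and the figure of 100 threads is only the number mentioned above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: run many fetch jobs on a bounded number of threads.
class BoundedFetchPool {

    // Runs `tasks` placeholder jobs on at most `maxThreads` worker threads
    // and returns how many jobs completed.
    static int runAll(int tasks, int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                // A real fetcher would download and parse a page here (omitted).
                done.incrementAndGet();
            });
        }
        pool.shutdown(); // accept no new jobs, let queued jobs finish
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println("completed jobs: " + runAll(1000, 100));
    }
}
```

Jobs beyond the thread limit wait in the pool's queue, so the number of live threads never exceeds maxThreads no matter how many URLs are submitted.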


> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
>          Key: NUTCH-109
>          URL: http://issues.apache.org/jira/browse/NUTCH-109
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7, 0.8-dev, 0.6, 0.7.1
>  Environment: Nutch: Windows XP, J2SE 1.4.2_09
> Web Server: Suse Linux, Apache HTTPD, apache2-worker,  v. 2.0.53
>     Reporter: Fuad Efendi
>  Attachments: protocol-httpclient-innovation-0.1.0.zip
>
> 1. Establishing a TCP connection costs a lot, not only for Nutch and the
> end-point web servers, but also for intermediary network equipment.
> 2. The web server creates a client thread and expects that Nutch really uses
> HTTP/1.1, or at least that Nutch sends "Connection: close" before calling
> "Socket.close()" in the JVM ...
> I still need to perform properly objective tests, which will probably take
> 2-3 days; the new plugin crawled/parsed 23,000 pages in 1,321 seconds; it
> seems that the existing http-plugin needs a few days...
> I am using a separate network segment with Windows XP (Nutch) and Suse Linux
> (Apache HTTPD + 120,000 pages).
> Please find attached the new plugin, based on
> http://www.innovation.ch/java/HTTPClient/
> Please note: 
> Class HttpFactory contains a cache of HTTPConnection objects; each object runs
> in its own thread; each object is thread-safe, so we can send multiple
> GET requests using a single instance:
>    private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...
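
The per-host cache described in the issue text above could look roughly like the following. This is only a sketch under stated assumptions, not the code in the attached plugin: HostClientCache and its nested Client class are hypothetical stand-ins (Client takes the place of a reusable, thread-safe connection object), and handing out clients round-robin is an assumed policy. The default of 3 clients per host comes from the quoted CLIENTS_PER_HOST setting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: keep a small, fixed set of reusable clients per host.
class HostClientCache {

    // Stand-in for a reusable, thread-safe connection object (assumption).
    static final class Client {
        final String host;
        final int slot;
        Client(String host, int slot) { this.host = host; this.slot = slot; }
    }

    // The clients created for one host plus a round-robin cursor.
    private static final class Slots {
        final List<Client> clients = new ArrayList<>();
        final AtomicInteger next = new AtomicInteger();
    }

    private final int clientsPerHost;
    private final Map<String, Slots> byHost = new ConcurrentHashMap<>();

    HostClientCache(int clientsPerHost) { this.clientsPerHost = clientsPerHost; }

    // Returns one of the cached clients for `host`, creating them on first use.
    Client get(String host) {
        Slots s = byHost.computeIfAbsent(host, h -> {
            Slots created = new Slots();
            for (int i = 0; i < clientsPerHost; i++) {
                created.clients.add(new Client(h, i));
            }
            return created;
        });
        int i = Math.floorMod(s.next.getAndIncrement(), clientsPerHost);
        return s.clients.get(i);
    }

    public static void main(String[] args) {
        HostClientCache cache = new HostClientCache(3);
        for (int n = 0; n < 4; n++) {
            System.out.println("slot " + cache.get("example.org").slot);
        }
    }
}
```

With this layout every request to the same host reuses one of a few long-lived objects, which is what allows the underlying TCP connection to stay open between GET requests.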

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
