[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331904 ]
Fuad Efendi commented on NUTCH-109:
-----------------------------------

I can't use email right now, so I am putting my comments here:

===
>Have you seen Kelvin Tan's patch?
>You should take a look, it's in JIRA, and addresses some of the
>HTTP/1.1 issues that you are concerned about.

And my reply:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg01037.html

===
>> private static InetAddress blockAddr(URL url) throws
>> ProtocolException {...}
>Where is this method?

[plugin-httpclient] & [protocol-http], Http.java

===
>> 1. we are establishing TCP transport, 100-300 milliseconds X 2-3
>> times (TCP HandShake? some IP packets...)
>> 2. Apache HTTPD Server creates Client thread to handle our requests,
>> 1 second (more or less, try Internet Explorer, first page takes few
>> second to download, then browsing works very fast - we have personal
>> Thread on the Server).
>This is often be due to the initial hostname address lookup, when the
>domain name server doesn't have the host name IP address already
>cached.

No. The DNS lookup happens only once per JVM lifecycle; the handshakes in
1 and 2 happen many times.

===
>> We have network equipment limitations too, we can't reach more than
>> 65000 threads over single LAN card, and JVM is good (but better is to
>> have multiple JVM/processes, 100 threads each...)
>65000 threads? What are you trying to fetch? The whole web?

It was an example for people who try to use more threads for better
performance; they can't use more than 65000. Also, nobody has tested the
JVM: Sun's JVM 1.3.1 performed badly with more than 100 threads (at
least on Windows).
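To make the per-connection cost in points 1 and 2 concrete, here is a rough arithmetic sketch. It assumes the worst case where every page opens a fresh connection and nothing overlaps across threads; the class and method names are hypothetical, and the inputs are simply the figures quoted above (100-300 ms, 2-3 times, about 1 second) applied to the 23,000 pages from the issue description.

```java
// Hypothetical helper, not part of Nutch: estimates time spent purely on
// connection setup when each page fetch opens a new TCP connection.
class HandshakeCost {

    // pages         number of pages fetched, one new connection each (assumption)
    // handshakeMs   cost of one network round of connection setup, in ms
    // rounds        how many such rounds happen per connection
    // serverSetupMs time the server needs to hand the client a worker thread, in ms
    static double overheadSeconds(int pages, double handshakeMs,
                                  int rounds, double serverSetupMs) {
        double perConnectionMs = handshakeMs * rounds + serverSetupMs;
        return pages * perConnectionMs / 1000.0;
    }

    public static void main(String[] args) {
        // Lower and upper bounds taken from the ranges quoted in the comment.
        double low = overheadSeconds(23000, 100, 2, 1000);
        double high = overheadSeconds(23000, 300, 3, 1000);
        System.out.println("setup overhead, low estimate (s):  " + low);
        System.out.println("setup overhead, high estimate (s): " + high);
    }
}
```

Under these assumptions the setup overhead alone is between 27,600 and 43,700 seconds if paid sequentially, which is the cost a persistent HTTP/1.1 connection would pay only once per connection instead of once per page.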

> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
>          Key: NUTCH-109
>          URL: http://issues.apache.org/jira/browse/NUTCH-109
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7, 0.8-dev, 0.6, 0.7.1
>  Environment: Nutch: Windows XP, J2SE 1.4.2_09
> Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53
>     Reporter: Fuad Efendi
>  Attachments: protocol-httpclient-innovation-0.1.0.zip
>
> 1. TCP connection costs a lot, not only for Nutch and end-point web servers,
> but also for intermediary network equipment
> 2. Web Server creates Client thread and hopes that Nutch really uses
> HTTP/1.1, or at least Nutch sends "Connection: close" before closing in JVM
> "Socket.close()" ...
> I need to perform very objective tests, probably 2-3 days; new plugin
> crawled/parsed 23,000 pages for 1,321 seconds; it seems that existing
> http-plugin needs few days...
> I am using separate network segment with Windows XP (Nutch), and Suse Linux
> (Apache HTTPD + 120,000 pages)
> Please find attached new plugin based on
> http://www.innovation.ch/java/HTTPClient/
> Please note:
> Class HttpFactory contains cache of HTTPConnection objects; each object run
> each thread; each object is absolutely thread-safe, so we can send multiple
> GET requests using single instance:
> private static int CLIENTS_PER_HOST =
>     NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...
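
The per-host cache described in the issue might look roughly like the sketch below. This is not the code from the attached plugin: the class name, the generic connection type C, and the connector function are all hypothetical, and it is written in current Java rather than the J2SE 1.4 the plugin targets. It only illustrates the idea of keeping at most CLIENTS_PER_HOST connection objects per host and handing the same instances out repeatedly, on the assumption (stated in the issue) that one connection object can safely serve several requests.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a per-host connection cache; C stands in for
// whatever connection type the HTTP library provides.
class HostConnectionCache<C> {
    private final int clientsPerHost;               // cf. http.clients.per.host
    private final Function<String, C> connector;    // opens a connection for a host key
    private final Map<String, List<C>> pools = new HashMap<>();
    private final Map<String, Integer> next = new HashMap<>();

    HostConnectionCache(int clientsPerHost, Function<String, C> connector) {
        if (clientsPerHost < 1) {
            throw new IllegalArgumentException("clientsPerHost must be >= 1");
        }
        this.clientsPerHost = clientsPerHost;
        this.connector = connector;
    }

    // Returns a connection for the host. New connections are opened until the
    // per-host limit is reached; after that, existing ones are reused in
    // round-robin order, so no further connection setup is paid.
    synchronized C get(String hostKey) {
        List<C> pool = pools.computeIfAbsent(hostKey, k -> new ArrayList<>());
        if (pool.size() < clientsPerHost) {
            C created = connector.apply(hostKey);
            pool.add(created);
            return created;
        }
        int i = next.getOrDefault(hostKey, 0);
        next.put(hostKey, (i + 1) % clientsPerHost);
        return pool.get(i);
    }

    public static void main(String[] args) {
        // Strings stand in for real connections in this demonstration.
        int[] opened = {0};
        HostConnectionCache<String> cache =
            new HostConnectionCache<>(3, host -> host + "#" + (++opened[0]));
        for (int n = 0; n < 6; n++) {
            System.out.println(cache.get("example.org:80"));
        }
        System.out.println("connections opened: " + opened[0]);
    }
}
```

With a limit of 3, six requests to the same host open only three connections; the fourth to sixth calls get the first three objects back. A real implementation would also need to handle connections that the server has closed, which this sketch ignores.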
