I never tried Kelvin's OC; I only browsed the source code a little. We need to run tests with both JVM 1.4 and JVM 1.5 (Kelvin's OC).
If I am right, we are simply _killing_ many, many sites running a default Apache HTTPD installation (Microsoft IIS, etc.), which allows about 150 keep-alive clients. (I once configured 6,000 threads for the Worker MPM, but that was very unusual.) The server creates such a client thread for each single HTTP request from Nutch, so after about 150 pages we simply overload the web server and start receiving "connection timeout" exceptions. We need to use a real web server during tests, plus an HTTP proxy (http://grinder.sourceforge.net - a very simple Java-based proxy).

-----Original Message-----
From: Michael Ji [mailto:[EMAIL PROTECTED]
Sent: Sunday, October 02, 2005 5:37 PM
To: [email protected]
Subject: RE: what contributes to fetch slowing down

Kelvin's OC implementation queues fetch requests by host and uses the HTTP/1.1 protocol. It is currently a Nutch patch.

Michael Ji

--- Fuad Efendi <[EMAIL PROTECTED]> wrote:

> Some suggestions to improve performance:
>
> 1. Decrease randomization of the FetchList.
>
> Here is a comment from FetchListTool:
>
> /**
>  * The TableSet class will allocate a given FetchListEntry
>  * into one of several ArrayFiles. It chooses which
>  * ArrayFile based on a hash of the URL's domain name.
>  *
>  * It uses a hash of the domain name so that pages are
>  * allocated to a random ArrayFile, but same-host pages
>  * go to the same file (for efficiency purposes during fetch).
>  *
>  * Further, within a given file, the FetchListEntry items
>  * appear in random order. This is so that we don't
>  * hammer the same site over and over again during fetch.
>  *
>  * Each table should receive a roughly even number of entries,
>  * but all URLs for a specific domain name will be found in a
>  * single table. If the dataset is weirdly skewed toward large
>  * domains, there may be an uneven distribution.
>  */
>
> "Same-host pages go to the same file" - true, but they should also go
> in sequence, without being mixed/randomized with pages from other
> hosts...
>
> We fetch a single URL and then forget that the TCP/IP connection
> exists; we even forget that the web server created a client process to
> handle our HTTP requests - that is what Keep-Alive is for. Creating a
> TCP connection, and additionally creating such a client process on the
> web server, costs a lot of CPU on both sides, Nutch and the web server.
>
> I suggest using a single keep-alive thread to fetch each host, without
> randomization.
>
> 2. Use/investigate more of the Socket API, such as:
>
> public void setSoTimeout(int timeout)
> public void setReuseAddress(boolean on)
>
> I found this in the J2SE API docs for setReuseAddress (default: false):
> =====
> When a TCP connection is closed the connection may remain in a timeout
> state for a period of time after the connection is closed (typically
> known as the TIME_WAIT state or 2MSL wait state). For applications
> using a well known socket address or port it may not be possible to
> bind a socket to the required SocketAddress if there is a connection in
> the timeout state involving the socket address or port.
> =====
>
> It probably means that we are accumulating a huge number (65,000!) of
> "waiting" TCP ports after Socket.close(), and fetcher threads are
> blocked by the OS until it releases some of these ports... Am I right?
>
> P.S.
> Anyway, using the Keep-Alive option is very important not only for us
> but also for production web sites.
>
> Thanks,
> Fuad
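To make suggestion 1 concrete, here is a minimal sketch (not Kelvin's OC and not the actual Nutch fetcher code; the class name is made up, and it assumes Java 5 for generics and the timeout setters) of fetching all pages of one host sequentially on a single thread. java.net.HttpURLConnection transparently reuses a keep-alive connection as long as each response body is fully drained:

=====
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;

// Hypothetical illustration of suggestion 1: one thread per host,
// fetching that host's URLs in sequence over a persistent connection.
public class HostFetcher implements Runnable {

    private final List<URL> urlsOfOneHost; // all URLs share one host

    public HostFetcher(List<URL> urlsOfOneHost) {
        this.urlsOfOneHost = urlsOfOneHost;
    }

    public void run() {
        byte[] buf = new byte[8192];
        for (URL url : urlsOfOneHost) {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) url.openConnection();
                conn.setConnectTimeout(10000); // Java 5+
                conn.setReadTimeout(10000);    // Java 5+
                InputStream in = conn.getInputStream();
                // Fully draining the body is what lets the JDK return
                // the connection to its internal keep-alive cache.
                while (in.read(buf) != -1) { /* consume the page */ }
                in.close();
            } catch (Exception e) {
                // log and continue with this host's next URL
            }
        }
    }
}
=====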
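Similarly for suggestion 2, here are the two Socket calls in action (standard java.net API; the host and timeout values are arbitrary examples). One caveat: setReuseAddress() only allows binding a local address that is still sitting in TIME_WAIT; it does not clear TIME_WAIT sockets on the remote server's side:

=====
import java.net.InetSocketAddress;
import java.net.Socket;

public class SocketOptionsSketch {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket();

        // Allow re-binding a local port that is still in the
        // TIME_WAIT / 2MSL state from a previously closed connection.
        // Must be set before the socket is bound/connected.
        socket.setReuseAddress(true);

        // Make a blocked read() fail after 10 seconds instead of
        // hanging a fetcher thread forever.
        socket.setSoTimeout(10000);

        // Connect with its own 10-second timeout.
        socket.connect(new InetSocketAddress("example.com", 80), 10000);
        socket.close();
    }
}
=====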
> -----Original Message-----
> From: Fuad Efendi [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 30, 2005 10:58 PM
> To: [email protected]; [EMAIL PROTECTED]
> Subject: RE: what contributes to fetch slowing down
>
> Dear Nutchers,
>
> I have seen the same problem twice: once on a Pentium Mobile 2 GHz
> with Windows XP and 2 GB of RAM, and once on a 2x Opteron 252 box with
> SUSE Linux and 4 GB.
>
> I have only one explanation, which should probably be mirrored to
> JIRA:
>
> ================
> Network.
> ========
>
> 1.
> I never had such a problem with The Grinder,
> http://grinder.sourceforge.net, which is based on the alternate
> HTTPClient, http://www.innovation.ch/java/HTTPClient/index.html.
> Apache SF should really review their HttpClient RC3(!!!) accordingly;
> HTTPClient (the upper-HTTP-case one) is not "alpha", it is a
> production version... I used the Grinder a lot; it can run 32
> processes with 64 threads each in 2048 MB of RAM...
>
> 2.
> I found this in the Sun API:
> java.net.Socket
> public void setReuseAddress(boolean on) - please check the API!!!
>
> 3.
> I saw this code in protocol-http:
> ... HTTP/1.0 ...
> Why? Why version 1.0??? The fetcher should understand server replies
> such as "Connection: close", "Connection: keep-alive", etc.
>
> 4.
> By the way, how many file descriptors does UNIX need in order to
> maintain 65536 network sockets?
>
> Respectfully,
> Fuad
>
> P.S.
> Sorry guys, I don't have enough time to participate... Could you
> please test this suspicious behaviour and this very strange opinion?
> Should I create a new bug report in JIRA?
>
> Sun's Socket, Apache's HttpClient, UNIX networking...
>
> -----Original Message-----
> From: Daniele Menozzi [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 28, 2005 4:42 PM
> To: [email protected]
> Subject: Re: what contributes to fetch slowing down
>
> On 10:27:55 28/Sep, AJ Chen wrote:
> > I started the crawler with about 2000 sites. The fetcher could
> > achieve 7 pages/sec initially, but the performance gradually dropped
> > to about 2 pages/sec, sometimes even 0.5 pages/sec. The fetch list
> > had 300k pages and I used 500 threads. What are the main causes of
> > this slowing down?
>
> I have the same problem; I've tried with different numbers of fetchers
> (10, 20, 50, 100, ...), but the download rate always decreases.

=== message truncated ===
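Coming back to point 3 of the September 30 message (the hard-coded HTTP/1.0 request line in protocol-http): here is a hypothetical sketch of an HTTP/1.1 exchange over a raw socket - the mandatory Host header, plus a check of the server's Connection reply header to decide whether the socket can be reused. This is an illustration only, not the actual protocol-http plugin code:

=====
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;

public class Http11Sketch {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("example.com", 80);
        OutputStream out = socket.getOutputStream();

        // HTTP/1.1 requires a Host header; the connection is
        // persistent by default unless the server says otherwise.
        out.write(("GET / HTTP/1.1\r\n"
                 + "Host: example.com\r\n"
                 + "Connection: keep-alive\r\n"
                 + "\r\n").getBytes("US-ASCII"));
        out.flush();

        BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), "US-ASCII"));
        boolean keepAlive = true;
        String line;
        // Read the response headers up to the blank line.
        while ((line = in.readLine()) != null && line.length() > 0) {
            if (line.toLowerCase().startsWith("connection: close")) {
                keepAlive = false; // server closes after this response
            }
        }
        // ...read the body here (Content-Length or chunked), then
        // either send the next request on the same socket or close it.
        if (!keepAlive) {
            socket.close();
        }
    }
}
=====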
