Kelvin's OC implementation queues fetch requests by host and uses the HTTP/1.1 protocol. It is currently a Nutch patch.
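As a rough illustration of per-host queuing (a sketch with hypothetical class and method names, not Kelvin's actual patch): bucket the fetch list into one FIFO queue per host, so that all of a host's URLs can later be fetched sequentially over a single keep-alive connection.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: bucket fetch-list URLs per host so each host's pages
// can be fetched in sequence over one keep-alive connection.
public class HostQueues {
    private final Map<String, Deque<String>> queues = new HashMap<>();

    // Naive host extraction, sufficient for the sketch.
    static String hostOf(String url) {
        String s = url.replaceFirst("^https?://", "");
        int slash = s.indexOf('/');
        return slash < 0 ? s : s.substring(0, slash);
    }

    public void add(String url) {
        queues.computeIfAbsent(hostOf(url), h -> new ArrayDeque<>()).add(url);
    }

    // All URLs for one host, in insertion order; the host's queue is removed.
    public List<String> drainHost(String host) {
        Deque<String> q = queues.remove(host);
        return q == null ? Collections.<String>emptyList() : new ArrayList<>(q);
    }

    public Set<String> hosts() { return queues.keySet(); }
}
```

A single fetcher thread can then drain one host's queue over one connection, which is the opposite of the randomized FetchList discussed below.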
Michael Ji,

--- Fuad Efendi <[EMAIL PROTECTED]> wrote:

> Some suggestions to improve performance:
>
> 1. Decrease randomization of the FetchList.
>
> Here is a comment from FetchListTool:
> /**
>  * The TableSet class will allocate a given FetchListEntry
>  * into one of several ArrayFiles. It chooses which
>  * ArrayFile based on a hash of the URL's domain name.
>  *
>  * It uses a hash of the domain name so that pages are
>  * allocated to a random ArrayFile, but same-host pages
>  * go to the same file (for efficiency purposes during fetch).
>  *
>  * Further, within a given file, the FetchListEntry items
>  * appear in random order. This is so that we don't
>  * hammer the same site over and over again during fetch.
>  *
>  * Each table should receive a roughly
>  * even number of entries, but all URLs for a specific
>  * domain name will be found in a single table. If
>  * the dataset is weirdly skewed toward large domains,
>  * there may be an uneven distribution.
>  */
>
> Note "same-host pages go to the same file" - they should go in a
> sequence, without mixing/randomizing with other-host pages...
>
> We fetch a single URL, then we forget about the existence of this
> TCP/IP connection; we even forget that the Web Server created a
> Client Process to handle our HTTP requests - it is called Keep-Alive.
> Creating a TCP connection, and additionally creating such a Client
> Process on a Web Server, costs a lot of CPU on both sides, Nutch and
> the Web Server.
>
> I suggest using a single Keep-Alive thread to fetch a single host,
> without randomization.
>
> 2. Use/investigate more stuff from the Socket API, such as
>    public void setSoTimeout(int timeout)
>    public void setReuseAddress(true)
>
> I found this in the J2SE API for setReuseAddress (default: false):
> =====
> When a TCP connection is closed the connection may remain in a
> timeout state for a period of time after the connection is closed
> (typically known as the TIME_WAIT state or 2MSL wait state). For
> applications using a well known socket address or port it may not be
> possible to bind a socket to the required SocketAddress if there is a
> connection in the timeout state involving the socket address or port.
> =====
>
> It probably means that we are reaching a huge amount (65000!) of
> "waiting" TCP ports after Socket.close(), and Fetcher Threads are
> blocked by the OS, waiting for the OS to release some of these
> ports... Am I right?
>
> P.S.
> Anyway, using the Keep-Alive option is very important not only for us
> but also for Production Web Sites.
>
> Thanks,
> Fuad
>
>
> -----Original Message-----
> From: Fuad Efendi [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 30, 2005 10:58 PM
> To: [email protected]; [EMAIL PROTECTED]
> Subject: RE: what contribute to fetch slowing down
>
>
> Dear Nutchers,
>
> I noticed the same problem twice: with PentiumMobile & WindowsXP &
> 2Gb, and with 2xOpteron252 x SuseLinux x 4Gb.
>
> I have only one explanation, which should probably be mirrored at
> JIRA:
>
> ================
> Network.
> ========
>
> 1.
> I never had such a problem with The Grinder,
> http://grinder.sourceforge.net, which is based on the alternate
> HTTPClient, http://www.innovation.ch/java/HTTPClient/index.html.
> Apache SF should really review their HttpClient RC3(!!!) accordingly;
> HTTPClient (upper-HTTP-case) is not "alpha", it is a production
> version... I used Grinder a lot; it allows executing 32 processes
> with 64 threads each on 2048Mb RAM...
>
> 2.
> I found this in the SUN API:
> java.net.Socket
> public void setReuseAddress(boolean on) - please check the API!!!
>
> 3.
> I saw this code in your PROTOCOL-HTTP:
> ... HTTP/1.0 ...
> Why? Why version 1.0??? It should understand server replies such as
> "Connection: close", "Connection: keep-alive", etc. (please ignore
> typos).
>
> 4.
> By the way, how many files does UNIX need in order to maintain 65536
> network sockets?
>
> Respectfully,
> Fuad
>
> P.S.
> Sorry guys, I don't have enough time to participate... Could you
> please test this suspicious behaviour, and very strange opinion?
> Should I create a new bug report at JIRA?
>
> SUN's Socket, Apache's HttpClient, UNIX's networking...
>
>
> -----Original Message-----
> From: Daniele Menozzi [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 28, 2005 4:42 PM
> To: [email protected]
> Subject: Re: what contribute to fetch slowing down
>
>
> On 10:27:55 28/Sep , AJ Chen wrote:
> > I started the crawler with about 2000 sites. The fetcher could
> > achieve 7 pages/sec initially, but the performance gradually
> > dropped to about 2 pages/sec, sometimes even 0.5 pages/sec. The
> > fetch list had 300k pages and I used 500 threads. What are the main
> > causes of this slowing down?
>
> I have the same problem; I've tried with different numbers of
> fetchers (10, 20, 50, 100, ...), but the download rate always
> decreases.

=== message truncated ===
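The two Socket options Fuad points at can be set on an unconnected socket before it is bound or connected; a minimal sketch (the 10-second timeout is an arbitrary assumption, not a Nutch default):

```java
import java.net.Socket;
import java.net.SocketException;

// Sketch: tune a fetcher socket before bind/connect.
public class SocketTuning {
    public static Socket tuned() throws SocketException {
        Socket s = new Socket();   // unconnected; options must be set before bind/connect
        s.setReuseAddress(true);   // permit binding an address still in TIME_WAIT
        s.setSoTimeout(10000);     // reads time out after 10s instead of hanging a fetcher thread
        return s;
    }
}
```

Note that setReuseAddress only helps with binding local addresses stuck in TIME_WAIT; it does not by itself reduce the number of sockets a fetcher churns through - only connection reuse (Keep-Alive) does that.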
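On point 3 (HTTP/1.0 in protocol-http): a fetcher that speaks HTTP/1.1 must decide from the server's reply whether the TCP connection may be reused. HTTP/1.1 defaults to persistent connections unless the server sends "Connection: close"; HTTP/1.0 defaults to closing unless the server sends "Connection: keep-alive". A minimal sketch of that decision (a hypothetical helper, not Nutch's actual protocol-http code):

```java
// Sketch: decide connection reuse from the status line and Connection header.
public class ConnectionHeader {
    // statusLine: e.g. "HTTP/1.1 200 OK"; connectionHeader: the value of the
    // "Connection:" reply header, or null if the server did not send one.
    public static boolean keepAlive(String statusLine, String connectionHeader) {
        boolean http11 = statusLine.startsWith("HTTP/1.1");
        if (connectionHeader == null) {
            return http11;                       // 1.1 persists by default, 1.0 closes
        }
        String v = connectionHeader.trim().toLowerCase();
        return http11 ? !v.contains("close")     // 1.1: persist unless told to close
                      : v.contains("keep-alive"); // 1.0: close unless told to persist
    }
}
```

A keep-alive fetcher would call this after each response and only open a new socket when it returns false.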
