Some suggestions to improve performance:
1. Decrease the randomization of the FetchList.
Here is a comment from FetchListTool:
/**
* The TableSet class will allocate a given FetchListEntry
* into one of several ArrayFiles. It chooses which
* ArrayFile based on a hash of the URL's domain name.
*
* It uses a hash of the domain name so that pages are
* allocated to a random ArrayFile, but same-host pages
* go to the same file (for efficiency purposes during
* fetch).
*
* Further, within a given file, the FetchListEntry items
* appear in random order. This is so that we don't
* hammer the same site over and over again during fetch.
*
* Each table should receive a roughly
* even number of entries, but all URLs for a specific
* domain name will be found in a single table. If
* the dataset is weirdly skewed toward large domains,
* there may be an uneven distribution.
*/
Same "same-host pages go to the same file" - they should go in a
sequence, without mixing/randomizing with other host-pages...
We are fetching single URL, then we forget about existense of this
TCP/IP connection, we even forget that Web Server created Client Process
to handle our HTTP requests, it is called Keep Alive. Creation of TCP
connection, and additionally creation of a such Client Process on a Web
Server costs a lot of CPU on both sides, Nutch & WebServer.
I suggest to use single Keep-Alive thread to fetch single Host, without
randomization.
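Just to illustrate the idea - this is only a rough sketch with hypothetical
example.com URLs, not actual Nutch code - plain java.net.HttpURLConnection
already reuses the underlying TCP connection for same-host requests, provided
each response body is read to the end:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class KeepAliveSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical same-host URLs; in the fetcher these would be one
    // (non-randomized) same-host slice of the fetch list.
    String[] urls = {
      "http://example.com/page1.html",
      "http://example.com/page2.html",
      "http://example.com/page3.html"
    };
    for (String u : urls) {
      HttpURLConnection conn = (HttpURLConnection) new URL(u).openConnection();
      conn.setRequestProperty("Connection", "keep-alive");
      InputStream in = conn.getInputStream();
      byte[] buf = new byte[4096];
      while (in.read(buf) != -1) {
        // consume the whole body so the JVM can keep the TCP
        // connection alive and reuse it for the next URL
      }
      in.close(); // returns the connection to the keep-alive cache
    }
  }
}

If one fetcher thread walks through one host's URLs in sequence like this, the
connection (and the server-side client process) is reused instead of being
re-created for every single page.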
2. Use/investigate more stuff from the Socket API, such as
public void setSoTimeout(int timeout)
public void setReuseAddress(boolean on)
I found this in the J2SE API docs for setReuseAddress (default: false):
=====
When a TCP connection is closed the connection may remain in a timeout
state for a period of time after the connection is closed (typically
known as the TIME_WAIT state or 2MSL wait state). For applications using
a well known socket address or port it may not be possible to bind a
socket to the required SocketAddress if there is a connection in the
timeout state involving the socket address or port.
=====
It probably means that we are accumulating a huge number (up to 65000!) of
"waiting" TCP ports after Socket.close(), and the fetcher threads are blocked
by the OS until it releases some of these ports... Am I right?
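For illustration only - hypothetical host, port and timeout values, not the
current fetcher code - this is roughly how those options would be set on a
plain java.net.Socket; note that setReuseAddress() only has an effect if it is
called before the socket is bound/connected:

import java.net.InetSocketAddress;
import java.net.Socket;

public class SocketOptionsSketch {
  public static void main(String[] args) throws Exception {
    Socket socket = new Socket();      // unconnected socket
    socket.setReuseAddress(true);      // must be set before bind/connect
    socket.setSoTimeout(10 * 1000);    // read timeout in ms: fail instead of hanging forever
    socket.connect(new InetSocketAddress("example.com", 80), 10 * 1000); // connect timeout
    // ... write the HTTP request and read the response here ...
    socket.close();                    // the local port may still sit in TIME_WAIT afterwards
  }
}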
P.S.
Anyway, using the Keep-Alive option is very important not only for us but
also for production web sites.
Thanks,
Fuad
-----Original Message-----
From: Fuad Efendi [mailto:[EMAIL PROTECTED]
Sent: Friday, September 30, 2005 10:58 PM
To: [email protected]; [EMAIL PROTECTED]
Subject: RE: what contibute to fetch slowing down
Dear Nutchers,
I noticed the same problem twice: with a Pentium Mobile 2 GHz, Windows XP and
2 GB, and with a 2x Opteron 252, SUSE Linux and 4 GB.
I have only one explanation, which should probably be mirrored at JIRA:
================
Network.
========
1.
I never had such a problem with The Grinder, http://grinder.sourceforge.net,
which is based on the alternative HTTPClient,
http://www.innovation.ch/java/HTTPClient/index.html. The Apache SF folks
should really review their HttpClient RC3(!!!) accordingly; HTTPClient
(upper-HTTP-case) is not "alpha", it is a production version... I used
Grinder a lot; it can run 32 processes with 64 threads each in 2048 MB of
RAM...
2.
I found this in the Sun API:
java.net.Socket
public void setReuseAddress(boolean on) - please check the API!!!
3.
I saw this code in your protocol-http plugin:
... HTTP/1.0 ...
Why? Why version 1.0??? It should understand the server's reply, such as
"Connection: close", "Connection: keep-alive", etc. (A rough HTTP/1.1 example
is sketched below, after point 4.)
4.
By the way, how many open files (file descriptors) does UNIX need in order to
maintain 65536 network sockets?
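Coming back to point 3: just for comparison - hypothetical host and path, a
raw socket, not the actual protocol-http code - a hand-written HTTP/1.1
request looks roughly like this; the Host header is mandatory and the
connection stays open by default:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;

public class Http11RequestSketch {
  public static void main(String[] args) throws Exception {
    Socket socket = new Socket("example.com", 80);
    OutputStream out = socket.getOutputStream();
    // HTTP/1.1: Host is required; the connection is persistent unless
    // either side sends "Connection: close".
    String request =
        "GET /index.html HTTP/1.1\r\n" +
        "Host: example.com\r\n" +
        "Connection: keep-alive\r\n" +
        "\r\n";
    out.write(request.getBytes("US-ASCII"));
    out.flush();

    BufferedReader in = new BufferedReader(
        new InputStreamReader(socket.getInputStream(), "US-ASCII"));
    System.out.println(in.readLine()); // e.g. "HTTP/1.1 200 OK"
    // A real keep-alive client must then use Content-Length or chunked
    // transfer-coding to find the end of the body, because the server will
    // not close the connection to mark it.
    socket.close();
  }
}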
Respectfully,
Fuad
P.S.
Sorry guys, I don't have enough time to participate... Could you please test
this suspicious behaviour and this very strange opinion of mine? Should I
create a new bug report at JIRA?
SUN's Socket, Apache's HttpClient, UNIX's networking...
-----Original Message-----
From: Daniele Menozzi [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 28, 2005 4:42 PM
To: [email protected]
Subject: Re: what contibute to fetch slowing down
On 10:27:55 28/Sep, AJ Chen wrote:
> I started the crawler with about 2000 sites. The fetcher could achieve
> 7 pages/sec initially, but the performance gradually dropped to about 2
> pages/sec, sometimes even 0.5 pages/sec. The fetch list had 300k pages
> and I used 500 threads. What are the main causes of this slowing down?
I have the same problem; I've tried with different numbers of fetchers
(10, 20, 50, 100, ...), but the download rate always decreases systematically,
page after page. The machine is a P4 1.7, 768 MB RAM, running Debian on a
2.6.12 kernel. The bandwidth isn't a problem (10 Mbit in and 10 Mbit out), but
I cannot obtain a stable, and high, pages/s rate. I've also tried changing
machine and kernel, but the problem still remains. Can you please give us some
advice? Thank you for your help,
Menoz
--
Free Software Enthusiast
Debian Powered Linux User #332564
http://menoz.homelinux.org