Unfortunately, this line is commented out in Kelvin's code:

//      reqStr.append("Connection: Keep-Alive\r\n");


I found only

reqStr.append(" HTTP/1.1\r\n");

- but sending "HTTP/1.1" in the request line does not by itself mean the
HTTP/1.1 features (persistent connections in particular) are implemented.
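
For contrast, here is a minimal sketch of what a persistent-connection
fetch involves. This is illustrative only, not the patch's code;
example.com, the paths, and the class/method names are placeholders:

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

// Sketch only: two GETs over one persistent connection.
public class KeepAliveSketch {

    public static void main(String[] args) throws IOException {
        Socket socket = new Socket("example.com", 80);
        OutputStream out = socket.getOutputStream();
        DataInputStream in = new DataInputStream(socket.getInputStream());

        String[] paths = { "/", "/index.html" };
        for (int i = 0; i < paths.length; i++) {
            StringBuffer reqStr = new StringBuffer();
            reqStr.append("GET " + paths[i] + " HTTP/1.1\r\n");
            reqStr.append("Host: example.com\r\n");
            // The header that is commented out in the patch:
            reqStr.append("Connection: Keep-Alive\r\n");
            reqStr.append("\r\n");
            out.write(reqStr.toString().getBytes("US-ASCII"));
            out.flush();

            // Read status line + headers; we must learn Content-Length,
            // because the body has to be consumed exactly before the
            // connection can be reused. (Chunked encoding not handled.)
            int contentLength = -1;
            String line;
            while ((line = readLine(in)).length() > 0) {
                if (line.toLowerCase().startsWith("content-length:")) {
                    contentLength = Integer.parseInt(line.substring(15).trim());
                }
            }
            if (contentLength >= 0) {
                in.readFully(new byte[contentLength]); // skip the body
            }
            System.out.println("fetched " + paths[i]);
        }
        socket.close();
    }

    // Read one CRLF-terminated line as ASCII.
    private static String readLine(InputStream in) throws IOException {
        StringBuffer sb = new StringBuffer();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            if (c != '\r') sb.append((char) c);
        }
        return sb.toString();
    }
}

The point is the bookkeeping: a real fetcher also has to honor the
server's "Connection:" reply and chunked encoding before it can safely
reuse the socket.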


Teleport Ultra v1.29 needs just a few hours to download all of the plain
HTML from SUN; Nutch needs a few days. (The link is 8 Mbps download /
800 Kbps upload.)


-----Original Message-----
From: Michael Ji [mailto:[EMAIL PROTECTED] 
Sent: Sunday, October 02, 2005 5:37 PM
To: [email protected]
Subject: RE: what contributes to fetch slowing down


Kelvin's OC implementation queues fetch requests by host and uses the
HTTP/1.1 protocol. It is currently a Nutch patch.

Michael Ji,

--- Fuad Efendi <[EMAIL PROTECTED]> wrote:

> Some suggestions to improve performance:
> 
> 
> 1. Decrease randomization of FetchList.
>  
> Here is the comment from FetchListTool:
>    /**
>      * The TableSet class will allocate a given FetchListEntry
>      * into one of several ArrayFiles.  It chooses which
>      * ArrayFile based on a hash of the URL's domain name.
>      *
>      * It uses a hash of the domain name so that pages are
>      * allocated to a random ArrayFile, but same-host pages
>      * go to the same file (for efficiency purposes during
>      * fetch).
>      *
>      * Further, within a given file, the FetchListEntry items
>      * appear in random order.  This is so that we don't
>      * hammer the same site over and over again during fetch.
>      *
>      * Each table should receive a roughly
>      * even number of entries, but all URLs for a specific
>      * domain name will be found in a single table.  If
>      * the dataset is weirdly skewed toward large domains,
>      * there may be an uneven distribution.
>      */
> 
> As for "same-host pages go to the same file" - they should also go in
> sequence, without being mixed/randomized with pages from other hosts...
> 
> Right now we fetch a single URL and then forget that the TCP/IP
> connection exists; we even forget that the web server created a client
> process to handle our HTTP requests. Reusing that connection is what
> Keep-Alive means. Creating a TCP connection, and additionally creating
> such a client process on the web server, costs a lot of CPU on both
> sides, Nutch and the web server.
> 
> I suggest using a single keep-alive thread to fetch each host, without
> randomization.
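> 
> Roughly like the sketch below (illustrative names, not the patch's
> code): group the fetch list by host, then let one worker drain each
> host's queue in order over a single keep-alive connection, with a
> polite delay between requests.
> 
> import java.net.URL;
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.Iterator;
> import java.util.List;
> import java.util.Map;
> 
> // Sketch only: build one fetch queue per host.
> public class HostQueues {
>     public static Map groupByHost(List fetchList) {
>         Map queues = new HashMap(); // host -> List of URLs
>         for (Iterator it = fetchList.iterator(); it.hasNext();) {
>             URL url = (URL) it.next();
>             List queue = (List) queues.get(url.getHost());
>             if (queue == null) {
>                 queue = new ArrayList();
>                 queues.put(url.getHost(), queue);
>             }
>             queue.add(url);
>         }
>         return queues; // one keep-alive worker thread per host
>     }
> }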
> 
> 
> 2. Use/investigate more of the Socket API, such as
> public void setSoTimeout(int timeout)
> public void setReuseAddress(boolean on)
> 
> I found this in the J2SE API docs for setReuseAddress (the default is
> false):
> =====
> When a TCP connection is closed the connection may remain in a timeout
> state for a period of time after the connection is closed (typically
> known as the TIME_WAIT state or 2MSL wait state). For applications
> using a well known socket address or port it may not be possible to
> bind a socket to the required SocketAddress if there is a connection
> in the timeout state involving the socket address or port.
> =====
> 
> It probably means that we end up with a huge number (65000!) of TCP
> ports stuck in TIME_WAIT after Socket.close(), and the fetcher threads
> are then blocked by the OS, waiting for some of those ports to be
> released... Am I right?
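> 
> For reference, setting both options looks like this (host, port, and
> timeouts are placeholders; note SO_REUSEADDR only matters when the
> socket binds a local address, and it must be set before bind/connect):
> 
> import java.net.InetSocketAddress;
> import java.net.Socket;
> 
> // Sketch only: the two Socket options mentioned above.
> public class SocketOptions {
>     public static void main(String[] args) throws Exception {
>         Socket socket = new Socket();    // created unconnected
>         socket.setReuseAddress(true);    // must precede bind()/connect()
>         socket.setSoTimeout(10000);      // fail a read() after 10 seconds
>         socket.connect(new InetSocketAddress("example.com", 80), 5000);
>         System.out.println("soTimeout=" + socket.getSoTimeout());
>         socket.close();
>     }
> }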
> 
> 
> P.S.
> Anyway, using the Keep-Alive option is very important not only for us
> but also for production web sites.
> 
> Thanks,
> Fuad
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Fuad Efendi [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 30, 2005 10:58 PM
> To: [email protected]; [EMAIL PROTECTED]
> Subject: RE: what contributes to fetch slowing down
> 
> 
> Dear Nutchers,
> 
> 
> I noticed the same problem twice: on a 2 GHz Pentium Mobile with
> Windows XP and 2 GB of RAM, and on a dual Opteron 252 with SUSE Linux
> and 4 GB.
> 
> I have only one explanation, which should probably be mirrored in JIRA:
> 
> 
> ================
> Network.
> ========
> 
> 
> 1.
> I never had such a problem with The Grinder,
> http://grinder.sourceforge.net, which is based on the alternative
> HTTPClient, http://www.innovation.ch/java/HTTPClient/index.html.
> The Apache Software Foundation should really review their HttpClient
> RC3(!!!) accordingly; HTTPClient (upper-case HTTP) is not "alpha", it
> is a production version... I used the Grinder a lot; it can execute
> 32 processes with 64 threads each in 2048 MB of RAM...
> 
> 
> 2.
> I found this in SUN's API:
> java.net.Socket
> public void setReuseAddress(boolean on) - please check the API!!!
> 
> 
> 3.
> I saw this code in your protocol-http plugin:
> ... HTTP/1.0 ...
> Why? Why version 1.0??? The fetcher should understand server replies
> such as "Connection: close", "Connection: keep-alive", etc.
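> 
> A rough sketch of the check I mean (illustrative, not the plugin's
> code); HTTP/1.1 defaults to a persistent connection while HTTP/1.0
> does not:
> 
> // Sketch only: decide reuse from the status line + Connection header.
> public class ConnectionHeader {
>     static boolean canReuse(String statusLine, String connection) {
>         boolean http11 = statusLine.startsWith("HTTP/1.1");
>         if (connection == null) {
>             return http11; // HTTP/1.1 is persistent by default
>         }
>         String v = connection.toLowerCase();
>         if (v.indexOf("close") >= 0) return false;
>         if (v.indexOf("keep-alive") >= 0) return true;
>         return http11;
>     }
> 
>     public static void main(String[] args) {
>         System.out.println(canReuse("HTTP/1.1 200 OK", null));         // true
>         System.out.println(canReuse("HTTP/1.1 200 OK", "close"));      // false
>         System.out.println(canReuse("HTTP/1.0 200 OK", "Keep-Alive")); // true
>     }
> }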
> 
> 
> 4.
> By the way, how many file descriptors does UNIX need in order to
> maintain 65536 network sockets?
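> 
> (Each socket costs one file descriptor, so 65536 sockets need at
> least that many for the process, plus ulimit headroom. A sketch for
> watching the count, assuming a Sun JVM where com.sun.management is
> available:)
> 
> import java.lang.management.ManagementFactory;
> import com.sun.management.UnixOperatingSystemMXBean;
> 
> // Sketch only: watch the JVM's descriptor usage on UNIX.
> public class FdCount {
>     public static void main(String[] args) {
>         Object os = ManagementFactory.getOperatingSystemMXBean();
>         if (os instanceof UnixOperatingSystemMXBean) {
>             UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
>             System.out.println("open fds: " + unix.getOpenFileDescriptorCount());
>             System.out.println("max fds:  " + unix.getMaxFileDescriptorCount());
>         }
>     }
> }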
> 
> 
> Respectfully,
> Fuad
> 
> P.S.
> Sorry guys, I don't have enough time to participate... Could you
> please test this suspicious behaviour and check this very strange
> theory? Should I create a new bug report in JIRA?
> 
> SUN's Socket, Apache's HttpClient, UNIX's
> networking...
> 
> 
> 
> 
> -----Original Message-----
> From: Daniele Menozzi [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 28, 2005 4:42 PM
> To: [email protected]
> Subject: Re: what contributes to fetch slowing down
> 
> 
> On 10:27:55 28/Sep, AJ Chen wrote:
> > I started the crawler with about 2000 sites.  The fetcher could
> > achieve 7 pages/sec initially, but the performance gradually dropped
> > to about 2 pages/sec, sometimes even 0.5 pages/sec.  The fetch list
> > had 300k pages and I used 500 threads. What are the main causes of
> > this slowing down?
> 
> 
> I have the same problem; I've tried with different numbers of fetchers
> (10, 20, 50, 100, ...), but the download rate always decreases
=== message truncated ===



        
                