Hi everybody, I want to understand the error rate when using 2000+ threads. With 2000 threads I get 80-90% errors; with 1000 threads we get around 20% errors. Is there a relation between the number of threads and the error rate? Can anyone share a configuration that gets lower error rates with 2000+ threads? We have a random set of URLs from a diverse set of hosts.
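For context, the properties below are the ones that usually come into play when pushing the fetcher to thread counts this high. The names are standard Nutch/Hadoop settings, but the values are only illustrative guesses, not a tested recipe for 2000+ threads:

<property>
  <name>fetcher.threads.fetch</name>
  <value>2000</value>
  <description>Number of fetcher threads used by each fetch task.</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>Maximum concurrent requests to a single host; leaving this at 1 keeps the crawl polite.</description>
</property>

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>HTTP network timeout in milliseconds; at high concurrency a timeout that is too low shows up as fetch errors.</description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -Xss512k</value>
  <description>JVM options for the child task; the small -Xss512k thread stack is what makes 2000+ threads fit in memory.</description>
</property>

If most of the errors are timeouts or hung threads rather than per-host failures, the crawl-delay patch discussed in the thread below may also help.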
Dennis Kubes wrote:
> Sorry, yeah it was 344 not 334.
>
> Dennis
>
> Ken Krugler wrote:
>
>>> Here is a lightly tested patch for the crawl delay that allows urls
>>> with a crawl delay greater than x number of seconds to be ignored.
>>> I have currently run this on over 2 million urls and it is working
>>> well. With this patch I also added the following to my nutch-site.xml
>>> file for ignoring sites with a crawl delay > 30 seconds. The value can
>>> be changed to suit. I have seen crawl delays as high as 259200 seconds
>>> in our crawls.
>>>
>>> <property>
>>>   <name>http.max.crawl.delay</name>
>>>   <value>30</value>
>>>   <description>
>>>   If the crawl delay in robots.txt is set to greater than this value then the
>>>   fetcher will ignore this page. If set to -1 the fetcher will never ignore
>>>   pages and will wait the amount of time retrieved from the robots.txt crawl
>>>   delay. This can cause hung threads if the delay is >= the task timeout value.
>>>   If all threads get hung it can cause the fetcher task to abort prematurely.
>>>   </description>
>>> </property>
>>
>> Thanks, this is useful. We did a survey of a bunch of sites, and
>> found crawl delay values up to 99999 seconds.
>>
>>> The most recent patches to fetcher (not fetcher2) with NUTCH-334
>>> seem to have sped up our fetching dramatically. We are only using
>>> about 50 fetchers but are consistently fetching 1M+ urls per day.
>>> The patch attached here and the 334 patches will help if staying on 0.8.
>>> If moving forward I think the new Fetcher2 codebase is a better
>>> solution, though still a new one.
>>
>> NUTCH-334 or NUTCH-344? I'm assuming the latter
>> (http://issues.apache.org/jira/browse/NUTCH-344).
>>
>> Thanks,
>>
>> -- Ken
>>
>>> Ken Krugler wrote:
>>>
>>>>> On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Several people reported issues with the slow fetcher in 0.8...
>>>>>>
>>>>>> I run Nutch on a dual CPU (+HT) box, and have noticed that the
>>>>>> fetch speed didn't increase when I went from using 100 threads
>>>>>> to 200 threads. Has anyone else observed the same?
>>>>>>
>>>>>> I was using 2 map tasks (mapred.map.tasks property) in both
>>>>>> cases, and the aggregate fetch speed was between 20 and 40
>>>>>> pages/sec. This was a fetch of 50K+ URLs from a diverse set of
>>>>>> servers.
>>>>>>
>>>>>> While crawling, strace -p<PID> and strace -ff<PID> show a LOT of
>>>>>> gettimeofday calls. Running strace several times in a row kept
>>>>>> showing that gettimeofday is the most frequent system call.
>>>>>> Has anyone tried tracing the fetcher process? Where do these
>>>>>> calls come from? Any call to new Date() or Calendar.getInstance(),
>>>>>> as must be done for every single logging call, perhaps?
>>>>>>
>>>>>> I can certainly be impolite and lower fetcher.server.delay to 1
>>>>>> second or even 0, but I'd like to be polite.
>>>>>>
>>>>>> I saw Ken Krugler's email suggesting to increase the number of
>>>>>> fetcher threads to 2000+ and set the maximum java thread stack
>>>>>> size to 512k with -Xss. Has anyone other than Ken tried this
>>>>>> with success? Wouldn't the JVM go crazy context switching
>>>>>> between this many threads?
>>>>
>>>> Note that most of the time these fetcher threads are all blocked,
>>>> waiting for other threads that are already fetching from the same
>>>> IP address. So there's not a lot of thrashing.
>>>>
>>>>> I have been working with 512k -Xss (Ken Krugler's suggestion) and it
>>>>> works well.
>>>>> However, the number of fetchers in my case is 2500+. I had to play
>>>>> around with this number to match my bandwidth limitation, but now I
>>>>> max out my full bandwidth. The problems I run into are fetcher
>>>>> threads hanging, and the crawl delay/robots.txt issue (please see
>>>>> Dennis Kubes' posting on this).
>>>>
>>>> Yes, these are definitely problems.
>>>>
>>>> Stefan has been working on a queue-based fetcher that uses NIO.
>>>> Seems very promising, but not yet ready for prime time.
>>>>
>>>> -- Ken
>>>
>>>
>>> Index: src/java/org/apache/nutch/protocol/ProtocolStatus.java
>>> ===================================================================
>>> --- src/java/org/apache/nutch/protocol/ProtocolStatus.java  (revision 428457)
>>> +++ src/java/org/apache/nutch/protocol/ProtocolStatus.java  (working copy)
>>> @@ -60,6 +60,10 @@
>>>    public static final int NOTFETCHING = 20;
>>>    /** Unchanged since the last fetch. */
>>>    public static final int NOTMODIFIED = 21;
>>> +  /** Request was refused by protocol plugins, because it would block.
>>> +   * The expected number of milliseconds to wait before retry may be provided
>>> +   * in args. */
>>> +  public static final int WOULDBLOCK = 22;
>>>
>>>    // Useful static instances for status codes that don't usually require any
>>>    // additional arguments.
>>> Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
>>> ===================================================================
>>> --- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java  (revision 430392)
>>> +++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java  (working copy)
>>> @@ -125,6 +125,9 @@
>>>
>>>    /** Do we use HTTP/1.1? */
>>>    protected boolean useHttp11 = false;
>>> +
>>> +  /** Ignore page if crawl delay over this number of seconds */
>>> +  private int maxCrawlDelay = -1;
>>>
>>>    /** Creates a new instance of HttpBase */
>>>    public HttpBase() {
>>> @@ -152,6 +155,7 @@
>>>      this.userAgent = getAgentString(conf.get("http.agent.name"), conf.get("http.agent.version"), conf
>>>          .get("http.agent.description"), conf.get("http.agent.url"), conf.get("http.agent.email"));
>>>      this.serverDelay = (long) (conf.getFloat("fetcher.server.delay", 1.0f) * 1000);
>>> +    this.maxCrawlDelay = (int) conf.getInt("http.max.crawl.delay", -1);
>>>      // backward-compatible default setting
>>>      this.byIP = conf.getBoolean("fetcher.threads.per.host.by.ip", true);
>>>      this.useHttp11 = conf.getBoolean("http.http11", false);
>>> @@ -185,6 +189,17 @@
>>>        long crawlDelay = robots.getCrawlDelay(this, u);
>>>        long delay = crawlDelay > 0 ? crawlDelay : serverDelay;
>>> +
>>> +      int crawlDelaySeconds = (int) (delay / 1000);
>>> +      if (crawlDelaySeconds > maxCrawlDelay && maxCrawlDelay >= 0) {
>>> +        LOGGER.info("Ignoring " + u + ", exceeds max crawl delay: max=" +
>>> +          maxCrawlDelay + ", crawl delay=" + crawlDelaySeconds +
>>> +          " seconds");
>>> +        Content c = new Content(u.toString(), u.toString(), EMPTY_CONTENT,
>>> +          null, null, this.conf);
>>> +        return new ProtocolOutput(c, new ProtocolStatus(ProtocolStatus.WOULDBLOCK));
>>> +      }
>>> +
>>>        String host = blockAddr(u, delay);
>>>        Response response;
>>>        try {
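One note on using the patch above: whatever consumes the ProtocolOutput also has to decide what to do with the new WOULDBLOCK code, otherwise the skipped pages just look like generic failures. Below is a minimal sketch of that handling, assuming the usual ProtocolOutput.getStatus()/ProtocolStatus.getCode() accessors; the markIgnored() helper is purely hypothetical bookkeeping, not part of Nutch or of the patch.

// Sketch only: how a fetch loop might react to the new WOULDBLOCK status.
// ProtocolOutput/ProtocolStatus are the real Nutch classes; markIgnored()
// is a hypothetical placeholder for whatever the caller wants to record.
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.protocol.ProtocolStatus;

public class WouldBlockHandlingSketch {

  void handle(ProtocolOutput output, String url) {
    ProtocolStatus status = output.getStatus();
    switch (status.getCode()) {
      case ProtocolStatus.SUCCESS:
        // Normal path: parse/store output.getContent() as usual.
        break;
      case ProtocolStatus.WOULDBLOCK:
        // Crawl delay exceeded http.max.crawl.delay; skip without retrying now.
        markIgnored(url);
        break;
      default:
        // Other failure codes keep their existing handling.
        break;
    }
  }

  private void markIgnored(String url) {
    System.out.println("Skipping " + url + " (crawl delay too large)");
  }
}

Such urls could instead be written back to the crawl db with a retry interval, but that is a design choice outside the scope of the patch.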
