Sorry, yeah it was 344 not 334.

Dennis

Ken Krugler wrote:
>> Here is a lightly tested patch for the crawl delay that allows urls 
>> with crawl delay set greater than x number of seconds to be ignored. 
>> So far I have run this on over 2 million urls and it is working 
>> well.  With this patch I also added this to my nutch-site.xml file 
>> to ignore sites with crawl delay > 30 seconds.  The value can be 
>> changed to suit.  I have seen crawl delays as high as 259200 seconds 
>> in our crawls.
>>
>> <property>
>> <name>http.max.crawl.delay</name>
>> <value>30</value>
>> <description>
>> If the crawl delay in robots.txt is set to greater than this value
>> (in seconds) then the fetcher will ignore this page.  If set to -1 the
>> fetcher will never ignore pages and will wait the amount of time
>> retrieved from the robots.txt crawl delay.  This can cause hung threads
>> if the delay is >= the task timeout value.  If all threads get hung it
>> can cause the fetcher task to abort prematurely.
>> </description>
>> </property>
>
> Thanks, this is useful. We did a survey of a bunch of sites, and found 
> crawl delay values up to 99999 seconds.
>
>> The most recent patches to fetcher (not fetcher2) with NUTCH-334 
>> seem to have sped up our fetching dramatically.  We are only using 
>> about 50 fetchers but are consistently fetching 1M+ urls per day. The 
>> attached patch and the 334 patches will help if staying on 0.8. If 
>> moving forward I think the new Fetcher2 codebase is a better solution, 
>> though still a new one.
>
> NUTCH-334 or NUTCH-344? I'm assuming the latter 
> (http://issues.apache.org/jira/browse/NUTCH-344).
>
> Thanks,
>
> -- Ken
>
>
>> Ken Krugler wrote:
>>>> On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>>>> Hello,
>>>>>
>>>>> Several people reported issues with slow fetcher in 0.8...
>>>>>
>>>>> I run Nutch on a dual CPU (+HT) box, and have noticed that the 
>>>>> fetch speed didn't increase when I went from using 100 threads, to 
>>>>> 200 threads.  Has anyone else observed the same?
>>>>>
>>>>> I was using 2 map tasks (mapred.map.tasks property) in both cases, 
>>>>> and the aggregate fetch speed was between 20 and 40 pages/sec. 
>>>>> This was a fetch of 50K+ URLs from a diverse set of servers.
>>>>>
>>>>> While crawling, strace -p<PID> and strace -ff<PID> shows a LOT of 
>>>>> gettimeofday calls.  Running strace several times in a row kept 
>>>>> showing that gettimeofday is the most frequent system call.
>>>>> Has anyone tried tracing the fetcher process?  Where do these 
>>>>> calls come from?  Any call to new Date() or 
>>>>> Calendar.getInstance(), as must be done for every single logging 
>>>>> call, perhaps?
>>>>>
>>>>> I can certainly be impolite and lower fetcher.server.delay to 1 
>>>>> second or even 0, but I'd like to be polite.
>>>>>
>>>>> I saw Ken Krugler's email suggesting to increase the number of 
>>>>> fetcher threads to 2000+ and set the maximal java thread stack 
>>>>> size to 512k with -Xss.  Has anyone other than Ken tried this with 
>>>>> success?  Wouldn't the JVM go crazy context switching between this 
>>>>> many threads?
>>>
>>> Note that most of the time these fetcher threads are all blocked, 
>>> waiting for other threads that are already fetching from the same IP 
>>> address. So there's not a lot of thrashing.
>>>
>>>> I have been working with 512k -Xss (Ken Krugler's suggestion) and it
>>>> works well. However, the number of fetcher threads for my part is
>>>> 2500+. I had to play around with this number to match my bandwidth
>>>> limitation, but now I use my full bandwidth. The problems that I run
>>>> into are fetcher threads hanging, and the crawl delay/robots.txt
>>>> handling (please see Dennis Kubes' posting on this).
>>>
>>> Yes, these are definitely problems.
>>>
>>> Stefan has been working on a queue-based fetcher that uses NIO. 
>>> Seems very promising, but not yet ready for prime time.
>>>
>>> -- Ken
>>
>>
>> Index: src/java/org/apache/nutch/protocol/ProtocolStatus.java
>> ===================================================================
>> --- src/java/org/apache/nutch/protocol/ProtocolStatus.java    (revision 428457)
>> +++ src/java/org/apache/nutch/protocol/ProtocolStatus.java    (working copy)
>> @@ -60,6 +60,10 @@
>>    public static final int NOTFETCHING          = 20;
>>    /** Unchanged since the last fetch. */
>>    public static final int NOTMODIFIED          = 21;
>> +  /** Request was refused by protocol plugins, because it would block.
>> +   * The expected number of milliseconds to wait before retry may be provided
>> +   * in args. */
>> +  public static final int WOULDBLOCK          = 22;
>>
>>    // Useful static instances for status codes that don't usually require any
>>    // additional arguments.
>> Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
>> ===================================================================
>> --- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java    (revision 430392)
>> +++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java    (working copy)
>> @@ -125,6 +125,9 @@
>>
>>    /** Do we use HTTP/1.1? */
>>    protected boolean useHttp11 = false;
>> +
>> +  /** Ignore page if crawl delay over this number of seconds */
>> +  private int maxCrawlDelay = -1;
>>
>>    /** Creates a new instance of HttpBase */
>>    public HttpBase() {
>> @@ -152,6 +155,7 @@
>>          this.userAgent = getAgentString(conf.get("http.agent.name"), conf.get("http.agent.version"), conf
>>                  .get("http.agent.description"), conf.get("http.agent.url"), conf.get("http.agent.email"));
>>          this.serverDelay = (long) (conf.getFloat("fetcher.server.delay", 1.0f) * 1000);
>> +        this.maxCrawlDelay = (int) conf.getInt("http.max.crawl.delay", -1);
>>          // backward-compatible default setting
>>          this.byIP = conf.getBoolean("fetcher.threads.per.host.by.ip", true);
>>          this.useHttp11 = conf.getBoolean("http.http11", false);
>> @@ -185,6 +189,17 @@
>>
>>        long crawlDelay = robots.getCrawlDelay(this, u);
>>        long delay = crawlDelay > 0 ? crawlDelay : serverDelay;
>> +
>> +      int crawlDelaySeconds = (int)(delay / 1000);
>> +      if (crawlDelaySeconds > maxCrawlDelay && maxCrawlDelay >= 0) {
>> +        LOGGER.info("Ignoring " + u + ", exceeds max crawl delay: max=" +
>> +          maxCrawlDelay + ", crawl delay=" + crawlDelaySeconds +
>> +          " seconds");
>> +        Content c = new Content(u.toString(), u.toString(), EMPTY_CONTENT,
>> +          null, null, this.conf);
>> +        return new ProtocolOutput(c, new ProtocolStatus(ProtocolStatus.WOULDBLOCK));
>> +      }
>> +
>>        String host = blockAddr(u, delay);
>>        Response response;
>>        try {
>
>
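
For anyone who wants to try the cutoff rule from the patch above without
building Nutch, here is a rough standalone sketch of just the decision
logic. The class and method names (CrawlDelayFilter, effectiveDelayMillis,
shouldSkip) are made up for illustration and are not part of Nutch. It
assumes, as the patch does, that delays are held in milliseconds and that
http.max.crawl.delay is in seconds, with a negative value disabling the check.

```java
// Hypothetical standalone sketch; these names are NOT part of Nutch.
final class CrawlDelayFilter {

    /** Maximum tolerated crawl delay in seconds; negative disables the check. */
    private final int maxCrawlDelaySeconds;

    CrawlDelayFilter(int maxCrawlDelaySeconds) {
        this.maxCrawlDelaySeconds = maxCrawlDelaySeconds;
    }

    /** As in the patch: a positive robots.txt delay wins, else the server delay. */
    static long effectiveDelayMillis(long robotsCrawlDelayMillis, long serverDelayMillis) {
        return robotsCrawlDelayMillis > 0 ? robotsCrawlDelayMillis : serverDelayMillis;
    }

    /** True if the page should be ignored because its delay exceeds the maximum. */
    boolean shouldSkip(long delayMillis) {
        if (maxCrawlDelaySeconds < 0) {
            return false; // -1 means "never ignore, always wait"
        }
        int delaySeconds = (int) (delayMillis / 1000);
        return delaySeconds > maxCrawlDelaySeconds;
    }

    public static void main(String[] args) {
        CrawlDelayFilter filter = new CrawlDelayFilter(30);
        long serverDelay = 1000L; // fetcher.server.delay of 1.0 second

        // robots.txt asks for 259200 seconds: skipped
        System.out.println(filter.shouldSkip(effectiveDelayMillis(259200000L, serverDelay))); // true
        // robots.txt asks for exactly 30 seconds: still fetched
        System.out.println(filter.shouldSkip(effectiveDelayMillis(30000L, serverDelay)));     // false
        // no crawl delay in robots.txt: falls back to the server delay
        System.out.println(filter.shouldSkip(effectiveDelayMillis(0L, serverDelay)));         // false
    }
}
```

Note that the comparison is strict, so a site asking for exactly the
configured maximum is still fetched.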

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
