[ 
http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415202 ] 

Andrzej Bialecki  commented on NUTCH-293:
-----------------------------------------

Stefan, as you will remember, we discussed modifying the fetcher, and
specifically replacing the thread spin-waiting mechanism with a worker queue.
As it stands, this is a can of worms that I'd rather not touch: there are many
subtle conditions here that this patch would complicate further. E.g., the
number of spin-waiting threads vs. the number of free threads is normally
affected by only five factors: the total number of threads, the non-uniqueness
rate in the current fetchlist, the sites' bandwidth, the configured delay
between requests, and the allowed number of threads per host. This patch adds
a sixth factor, one that varies per site, which makes it much harder to
predict how many threads you need to avoid deadlocking all of them.
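
To make the "sixth factor" concrete, here is a minimal sketch in Java of
per-host delay bookkeeping. It is not the Nutch fetcher code; the class
HostDelayTracker and its methods setHostDelay and reserve are hypothetical
names chosen for illustration, and time is passed in explicitly so the
behaviour is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch (not Nutch code): tracks when each host may next be fetched. */
class HostDelayTracker {
    private final long defaultDelayMs;
    // Site-specific delays, e.g. values a crawler might read from robots.txt (assumption).
    private final Map<String, Long> perHostDelayMs = new HashMap<>();
    // Earliest time (ms) at which the next request to each host may start.
    private final Map<String, Long> nextAllowedMs = new HashMap<>();

    HostDelayTracker(long defaultDelayMs) {
        this.defaultDelayMs = defaultDelayMs;
    }

    /** Overrides the configured default delay for a single host. */
    synchronized void setHostDelay(String host, long delayMs) {
        perHostDelayMs.put(host, delayMs);
    }

    /**
     * Reserves the next fetch slot for the host and returns how many
     * milliseconds the calling thread would have to wait, starting at nowMs.
     */
    synchronized long reserve(String host, long nowMs) {
        long delay = perHostDelayMs.getOrDefault(host, defaultDelayMs);
        long allowed = nextAllowedMs.getOrDefault(host, nowMs);
        long start = Math.max(allowed, nowMs);
        nextAllowedMs.put(host, start + delay);
        return start - nowMs;
    }
}
```

With a uniform 1000 ms delay, two back-to-back reservations for the same host
at time 0 return waits of 0 and 1000 ms. Once one host is given a 30000 ms
delay, the same two reservations return 0 and 30000 ms, so the time a thread
stays blocked depends on which sites happen to be in the fetchlist.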

I'm not strongly opposed to this change, quite the contrary - this is useful
functionality. It's just that I'm concerned that it adds yet another
feature to messy code that needs to be rewritten from scratch.

OTOH, it's a non-intrusive quick hack. If we have to have it now, it's 
definitely better than waiting for some distant future when we rewrite the 
fetcher ... ;)

> support for Crawl-delay in Robots.txt
> -------------------------------------
>
>          Key: NUTCH-293
>          URL: http://issues.apache.org/jira/browse/NUTCH-293
>      Project: Nutch
>         Type: Improvement

>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Critical
>  Attachments: crawlDelayv1.patch
>
> Nutch needs support for the Crawl-delay directive in robots.txt. It is not
> part of the standard, but it is a de facto standard.
> See:
> http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
> Webmasters have started blocking Nutch because we do not support it.
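
For reference, a minimal sketch in Java of reading a Crawl-delay value from
robots.txt content. It is not the attached patch and not Nutch's robots.txt
parser; CrawlDelaySketch and parseCrawlDelayMs are hypothetical names. It
assumes the value is given in seconds, as the directive is commonly
interpreted, and it deliberately ignores User-agent grouping, which a real
parser would have to honour.

```java
/** Hypothetical sketch (not Nutch code): extracts a Crawl-delay value from robots.txt text. */
class CrawlDelaySketch {

    /**
     * Returns the first valid Crawl-delay found, converted to milliseconds,
     * or -1 if there is none. Simplification: User-agent sections are ignored.
     */
    static long parseCrawlDelayMs(String robotsTxt) {
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw;
            // Strip trailing comments.
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "field: value" line
            }
            String field = line.substring(0, colon).trim();
            if (!field.equalsIgnoreCase("Crawl-delay")) {
                continue;
            }
            String value = line.substring(colon + 1).trim();
            try {
                double seconds = Double.parseDouble(value);
                // Reject negative, NaN and infinite values.
                if (seconds >= 0 && !Double.isInfinite(seconds)) {
                    return Math.round(seconds * 1000);
                }
            } catch (NumberFormatException e) {
                // Malformed value: skip this line and keep scanning.
            }
        }
        return -1;
    }
}
```

A caller could fall back to the configured delay between requests whenever
the method returns -1.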
