[ http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415202 ]
Andrzej Bialecki commented on NUTCH-293:
-----------------------------------------

Stefan, as you remember, we had a discussion on modifying the fetcher, and specifically on changing the thread spin-waiting mechanism into a worker queue. As it is now, this is a can of worms that I'd rather not touch - there are many subtle conditions here that would be further complicated by this patch. E.g. the number of spin-waiting threads vs. the number of free threads is normally affected only by five factors: total number of threads, non-uniqueness rate in the current fetchlist, sites' bandwidth, configured delay between requests, and allowed # of threads/host. This patch adds a sixth factor, variable per site, which makes it much harder to predict how many threads you need to avoid dead-locking all of them.

I'm not strongly opposed to this change, quite the contrary - this is useful functionality. It's just that I'm concerned that it adds yet another piece of functionality to messy code that needs to be rewritten from scratch.

OTOH, it's a non-intrusive quick hack. If we have to have it now, it's definitely better than waiting for some distant future when we rewrite the fetcher ... ;)

> support for Crawl-delay in Robots.txt
> -------------------------------------
>
> Key: NUTCH-293
> URL: http://issues.apache.org/jira/browse/NUTCH-293
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Critical
> Attachments: crawlDelayv1.patch
>
> Nutch needs support for Crawl-delay as defined in robots.txt. It is not a
> standard, but it is a de-facto standard.
> See:
> http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
> Webmasters have started blocking Nutch because we do not support it.
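
To make the worker-queue idea from the comment concrete, here is a minimal, language-agnostic sketch in Python. It is not Nutch code and not the attached patch: the class HostScheduler and its methods are hypothetical names, the clock is passed in explicitly so the example is deterministic, and it assumes one request at a time per host with the delay measured from dispatch. The point it illustrates is that a per-site Crawl-delay becomes just another key in a priority queue, and an idle worker can sleep for a known interval instead of spin-waiting.

```python
import heapq
from collections import defaultdict, deque


class HostScheduler:
    """Hypothetical per-host politeness queue (a sketch, not the Nutch fetcher)."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay   # configured delay between requests
        self.delays = {}                     # host -> Crawl-delay override (seconds)
        self.pending = defaultdict(deque)    # host -> URLs still to fetch
        self.next_ok = {}                    # host -> earliest time of next request
        self.heap = []                       # min-heap of (ready_time, host)
        self.scheduled = set()               # hosts that currently have a heap entry

    def set_crawl_delay(self, host, seconds):
        """Record a per-site delay, e.g. one parsed from robots.txt."""
        self.delays[host] = seconds

    def add(self, host, url):
        """Queue a URL; schedule the host if it has no heap entry yet."""
        self.pending[host].append(url)
        if host not in self.scheduled:
            heapq.heappush(self.heap, (self.next_ok.get(host, 0.0), host))
            self.scheduled.add(host)

    def time_until_ready(self, now):
        """Seconds a worker may sleep before work exists, or None if nothing is queued."""
        if not self.heap:
            return None
        return max(0.0, self.heap[0][0] - now)

    def next_url(self, now):
        """Return (host, url) if some host may be contacted at `now`, else None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        url = self.pending[host].popleft()
        # The next request to this host must wait for its own delay.
        self.next_ok[host] = now + self.delays.get(host, self.default_delay)
        if self.pending[host]:
            heapq.heappush(self.heap, (self.next_ok[host], host))
        else:
            self.scheduled.discard(host)
        return host, url
```

A worker loop built on this would call next_url(), and when it returns None, sleep for time_until_ready() rather than polling. The sketch deliberately ignores fetch duration and multiple threads per host, both of which the real fetcher has to account for.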
