Dennis Kubes wrote:
I am currently implementing a patch for the older 0.8 code that allows pages with a crawl delay greater than x seconds to be ignored, where x is configurable. What do you think the best way to return from HttpBase would be? Should it throw an HttpException, or return a ProtocolStatus with, say, GONE or something like that?
In the latest patch in NUTCH-339 I added a ProtocolStatus.WOULDBLOCK status and a section in Fetcher2 that is supposed to handle it. However, after I removed block/unblockAddr from lib-http, no code in that patch actually uses this status code.
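To make the status-code option concrete, here is a minimal sketch of a crawl-delay check that returns a status instead of throwing. It does not use the real Nutch classes: the Status enum, the Result type, the check method and the maxCrawlDelayMs parameter are all hypothetical stand-ins, and treating a negative limit as "no limit" is an assumption made only for this example.

```java
// Hypothetical sketch; none of these types are the real Nutch API.
class CrawlDelayCheck {

    /** Stand-in for a few protocol status codes (assumed names, not Nutch's class). */
    enum Status { SUCCESS, GONE, WOULDBLOCK }

    /** Stand-in result carrying a status code and a human-readable message. */
    static final class Result {
        final Status status;
        final String message;

        Result(Status status, String message) {
            this.status = status;
            this.message = message;
        }
    }

    /**
     * Decides whether a page should be fetched given the crawl delay
     * requested by the site and a configurable upper limit.
     *
     * @param crawlDelayMs    delay requested by the site, in milliseconds
     * @param maxCrawlDelayMs configured limit in milliseconds; a negative
     *                        value means "no limit" (assumption for this sketch)
     */
    static Result check(long crawlDelayMs, long maxCrawlDelayMs) {
        if (maxCrawlDelayMs >= 0 && crawlDelayMs > maxCrawlDelayMs) {
            // Report the skip through a status code rather than an exception,
            // so the caller can decide how to handle it.
            return new Result(Status.WOULDBLOCK,
                    "crawl delay " + crawlDelayMs + " ms exceeds limit "
                            + maxCrawlDelayMs + " ms");
        }
        return new Result(Status.SUCCESS, "ok");
    }

    public static void main(String[] args) {
        System.out.println(check(30_000L, 10_000L).status); // delay above the limit
        System.out.println(check(5_000L, 10_000L).status);  // delay within the limit
        System.out.println(check(30_000L, -1L).status);     // limit disabled
    }
}
```

Whether the real code should return a status like this or throw an exception is exactly the open question; the sketch only shows what the status-returning variant might look like.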
-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
