Re: retry later

Andrzej Bialecki Wed, 08 Mar 2006 01:45:18 -0800

mos wrote:

When you get an error while fetching, and you get the
org.apache.nutch.protocol.retrylater because the max retries have been
reached, Nutch says it has given up and will retry later. When does that
retry occur?


That's an issue I reported some weeks ago and which is in my opinion
an annoying bug in Nutch 0.7.1:

Nutch says that it "will retry later" those pages. In reality the next
fetch date is set to infinite and those pages are lost forever.
As a consequence, pages which are temporarily unavailable
would never be indexed when doing recrawls.
That's the reason why a recrawl on the basis of an existing webdb doesn't make
sense with Nutch 0.7.1.  To make sure that temporarily unavailable pages
are considered, you have to make a complete new crawl of all pages
(and throw away the old crawl).

I mentioned this issue on this list a few times and reported this issue on Jira:
http://issues.apache.org/jira/browse/NUTCH-205

Unfortunately no Nutch developer seems to be interested in this
serious issue.....

Thanks for your persistence on this subject... ;-) I agree, it's a real issue. Most developers (myself included) concentrate on the 0.8 branch now, which has a fix for this.

Basically, the whole premise of pages "truly gone" seems to be ill-defined. If we can't reach a page even 1000 times during a given period it doesn't automatically mean it's truly gone, it could mean that the server is temporarily down and we tried too often in a given period... so, as long as the links from other pages are valid we should still from time to time attempt to check the status of that page.

That's the reasoning behind the fix that went to 0.8 - if the last fetch was a long time ago (longer than a maximum interval for the installation) then we force a refetch anyway, and if it doesn't succeed we just increase the interval by 50%.
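
To make that rule concrete, here is a rough sketch in Python. It is not the actual 0.8 code: PageRecord, should_fetch and record_result are hypothetical names made up for illustration, times are plain seconds, and resetting the interval after a successful fetch is my assumption, since the description above only covers the failure case.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds in a day


@dataclass
class PageRecord:
    """Hypothetical stand-in for a crawl db entry; not a Nutch class."""
    last_fetch_time: float  # time of the last fetch attempt, in seconds
    fetch_interval: float   # seconds to wait before the next scheduled fetch
    gone: bool = False      # set once the page was given up on


def should_fetch(page, now, max_interval):
    """Return True if the page is due, or overdue for a forced re-check."""
    elapsed = now - page.last_fetch_time
    if elapsed > max_interval:
        # The last attempt is older than the installation-wide maximum,
        # so refetch anyway, even if the page was marked as gone.
        return True
    return not page.gone and elapsed >= page.fetch_interval


def record_result(page, now, success, default_interval):
    """Update the record after a fetch attempt."""
    page.last_fetch_time = now
    if success:
        # Assumption: a successful fetch clears the gone flag and
        # restores the default interval.
        page.gone = False
        page.fetch_interval = default_interval
    else:
        # A failed attempt backs off: increase the interval by 50%.
        page.fetch_interval *= 1.5
```

With a 90-day maximum interval, a page marked as gone is skipped on day 10 but picked up again on day 91, and a failure at that point stretches a 30-day interval to 45 days instead of dropping the page for good.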

Now, fixing this the same way in 0.7 would mean that pages no longer end up in PAGE_GONE state. Is this a fix of broken behavior or a new behavior (new feature)? I'm not sure...


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
