mos wrote:
When you get an error while fetching, and you get the
org.apache.nutch.protocol.retrylater because the max retries have been
reached, Nutch says it has given up and will retry later. When does that
retry occur?

That's an issue I reported some weeks ago, and in my opinion it is
an annoying bug in Nutch 0.7.1:

Nutch says that it "will retry later" those pages. In reality the next
fetch date is set to infinity and those pages are lost forever.
Consequently, pages that are temporarily unavailable will never be
indexed when doing recrawls.
That's the reason why a recrawl on the basis of an existing webdb doesn't make
sense with Nutch 0.7.1. To make sure that temporarily unavailable pages
are considered, you have to do a completely new crawl of all pages
(and throw away the old crawl).

I mentioned this issue on this list a few times and reported this issue on Jira:
http://issues.apache.org/jira/browse/NUTCH-205

Unfortunately no Nutch developer seems to be interested in this
serious issue.....

Thanks for your persistence on this subject... ;-) I agree, it's a real issue. Most developers (myself included) concentrate on the 0.8 branch now, which has a fix for this.

Basically, the whole premise of pages being "truly gone" seems to be ill-defined. If we can't reach a page even 1000 times during a given period, it doesn't automatically mean it's truly gone; it could mean that the server is temporarily down and we tried too often in a given period... so, as long as the links from other pages are valid, we should still attempt to check the status of that page from time to time.

That's the reasoning behind the fix that went into 0.8: if the last fetch was a long time ago (longer than the maximum interval for the installation) then we force a refetch anyway, and if it doesn't succeed we just increase the interval by 50%.
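
To make that rule concrete, here is a minimal sketch in Java. It is not the actual 0.8 code: the class name RefetchPolicySketch, both method names, and the use of millisecond timestamps are assumptions made up for illustration. It only shows the two decisions described above, forcing a refetch once the last fetch is older than the maximum interval, and growing the interval by 50% when that refetch fails.

```java
// Illustrative only: these names are hypothetical and are NOT Nutch APIs.
public class RefetchPolicySketch {

    /**
     * True when the last fetch is older than the installation-wide maximum
     * interval, in which case the page is refetched regardless of its status.
     */
    static boolean shouldForceRefetch(long lastFetchMillis, long nowMillis,
                                      long maxIntervalMillis) {
        return nowMillis - lastFetchMillis > maxIntervalMillis;
    }

    /**
     * Interval to use after a forced refetch failed: the old interval plus
     * 50%, so the page is retried less often but is never dropped for good.
     */
    static long intervalAfterFailure(long intervalMillis) {
        return intervalMillis + intervalMillis / 2;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long maxInterval = 90 * day;   // assumed installation maximum
        long lastFetch = 0L;           // assumed timestamp of the last fetch
        long now = 120 * day;          // assumed current time
        long interval = 30 * day;      // assumed current fetch interval

        if (shouldForceRefetch(lastFetch, now, maxInterval)) {
            // Pretend the forced refetch failed and back off by 50%.
            interval = intervalAfterFailure(interval);
        }
        System.out.println("next interval in days: " + interval / day);
    }
}
```

With a rule like this a page never reaches a state from which it cannot come back; it is only checked less frequently.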

Now, fixing this the same way in 0.7 would mean that pages no longer end up in PAGE_GONE state. Is this a fix of broken behavior or a new behavior (new feature)? I'm not sure...

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

