> when you get an error while fetching, and you get the
> org.apache.nutch.protocol.retrylater because the max retries have been
> reached, nutch says it has given up and will retry later, when does that
> retry occur?

That's an issue I reported some weeks ago, and in my opinion it is
an annoying bug in Nutch 0.7.1:

Nutch says that it "will retry later" for those pages. In reality, the next
fetch date is set to infinity and those pages are lost forever.
As a consequence, pages that are temporarily unavailable are never
indexed when doing recrawls.
That's why a recrawl based on an existing webdb doesn't make
sense with Nutch 0.7.1. To make sure that temporarily unavailable pages
are considered, you have to do a completely new crawl of all pages
(and throw away the old crawl).
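
To illustrate the difference, here is a minimal sketch. It is not Nutch's
actual code: the class RetrySketch, the PageEntry record, the NEVER sentinel
and all method names are made up for this example. It only assumes that a
page is selected for fetching once its next fetch date has passed, and
contrasts "next fetch date = infinity" with a finite retry delay.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: none of these names come from the Nutch code base.
class RetrySketch {
    // Hypothetical sentinel standing in for "next fetch date is infinite".
    static final long NEVER = Long.MAX_VALUE;
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Hypothetical stand-in for a webdb page entry.
    static final class PageEntry {
        final String url;
        long nextFetchMs;

        PageEntry(String url, long nextFetchMs) {
            this.url = url;
            this.nextFetchMs = nextFetchMs;
        }
    }

    // The behaviour described above: the page never becomes due again.
    static void giveUpForever(PageEntry p) {
        p.nextFetchMs = NEVER;
    }

    // What "will retry later" suggests: the page becomes due after a delay.
    static void retryAfter(PageEntry p, long nowMs, int days) {
        p.nextFetchMs = nowMs + days * DAY_MS;
    }

    // Collect the URLs whose next fetch date has been reached.
    static List<String> dueForFetch(List<PageEntry> db, long nowMs) {
        List<String> due = new ArrayList<>();
        for (PageEntry p : db) {
            if (p.nextFetchMs <= nowMs) {
                due.add(p.url);
            }
        }
        return due;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        PageEntry lost = new PageEntry("http://example.org/a", now);
        PageEntry kept = new PageEntry("http://example.org/b", now);
        giveUpForever(lost);      // dropped from every later recrawl
        retryAfter(kept, now, 1); // due again one day later
        List<PageEntry> db = Arrays.asList(lost, kept);
        System.out.println("Due in two days: " + dueForFetch(db, now + 2 * DAY_MS));
    }
}
```

With the first variant, a later recrawl never sees the page again, however
much time passes. With the second, the page shows up again once the delay is
over.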

I have mentioned this issue on this list a few times and reported it in Jira:
http://issues.apache.org/jira/browse/NUTCH-205

Unfortunately, no Nutch developer seems to be interested in this
serious issue.

Greetings
Oliver




On 3/7/06, Richard Braman <[EMAIL PROTECTED]> wrote:
> when you get an error while fetching, and you get the
> org.apache.nutch.protocol.retrylater because the max retries have been
> reached, nutch says it has given up and will retry later, when does that
> retry occur?  How would you make a fetchlist of all urls that have
> failed?  Is this information maintained somewhere?
>
>
> Richard Braman
> mailto:[EMAIL PROTECTED]
> 561.748.4002 (voice)
>
> http://www.taxcodesoftware.org <http://www.taxcodesoftware.org/>
> Free Open Source Tax Software
>
>
>
>

