Doug Cutting wrote:

The problem is that the fetcher frequently crashes in such a way (process killed, runs out of memory, etc.) that Java code cannot do anything to catch the problem. Hopefully the fetcher will soon get more reliable and this will be less of an issue.

I know other parts of Nutch can read at least partially corrupt output from the fetcher, so it would make sense to add this ability to the fetcher, too :-) Reliability is of course important, but failure and restore modes that preserve already-fetched content are useful, too...

Hmmm... Perhaps writing the segment data should be performed atomically, so that it's kept consistent at all times; and some checkpoint data could be written from time to time, if it cannot already be reconstructed from the leftover data on the next run. Then the fetcher could be restarted on an unfinished segment.
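Something along these lines, perhaps - a minimal sketch of the write-to-temp-then-rename idea (the class and method names are made up for illustration, not existing Nutch code):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// Hypothetical illustration of atomic segment output: write everything to a
// temporary file first, then rename it into place only once it is complete.
// A rename on the same filesystem is atomic, so a crash mid-write leaves
// either the old data or a stray ".tmp" file, never a half-written segment.
// A periodic checkpoint file (e.g. the index of the last completed entry)
// could be published the same way.
public class AtomicSegmentWrite {
  public static void writeAtomically(File dest, String data) throws IOException {
    File tmp = new File(dest.getParentFile(), dest.getName() + ".tmp");
    FileWriter out = new FileWriter(tmp);
    try {
      out.write(data);               // write everything to the temp file
    } finally {
      out.close();
    }
    if (!tmp.renameTo(dest)) {       // publish the finished file in one step
      throw new IOException("rename failed: " + tmp + " -> " + dest);
    }
  }
}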

Related question: is there any way to force an update, even if the pages are marked as already fetched and not older than e.g. 30 days? In other words, I'd like to force re-fetching some pages (by URL pattern or by "[not] older than" date).


The -addDays parameter to the generate command makes it act as if you were running it that many days in the future. Each URL has its own refresh interval, but there's no command yet which updates it for a particular URL. So, for now, the best way to do this is to lower the default update interval while you first inject the URLs you want refreshed more frequently. You could also do it when you update the db, but that's harder, since there will probably be a bunch of other URLs there by then...
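Conceptually, the due check amounts to something like the following sketch (the names here are only illustrative, not the actual webdb API):

// Illustrative only: a page is due for fetching when its next-fetch time
// (last fetch plus its own refresh interval) has passed.  Passing -addDays
// effectively shifts "now" forward by that many days, pulling pages into
// the fetch list earlier than they would otherwise be selected.
public class DueCheck {
  static final long MS_PER_DAY = 24L * 60 * 60 * 1000;

  public static boolean isDue(long lastFetchTime, int refreshIntervalDays,
                              long now, int addDays) {
    long effectiveNow = now + addDays * MS_PER_DAY;
    long nextFetchTime = lastFetchTime + refreshIntervalDays * MS_PER_DAY;
    return nextFetchTime <= effectiveNow;
  }
}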

Looking at the logic in FetchListTool, I don't understand where the new "last modified" date is written back to the webdb. I must be blind, because the whole thing wouldn't work otherwise... As soon as I understand this part, modifying FetchListTool to do what I need should be straightforward, I think...


And yet another question :-) - how do I remove URLs from the webdb?


There's no command yet to do that. If you put a negated regular expression in your urlfilter.regex.file, the URLs will never enter the db in the first place. We should probably add a command which re-filters the URLs in the db, removing any that are no longer permitted. That way, to remove URLs, you'd just add some regexps and run that command. Can you log a bug requesting something like this? Thanks.

Oops. It seems to be there now: there are deletePage(url) / deleteLink(url) calls in WebDBWriter, with the corresponding argument parsing in main(), so it looks like this is only a gap in the docs :-). There is no re-filter command, though...
Also, it looks like the deleteLink() method is missing from the IWebDBWriter interface.
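For reference, such a re-filter command might look roughly like this sketch (the UrlFilter/DbWriter interfaces below are stand-ins for the real URL filter plugin and webdb writer, not verified Nutch signatures):

import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the proposed re-filter command: walk every page
// URL currently in the webdb, re-apply the configured URL filters, and
// delete any page whose URL is no longer accepted.  The accept() and
// deletePage() wiring is an assumption about the API, not actual Nutch code.
public class RefilterTool {
  interface UrlFilter { boolean accept(String url); }
  interface DbWriter  { void deletePage(String url); void close(); }

  public static void refilter(List urls, UrlFilter filter, DbWriter writer) {
    for (Iterator it = urls.iterator(); it.hasNext(); ) {
      String url = (String) it.next();
      if (!filter.accept(url)) {      // URL no longer passes the filters
        writer.deletePage(url);       // so drop it from the db
      }
    }
    writer.close();
  }
}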


--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




