Robin Haswell wrote:
On Fri, 2006-12-08 at 12:41 +0100, Andrzej Bialecki wrote:
Yes, most likely. Running complex regexes on hostile data, such as unknown URLs, quite often ends up like this - that's why many Internet-wide installations don't use regexes but combinations of prefix/suffix/custom filters. If you were running the fetcher in non-parsing mode, this wouldn't happen during fetching but during parsing - and you could've changed your config and restarted just the parsing, without refetching ... ah well.

Anyway - it's most likely not hung, just running very, very slowly. You could give it a chance and let it run a few hours more; perhaps it will get past these troublesome URLs. Keep watching the size of the temporary data - if the files are not growing at all, then I'm afraid you will have to kill the job, and avoid your boss for a couple of days ... :/
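One way to do the "watch the temp data" check without eyeballing it is a small script that sums the file sizes under a directory twice and compares them. This is only a rough sketch: the directory path is whatever your job writes its temporary data to (not something stated in this thread), and the names `total_size` and `is_growing` and the 60-second interval are made up for illustration.

```python
import os
import time


def total_size(path):
    """Sum the sizes in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file vanished between listing and stat; skip it.
                pass
    return total


def is_growing(path, interval_seconds=60, sleep=time.sleep):
    """Return True if the data under path grew during the interval.

    interval_seconds is an arbitrary choice; a slow job may need a
    much longer interval before any growth becomes visible.
    """
    before = total_size(path)
    sleep(interval_seconds)
    return total_size(path) > before
```

A single False does not prove the job is dead, so it is probably worth sampling a few times over a longer period before killing anything.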

(By the way, one can encounter the weirdest things in the wild ... I've seen URLs that are several kilobytes long, containing all sorts of illegal characters, containing nested unescaped URLs with invalid protocols and so on ... so, when crawling the Internet at large you should be prepared for getting really nasty stuff. Complex regexes don't cut it.)


I see, thanks. Ah well. The scope of my regex is simply
glob("http://*.uk/*"). What filters would you recommend for doing this?

Prefix filter to cut off anything without "http://". And then a (non-existent) domain-suffix filter, which considers only domain suffixes - this is easy to implement based on the suffix filter that ships with Nutch.
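To make the prefix-plus-domain-suffix idea concrete, here is a minimal sketch in plain Python. It is not Nutch's plugin API and not the suffix filter that ships with Nutch, just the same logic written out; the name `accept_url`, the constants, and the 2048-character length cap are assumptions added for illustration. The point is that only plain string comparisons touch the URL, so a pathological URL cannot send the check into backtracking.

```python
from urllib.parse import urlsplit

ALLOWED_PREFIXES = ("http://",)   # prefix filter: scheme must match
ALLOWED_SUFFIXES = (".uk",)       # domain-suffix filter: host must end with one
MAX_URL_LENGTH = 2048             # hypothetical cap, not from the thread


def accept_url(url):
    """Return True if url passes the prefix and domain-suffix checks."""
    if len(url) > MAX_URL_LENGTH:
        return False
    if not url.lower().startswith(ALLOWED_PREFIXES):
        return False
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # Malformed authority part (for example a broken IPv6 literal).
        return False
    if not host:
        return False
    # Compare the host only, so a ".uk" hidden in the path or query
    # of some other domain's URL does not match.
    return host.rstrip(".").endswith(ALLOWED_SUFFIXES)
```

For example, `accept_url("http://example.co.uk/page")` is accepted, while `accept_url("http://example.com/?next=http://foo.uk/")` is rejected because only the host `example.com` is compared against the suffix list.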

I'm guessing my use-case is pretty much the same as everyone else's -
people who want everything from a domain. Is it wise to ship with the
regex urlfilter as the default filter?

Anyway, any help would be great. I'll keep an eye on the temp data. If
it rises I'll probably leave it going.

Thanks

-Rob




--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

