Robin Haswell wrote:
On Fri, 2006-12-08 at 12:41 +0100, Andrzej Bialecki wrote:
Yes, most likely. Running complex regexes on hostile data, such as
unknown URLs, quite often ends up like this - that's why many
Internet-wide installations don't use regexes but combinations of
prefix/suffix/custom filters. If you were running the fetcher in
non-parsing mode, this wouldn't happen during fetching but during
parsing - and you could've changed your config and restarted just the
parsing, without refetching ... ah well.
Anyway - it's most likely not hung, but running very, very slowly. You
could give it a chance and let it run a few hours more, perhaps it will
go past these troublesome URLs, and keep watching the size of temporary
data - if the files are not growing at all, then I'm afraid you will
have to kill the job, and avoid your boss for a couple of days ... :/
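
One way to do the "watch the temporary data" check without eyeballing
directory listings is sketched below. It is only an illustration: the
function names dir_size and is_growing are made up here, and the path you
pass in is whatever temporary directory your own job writes to (the
thread does not say where that is).

```python
import os
import time


def dir_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # Temp files can vanish while we walk; skip those.
                pass
    return total


def is_growing(path, interval_seconds=60):
    """Sample the directory size twice and report whether it increased."""
    before = dir_size(path)
    time.sleep(interval_seconds)
    return dir_size(path) > before
```

If is_growing keeps returning False over several long intervals, that
matches the "files are not growing at all" case described above.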
(By the way, one can encounter the weirdest things in the wild ... I've
seen URLs that are several kilobytes long, containing all sorts of
illegal characters, containing nested unescaped URLs with invalid
protocols and so on ... so, when crawling the Internet at large you
should be prepared for getting really nasty stuff. Complex regexes don't
cut it).
I see, thanks. Ah well. The scope of my regex is simply
glob("http://*.uk/*"). What filters would you recommend for doing this?
Prefix filter to cut off anything without "http://". And then a
(non-existent) domain-suffix filter, which considers only domain
suffixes - this is easy to implement based on the suffix filter that
ships with Nutch.
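
To make the idea concrete, here is a rough sketch of a prefix check
followed by a domain-suffix check, using plain string operations only.
It is written in Python for brevity and is not the Nutch plugin API; the
names host_of and accept are hypothetical, and the ".uk" suffix is taken
from the glob quoted above.

```python
def host_of(url, prefix="http://"):
    """Return the lowercased host of url, or None if the prefix is missing."""
    if not url.lower().startswith(prefix):
        return None
    rest = url[len(prefix):]
    # The authority part ends at the first '/', '?' or '#'.
    end = len(rest)
    for sep in "/?#":
        pos = rest.find(sep)
        if pos != -1:
            end = min(end, pos)
    authority = rest[:end]
    # Drop optional userinfo ("user:pass@") and port (":8080").
    host = authority.rsplit("@", 1)[-1].split(":", 1)[0]
    # Ignore a trailing dot; an empty host counts as no host.
    return host.lower().rstrip(".") or None


def accept(url, suffixes=(".uk",)):
    """Accept url only if its host ends with one of the domain suffixes."""
    host = host_of(url)
    return host is not None and host.endswith(tuple(suffixes))


if __name__ == "__main__":
    for u in ("http://www.example.co.uk/page",
              "http://example.com/?next=http://evil.uk/",
              "ftp://example.co.uk/"):
        print(u, accept(u))
```

Each URL is handled in a single pass with no backtracking, so a
kilobyte-long or malformed URL costs no more than its length. Note that
this only looks at the host, so it is stricter than the glob, which
would also match ".uk/" appearing anywhere later in the URL.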
I'm guessing my use-case is pretty much the same as everyone else's -
people who want everything from a domain. Is it wise to ship with the
regex urlfilter as the default filter?
Anyway, any help would be great. I'll keep an eye on the temp data. If
it rises I'll probably leave it going.
Thanks
-Rob
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com