Robin Haswell wrote:
> "Thread-0" prio=1 tid=0x00002aab361b28b0 nid=0x4752 runnable
> [0x0000000040bc5000..0x0000000040bc5cc0]
>         at java.lang.Character.codePointAt(Character.java:2335)
>         at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
>         at java.util.regex.Pattern$Curly.match1(Pattern.java:4256)
>         at java.util.regex.Pattern$Curly.match(Pattern.java:4199)
>         at java.util.regex.Pattern$Single.match(Pattern.java:3314)
>         at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
>         at java.util.regex.Pattern$Curly.match1(Pattern.java:4250)
>         at java.util.regex.Pattern$Curly.match(Pattern.java:4199)
>         at java.util.regex.Pattern$Single.match(Pattern.java:3314)
>         at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
>         at java.util.regex.Pattern$Curly.match0(Pattern.java:4235)
>         at java.util.regex.Pattern$Curly.match(Pattern.java:4197)
>         at java.util.regex.Pattern$Start.match(Pattern.java:3019)
>         at java.util.regex.Matcher.search(Matcher.java:1092)
>         at java.util.regex.Matcher.find(Matcher.java:528)
>         at org.apache.nutch.urlfilter.regex.RegexURLFilter$Rule.match(RegexURLFilter.java:86)
> Does this mean the regexfilter has hung?

Yes, most likely. Running complex regexes on hostile data, such as unknown URLs, quite often ends up like this - which is why many Internet-wide installations don't use regexes at all, but combinations of prefix/suffix/custom filters.

If you had been running the fetcher in non-parsing mode, this wouldn't have happened during fetching but during parsing - and you could have changed your config and restarted just the parsing step, without refetching ... ah well.

Anyway - it's most likely not hung, but running very, very slowly. You could give it a chance and let it run a few hours more; perhaps it will get past these troublesome URLs. Keep watching the size of the temporary data - if the files are not growing at all, then I'm afraid you will have to kill the job, and avoid your boss for a couple of days ... :/

(By the way, one can encounter the weirdest things in the wild ... I've seen URLs that are several kilobytes long, containing all sorts of illegal characters, with nested unescaped URLs and invalid protocols, and so on. When crawling the Internet at large, you should be prepared for really nasty stuff. Complex regexes don't cut it.)

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
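
To see concretely what "not hung, but running very, very slowly" means, here is a minimal, self-contained Java sketch of the catastrophic backtracking visible in the stack trace above. The pattern and input are hypothetical stand-ins, not anything from a real Nutch configuration: the nested quantifier forces the engine to try every way of splitting the 'a's between the inner and outer '+', so each extra input character roughly doubles the work.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BacktrackDemo {
    public static void main(String[] args) {
        // Nested quantifiers: a classic pathological pattern (hypothetical,
        // chosen only to demonstrate exponential backtracking).
        Pattern p = Pattern.compile("^(a+)+$");

        // Hostile input: a long run of 'a's ending in a character that
        // guarantees the match fails, so every split must be explored.
        String hostileUrl = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!";

        long start = System.currentTimeMillis();
        Matcher m = p.matcher(hostileUrl);
        boolean found = m.find();  // not hung - just exponentially slow

        System.out.println("found=" + found + " in "
                + (System.currentTimeMillis() - start) + " ms");
    }
}

With around 30 characters of input this can already run for minutes to hours; remove the trailing '!' and the same call returns instantly. That is what makes such patterns treacherous: they behave perfectly on friendly input and only blow up on the nasty URLs described above.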
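
For contrast, a sketch of the prefix/suffix style of filtering recommended above. The class name and rule lists here are hypothetical (Nutch ships real plugins for this, urlfilter-prefix and urlfilter-suffix, driven by configuration files); the point is that every check is a plain string comparison, linear in the URL's length, with no backtracking to exploit.

import java.util.Arrays;
import java.util.List;

public class SimpleUrlFilter {
    // Hypothetical rule sets; a real deployment would load these from config.
    private static final List<String> ALLOWED_PREFIXES =
            Arrays.asList("http://", "https://");
    private static final List<String> REJECTED_SUFFIXES =
            Arrays.asList(".gif", ".jpg", ".zip", ".exe");

    /** Accept a URL iff it has a known prefix and no rejected suffix. */
    public static boolean accept(String url) {
        for (String suffix : REJECTED_SUFFIXES) {
            if (url.endsWith(suffix)) {
                return false;  // cheap comparison, O(|suffix|)
            }
        }
        for (String prefix : ALLOWED_PREFIXES) {
            if (url.startsWith(prefix)) {
                return true;   // cheap comparison, O(|prefix|)
            }
        }
        return false;          // unknown scheme: reject by default
    }
}

A multi-kilobyte URL full of garbage costs the same handful of character comparisons as a clean one - exactly the property the regex filter lacks.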