Robin Haswell wrote:
> "Thread-0" prio=1 tid=0x00002aab361b28b0 nid=0x4752 runnable
> [0x0000000040bc5000..0x0000000040bc5cc0]
>         at java.lang.Character.codePointAt(Character.java:2335)
>         at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
>         at java.util.regex.Pattern$Curly.match1(Pattern.java:4256)
>         at java.util.regex.Pattern$Curly.match(Pattern.java:4199)
>         at java.util.regex.Pattern$Single.match(Pattern.java:3314)
>         at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
>         at java.util.regex.Pattern$Curly.match1(Pattern.java:4250)
>         at java.util.regex.Pattern$Curly.match(Pattern.java:4199)
>         at java.util.regex.Pattern$Single.match(Pattern.java:3314)
>         at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
>         at java.util.regex.Pattern$Curly.match0(Pattern.java:4235)
>         at java.util.regex.Pattern$Curly.match(Pattern.java:4197)
>         at java.util.regex.Pattern$Start.match(Pattern.java:3019)
>         at java.util.regex.Matcher.search(Matcher.java:1092)
>         at java.util.regex.Matcher.find(Matcher.java:528)
>         at org.apache.nutch.urlfilter.regex.RegexURLFilter$Rule.match(RegexURLFilter.java:86)
>
> Does this mean the regexfilter has hung?

Yes, most likely. Running complex regexes on hostile data, such as 
unknown URLs, quite often ends up like this - that's why many 
Internet-wide installations don't use regexes, but combinations of 
prefix/suffix/custom filters. If you had been running the fetcher in 
non-parsing mode, this would have happened during parsing rather than 
during fetching - and you could have changed your config and restarted 
just the parsing step, without refetching ... ah well.
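
To see why, here is a tiny self-contained demo (a deliberately 
pathological pattern of my own, not one of the actual Nutch filter 
rules): nested quantifiers make java.util.regex try an exponential 
number of ways to split the input before it can report "no match".

import java.util.regex.Pattern;

// Illustrative only - "(a+)+b" is an artificially evil pattern, not a
// real Nutch rule. On input containing no 'b', the engine backtracks
// through roughly 2^n ways of splitting the 'a's between the two
// quantifiers before giving up.
public class BacktrackDemo {
    public static void main(String[] args) {
        Pattern evil = Pattern.compile("(a+)+b");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 40; i++) sb.append('a');
        // Looks hung, but is really just busy backtracking - the same
        // state your thread dump shows inside Pattern$Curly.match().
        System.out.println(evil.matcher(sb).find());
    }
}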

Anyway - most likely it's not hung, it's just running very, very 
slowly. You could give it a chance and let it run for a few more hours; 
perhaps it will get past these troublesome URLs. Keep watching the size 
of the temporary data - if the files are not growing at all, then I'm 
afraid you will have to kill the job, and avoid your boss for a couple 
of days ... :/
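
If you want to guard against this in your own code, one common trick 
(a sketch under my own assumptions - the class name is made up, and 
this is not part of the Nutch API) is to wrap the input in a 
CharSequence that enforces a deadline. Matcher reads its input through 
charAt() constantly while backtracking, so a runaway match gets cut 
off quickly:

// Sketch only - not Nutch code. Aborts a runaway Matcher by throwing
// from charAt() once a deadline has passed.
final class DeadlineCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadlineNanos;

    DeadlineCharSequence(CharSequence inner, long timeoutMillis) {
        this.inner = inner;
        this.deadlineNanos = System.nanoTime() + timeoutMillis * 1000000L;
    }

    public char charAt(int index) {
        // Matcher calls this on every character it inspects, even deep
        // inside a backtracking loop, so the check always fires.
        if (System.nanoTime() > deadlineNanos)
            throw new RuntimeException("regex match timed out");
        return inner.charAt(index);
    }

    public int length() { return inner.length(); }

    public CharSequence subSequence(int start, int end) {
        return inner.subSequence(start, end);
    }
}

Then something like 
pattern.matcher(new DeadlineCharSequence(url, 100)).find() will throw 
after ~100 ms instead of spinning forever on a hostile URL.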

(By the way, you can encounter the weirdest things in the wild ... I've 
seen URLs that are several kilobytes long, full of all sorts of 
illegal characters, containing nested unescaped URLs with invalid 
protocols, and so on ... so when crawling the Internet at large you 
should be prepared for really nasty stuff. Complex regexes don't 
cut it.)
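
That's the appeal of the prefix/suffix approach mentioned above: a 
filter built on plain startsWith()/endsWith() does a bounded amount of 
work per rule, no matter how pathological the URL is. A minimal sketch 
of the principle (Nutch ships real PrefixURLFilter/SuffixURLFilter 
plugins; the names below are made up for illustration):

import java.util.List;

// Sketch only - follows the Nutch URLFilter convention of returning
// the URL to accept it and null to reject it, but is not Nutch code.
public class SimplePrefixFilter {
    private final List<String> prefixes;

    public SimplePrefixFilter(List<String> prefixes) {
        this.prefixes = prefixes;
    }

    public String filter(String url) {
        // startsWith() touches at most prefix.length() characters -
        // there is no backtracking, so no URL can make this slow.
        for (String p : prefixes) {
            if (url.startsWith(p)) {
                return url;   // accept
            }
        }
        return null;          // reject
    }
}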

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


