Thank you guys for the hints and help. I managed to find the root of the problem. When the reducer got into the freezing state I dumped the core, and here are the results:
at java.util.regex.Pattern$Curly.match1(Pattern.java:4250)
at java.util.regex.Pattern$Curly.match(Pattern.java:4199)
at java.util.regex.Pattern$Single.match(Pattern.java:3314)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$Curly.match1(Pattern.java:4250)
at java.util.regex.Pattern$Curly.match(Pattern.java:4199)
at java.util.regex.Pattern$Single.match(Pattern.java:3314)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4235)
at java.util.regex.Pattern$Curly.match(Pattern.java:4197)
at java.util.regex.Pattern$Start.match(Pattern.java:3019)
at java.util.regex.Matcher.search(Matcher.java:1092)
at java.util.regex.Matcher.find(Matcher.java:528)
at org.apache.nutch.urlfilter.regex.RegexURLFilter$Rule.match(RegexURLFilter.java:86)
at org.apache.nutch.urlfilter.api.RegexURLFilterBase.filter(RegexURLFilterBase.java:116)
- locked <0x00002aaafcc20468> (a org.apache.nutch.urlfilter.regex.RegexURLFilter)
at org.apache.nutch.net.URLFilters.filter(URLFilters.java:82)
at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:120)
at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:97)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:263)
at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:277)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1211)

The reducers freeze because the output format applies the URL filter at this stage. I added log statements to the output format class in order to catch the URLs that cause the problem.
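For context, RegexURLFilter reads its rules from conf/regex-urlfilter.txt, where each non-comment line is a sign character followed by a java.util.regex pattern. A sketch of the relevant rule (the same one quoted below in this message; I believe it is the stock crawl-trap guard shipped with Nutch):

```
# conf/regex-urlfilter.txt -- each non-comment line is a sign character
# ('+' = accept, '-' = reject) followed by a java.util.regex pattern.
# The rule below rejects URLs in which a slash-delimited segment
# repeats three or more times (a guard against crawler traps).
-.*(/.+?)/.*?\1/.*?\1/
```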
Since it was 4,500,000 URLs on 9 machines, it was a real pain to catch the URLs. Here are 3 troublemaking URLs:

http://www.modelpower.com/
http://www.discountedboots.com/
http://www.foreverwomen.com/site/724586/

Then I tried a local crawl using these URLs and put some logging at RegexURLFilter.java:86. I could catch that the regex (-.*(/.+?)/.*?\1/.*?\1/) takes more than 10 minutes. The problem is that the JavaScript parser extracts some bogus links like this:

http://www.discountedboots.com/<SELECT%20%20NAME%3D%22EDIT_BROWSE%22<http://www.discountedboots.com/%3cSELECT%20%20NAME%3D%22EDIT_BROWSE%22>> ………

These links are very long and contain lots of '/' characters. They are created from scripts like this:

drawBrowseMenu('<SELECT%20%20NAME%3D%22EDIT_BROWSE%22>........
makeDropBox('<SELECT%20%20NAME%3D%22EDIT_BROWSE%22>........

and this explanation holds for every page I could catch. I am not sure what it is about (-.*(/.+?)/.*?\1/.*?\1/) that causes this long delay or infinite loop! At the least, I guess the js-parser needs to be fixed so it ignores these things. Alternatively, we could have a timer thread in RegexURLFilter.java, so that when the filtering takes more than 200 ms it rejects the URL and exits. What do you guys think?

Thanks,
Mike

On 10/18/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Mike Smith wrote:
> I am in the same state again, and the same reduce jobs keep failing on
> different machines. I cannot get the dump using kill -3 <pid>; it does
> not make the thread quit. Also, I tried to place some logging into
> FetcherOutputFormat, but because of this bug:
> https://issues.apache.org/jira/browse/HADOOP-406
> logging is not possible in the child threads. Do you have any idea why
> the reducers don't catch the QUIT signal from the cache? I am running
> the latest version from SVN; otherwise I could log some key/value and
> URL filtering information at the reduce stage.

SIGQUIT should not make the JVM quit; it should produce a thread dump on stderr. You need to manually pick out the process that corresponds to the child JVM of the task, e.g. with top(1) or ps(1), and then execute 'kill -SIGQUIT <pid>'. You can use Hadoop's log4j.properties to quickly enable a lot of log info, including stderr -- put it in conf on every tasktracker and restart the cluster.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
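Mike's 200 ms timer idea from earlier in the thread could be sketched along the following lines. This is an illustration only, not Nutch code; the class and method names here are invented. The trick is that a plain timer cannot stop Matcher.find() mid-backtrack, so the match runs over a CharSequence that polls the worker thread's interrupt flag, letting a timed-out match actually be aborted:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

public class TimedUrlFilter {
    // Single daemon worker so a stuck match cannot keep the JVM alive.
    private static final ExecutorService POOL =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });

    /** Returns true if the pattern matches within timeoutMs; on timeout the
     *  match is interrupted and the URL is treated as rejected (false). */
    static boolean findWithTimeout(Pattern p, String url, long timeoutMs) {
        Future<Boolean> f = POOL.submit(
            () -> p.matcher(new InterruptibleCharSequence(url)).find());
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);  // interrupts the worker; charAt() below aborts the match
            return false;
        } catch (Exception e) {
            return false;
        }
    }

    /** CharSequence wrapper that aborts matching once its thread is interrupted.
     *  The regex engine calls charAt() constantly, so the check fires quickly. */
    static class InterruptibleCharSequence implements CharSequence {
        private final CharSequence inner;
        InterruptibleCharSequence(CharSequence inner) { this.inner = inner; }
        public char charAt(int index) {
            if (Thread.currentThread().isInterrupted()) {
                throw new RuntimeException("regex match interrupted");
            }
            return inner.charAt(index);
        }
        public int length() { return inner.length(); }
        public CharSequence subSequence(int start, int end) {
            return new InterruptibleCharSequence(inner.subSequence(start, end));
        }
        public String toString() { return inner.toString(); }
    }

    public static void main(String[] args) {
        // Rule body without the leading '-' (the '-' in the rule file is the
        // reject sign, not part of the pattern).
        Pattern trap = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
        // Too few slashes for a repeated segment -> no match, returns fast.
        System.out.println(findWithTimeout(trap, "http://www.modelpower.com/", 200));
        // Segment "/x" repeats three times -> the trap rule fires.
        System.out.println(findWithTimeout(trap, "http://a/x/b/x/c/x/", 200));
    }
}
```

A timed-out URL is simply dropped here, which matches the proposal in the thread: losing the occasional pathological bogus link is far cheaper than freezing a reducer for 10 minutes per URL.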
