The thread dumps pointed me to the regex URL filter and greedy pattern matching. There appears to be a long-standing issue in the JVM's regular expression engine where the "wrong" regular expression causes the program to hang and the CPU to go to 100%, which is exactly the behavior we are seeing. That would also explain why the problem doesn't appear until the "right" URL comes up. See this link for a complete explanation:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6393051

After reviewing the regular expressions in the regex-urlfilter.txt file, here is what I think needs to be changed.

this:
-.*(/.+?)/.*?\1/.*?\1/

changed to:
-.*?(/.+?)/.*?\1/.*?\1/

I am currently testing this to see if it runs correctly without stalling as before. The problem is that I am not a regular expressions expert. Will changing this regex affect the expression in a negative way?

Dennis

Dennis Kubes wrote:
> I will start taking a look at some thread dumps. It is not the
> sorting phase. It gets past the sort and gets through part of the
> reduce phase (and always the same percentage; when the job restarts
> on the same machine it gets to the same part again before stalling
> again). And this is happening on multiple machines so I don't think it
> is a machine problem. Again I need to spend some time looking through
> thread dumps.
>
> Dennis
>
> Andrzej Bialecki wrote:
>> Dennis Kubes wrote:
>>> Do you think it is the parsing that is causing it?
>>
>> Just checking ... probably not. You could figure out from a thread
>> dump where it's spending time.
>>
>>
>>> I was looking at a smaller fetching run and the cpu gets pushed to
>>> 100% as well but the reports keep happening. This only seems to
>>> happen when I run very large fetches (> 500K pages). I just ran a
>>> 100K fetch and it worked just fine. Should I have some special
>>> settings for larger fetches?
>>
>> You could try tweaking the io.sort values, if it times out during the
>> sorting phase.
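P.S. On the question of whether the reluctant `.*?` changes what the rule matches: here is a minimal Java sketch for checking the two rule bodies against each other. It assumes the filter applies each rule as an unanchored search (`Matcher.find()`), which I have not verified against the filter source. The class name `RegexFilterCheck`, the helper `matches`, and the sample URLs are made up for illustration, and the leading `-` is left off because it is the exclusion marker in regex-urlfilter.txt, not part of the expression.

```java
import java.util.regex.Pattern;

final class RegexFilterCheck {
    // Original rule body: greedy leading ".*".
    static final Pattern GREEDY = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    // Proposed rule body: reluctant leading ".*?".
    static final Pattern RELUCTANT = Pattern.compile(".*?(/.+?)/.*?\\1/.*?\\1/");

    // Assumption: a rule fires when the pattern is found anywhere in the URL.
    static boolean matches(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs: one ordinary path, one with a path
        // segment ("/foo") that repeats three times.
        String[] urls = {
            "http://example.com/a/b/c/index.html",
            "http://example.com/foo/bar/foo/bar/foo/"
        };
        for (String url : urls) {
            System.out.println(url
                + " greedy=" + matches(GREEDY, url)
                + " reluctant=" + matches(RELUCTANT, url));
        }
    }
}
```

As far as I understand it, a greedy and a reluctant quantifier accept the same strings and only differ in the order in which the engine tries the alternatives, so under an unanchored search both versions should reject and accept the same URLs. For the same reason, the change may speed up URLs that do match but may not help on URLs that fail to match, since the engine still has to try every alternative before giving up.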
