Re: Fetcher Stops Reports Pushes CPU to 100%

Dennis Kubes Fri, 09 Jun 2006 18:38:03 -0700

The thread dumps pointed me to the Regex URL Filter and greedy pattern matching. It seems that there is a standing "error" in the JVM where the "wrong" regular expression will cause the program to hang and the CPU to go to 100%, which is basically the behavior we are seeing. This would make sense, as the error wouldn't appear unless the "right" URL came up. See this link for a complete explanation.


http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6393051

After reviewing the regular expressions in the regex-urlfilter.txt file, here is what I think needs to be changed.


this: -.*(/.+?)/.*?\1/.*?\1/
changed to: -.*?(/.+?)/.*?\1/.*?\1/

I am currently testing this to see if it runs correctly without stalling as before. The problem is that I am not a regular expressions expert. Will changing this regex affect this expression in a negative way?
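One way to check whether the change alters what the rule rejects is to run both patterns over a handful of URLs and compare the results. The sketch below is only an illustration: the class name, the helper method, and the sample URLs are made up, and it assumes the filter applies each rule with "find" semantics (a match anywhere in the URL), which should be confirmed against the actual filter code. Under that assumption, switching the leading `.*` to `.*?` changes the order in which the matcher tries alternatives, not whether a match exists, so both patterns should agree on every URL. It does not show whether the hang is gone; that still needs the long-running fetch test.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: compares the two rule bodies
// (without the leading "-" that marks an exclude rule in regex-urlfilter.txt).
class UrlLoopFilterCheck {
    // Original rule body, with a greedy leading quantifier.
    static final Pattern GREEDY = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    // Proposed rule body, with a reluctant leading quantifier.
    static final Pattern RELUCTANT = Pattern.compile(".*?(/.+?)/.*?\\1/.*?\\1/");

    // Assumption: a rule rejects a URL when the pattern is found anywhere in it.
    static boolean rejects(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        // Made-up sample URLs: one ordinary path and two with a path
        // segment that repeats three times.
        String[] urls = {
            "http://example.com/a/b/c/index.html",
            "http://example.com/foo/bar/foo/bar/foo/bar/",
            "http://example.com/a/x/a/y/a/z"
        };
        for (String url : urls) {
            boolean before = rejects(GREEDY, url);
            boolean after = rejects(RELUCTANT, url);
            System.out.println(before + " " + after + " " + url);
        }
    }
}
```

If the two columns ever differ for a URL from a real crawl, that URL would be worth posting to the list.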


Dennis

Dennis Kubes wrote:

I will start taking a look at some thread dumps. It is not the sorting phase. It gets past the sort and gets through part of the reduce phase (and always the same percentage; when the job is restarted on the same machine it gets to the same part again before stalling again). And this is happening on multiple machines, so I do not think it is a machine problem. Again, I need to spend some time looking through thread dumps.
Dennis

Andrzej Bialecki wrote:
Dennis Kubes wrote:
Do you think it is the parsing that is causing it?
Just checking ... probably not. You could figure out from a thread dump where it's spending time.
I was looking at a smaller fetching run and the CPU gets pushed to 100% as well, but the reports keep happening. This only seems to happen when I run very large fetches (> 500K pages). I just ran a 100K fetch and it worked just fine. Should I have some special settings for larger fetches?
You could try tweaking the io.sort values, if it times out during the sorting phase.
