The thread dumps pointed me to the Regex URL Filter and greedy pattern 
matching.  It seems that there is a long-standing bug in the JVM where 
certain regular expressions cause the program to hang and the CPU to go 
to 100%, which is basically the behavior that we are seeing.  This 
would also explain why the problem doesn't appear until the "right" 
URL comes up.  See this link for a complete explanation.

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6393051

After reviewing the regular expressions in the regex-urlfilter.txt file, 
here is what I think needs to be changed.

this: -.*(/.+?)/.*?\1/.*?\1/
changed to: -.*?(/.+?)/.*?\1/.*?\1/

I am currently testing this to see if it runs correctly without stalling 
as before.  The problem is that I am not a regular expressions expert.  
Will this change alter what the expression matches in a negative way?
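
One way to check this is to run both versions of the rule against a few 
URLs and compare the results.  As far as I understand, a leading .* and 
a leading .*? accept the same set of strings; greedy versus reluctant 
only changes the order in which the engine tries candidate positions.  
Below is a minimal sketch of such a check.  The class name and the 
sample URLs are made up for illustration, and it assumes the filter 
applies each rule with find() after stripping the leading '-'.

```java
import java.util.regex.Pattern;

// Hypothetical helper for comparing the two rules; not part of Nutch.
final class SlashLoopRuleCheck {
    // Rule bodies from regex-urlfilter.txt without the leading '-'
    // (assumed here to be the exclude marker, not part of the regex).
    static final Pattern ORIGINAL = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    static final Pattern CHANGED = Pattern.compile(".*?(/.+?)/.*?\\1/.*?\\1/");

    // True if the original (greedy prefix) rule matches somewhere in the URL.
    static boolean original(String url) {
        return ORIGINAL.matcher(url).find();
    }

    // True if the changed (reluctant prefix) rule matches somewhere in the URL.
    static boolean changed(String url) {
        return CHANGED.matcher(url).find();
    }

    public static void main(String[] args) {
        // Made-up sample URLs: the first repeats the "/a" segment three
        // times, the second has no repeated segment.
        String[] urls = {
            "http://example.com/a/b/a/c/a/",
            "http://example.com/a/b/c/"
        };
        for (String url : urls) {
            System.out.println(url + " original=" + original(url)
                    + " changed=" + changed(url));
        }
    }
}
```

If the two columns agree for every URL you try, the change has not 
altered what gets filtered.  This only compares match results; it says 
nothing about how long either pattern takes on a pathological URL.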

Dennis

Dennis Kubes wrote:
> I will start taking a look at some thread dumps.  It is not the 
> sorting phase.  It gets past the sort and gets through part of the 
> reduce phase (and always the same percentage; when the job is restarted 
> on the same machine it gets to the same part again before stalling 
> again).  And this is happening on multiple machines so I do not think 
> it is a machine problem.  Again I need to spend some time looking through 
> thread dumps.
>
> Dennis
>
> Andrzej Bialecki wrote:
>> Dennis Kubes wrote:
>>> Do you think it is the parsing that is causing it?
>>
>> Just checking ... probably not. You could figure out from a thread 
>> dump where it's spending time.
>>
>>
>>> I was looking at a smaller fetching run and the cpu gets pushed to 
>>> 100% as well but the reports keep happening.  This only seems to 
>>> happen when I run very large fetches (> 500K pages).  I just ran a 
>>> 100K fetch and it worked just fine.  Should I have some special 
>>> settings for larger fetches?
>>
>> You could try tweaking the io.sort values, if it times out during the 
>> sorting phase.
>>


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
