I just completed a 500K run with no problems.  To get it to work I had to 
comment out the -.*(/.+?)/.*?\1/.*?\1/ filter.  Even using 
-.*?(/.+?)/.*?\1/.*?\1/ would stall it.  The good news is that it stalls 
consistently, so although it will take me a while, I can narrow down which 
URL is causing it.  I don't think the other patterns in the default 
regex-urlfilter file will cause any problems because they are not 
greedy, but I would suggest that we either change this regular 
expression or remove it from the default install altogether.
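
Until the offending URL is found, one way to keep a pattern like this from hanging a run is to give each match a time budget. Below is a minimal sketch, not the actual Nutch filter code: the class and method names (GuardedRegexSketch, findsWithin, DeadlineCharSequence) are made up for illustration, and the use of find() is an assumption about how the filter applies its rules. It wraps the URL in a CharSequence that throws once a deadline has passed, so a runaway backtracking match is abandoned instead of pinning the CPU.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GuardedRegexSketch {

    /** Hypothetical exception used to abandon a match that runs too long. */
    static final class RegexTimeout extends RuntimeException {
        RegexTimeout() { super("regex match exceeded its time budget"); }
    }

    /** CharSequence wrapper that checks a deadline on every character read. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override public char charAt(int index) {
            // The regex engine reads input through charAt, so heavy
            // backtracking keeps passing through this check.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeout();
            }
            return inner.charAt(index);
        }

        @Override public int length() { return inner.length(); }

        @Override public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override public String toString() { return inner.toString(); }
    }

    // The rule from the default regex-urlfilter file, without the leading '-'.
    static final Pattern REPEATED_SEGMENT = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    /** Returns true on a match; false on no match or when the budget runs out. */
    static boolean findsWithin(Pattern pattern, String url, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        Matcher matcher = pattern.matcher(new DeadlineCharSequence(url, deadline));
        try {
            return matcher.find();
        } catch (RegexTimeout timedOut) {
            return false; // the caller decides whether a timeout means accept or reject
        }
    }

    public static void main(String[] args) {
        String url = "http://example.com/a/x/a/y/a/"; // made-up URL with a repeated segment
        System.out.println(findsWithin(REPEATED_SEGMENT, url, 1000));
    }
}
```

Treating a timeout as "no match" is only one possible policy; for a minus rule it might be safer to reject any URL whose match times out.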

Dennis

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> The thread dumps pointed me to the Regex URL Filter and greedy 
>> pattern matching.  It seems that there is a standing "error" in the 
>> JVM where the "wrong" regular expression will cause the program to 
>> hang and the CPU to go to 100%, which is basically the behavior that 
>> we are seeing.  This would make sense, as the error wouldn't appear 
>> unless the "right" URL came up.  See this link for a complete 
>> explanation.
>
> Ah, that would explain why I don't see this behavior - one of the 
> first changes I make in my installations is to remove regex-urlfilter 
> and replace it with a suitable combination of prefix/suffix-urlfilter, 
> or a custom one ... Of course, we should solve this issue in our code, 
> if possible, but using different urlfilters is a quick workaround.
>
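
To illustrate the workaround mentioned above, here is a rough sketch of prefix/suffix filtering done with plain string comparisons, which cannot backtrack. It is not the Nutch prefix-urlfilter or suffix-urlfilter plugin code; the class name, constructor, and example prefixes and suffixes are all made up for illustration. The only convention borrowed is returning the URL when it is accepted and null when it is rejected.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class PrefixSuffixFilterSketch {
    private final List<String> allowedPrefixes;
    private final List<String> deniedSuffixes;

    PrefixSuffixFilterSketch(List<String> allowedPrefixes, List<String> deniedSuffixes) {
        this.allowedPrefixes = allowedPrefixes;
        this.deniedSuffixes = deniedSuffixes;
    }

    /** Returns the URL if it passes both checks, or null if it is rejected. */
    String filter(String url) {
        boolean prefixOk = false;
        for (String prefix : allowedPrefixes) {
            if (url.startsWith(prefix)) {
                prefixOk = true;
                break;
            }
        }
        if (!prefixOk) {
            return null;
        }
        // Compare suffixes case-insensitively so ".GIF" and ".gif" behave alike.
        String lower = url.toLowerCase(Locale.ROOT);
        for (String suffix : deniedSuffixes) {
            if (lower.endsWith(suffix)) {
                return null;
            }
        }
        return url;
    }

    public static void main(String[] args) {
        PrefixSuffixFilterSketch filter = new PrefixSuffixFilterSketch(
                Arrays.asList("http://example.com/"),   // made-up allowed prefix
                Arrays.asList(".gif", ".jpg"));         // made-up denied suffixes
        System.out.println(filter.filter("http://example.com/page.html"));
    }
}
```

Each check is linear in the length of the URL, so no single URL can stall the run the way a backtracking pattern can. The trade-off is that this cannot express the repeated-path-segment rule that the regular expression was written to catch.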


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
