I just completed a 500K run with no problems. To get it to work I had to comment out the -.*(/.+?)/.*?\1/.*?\1/ filter. Even the variant -.*?(/.+?)/.*?\1/.*?\1/ would stall it. The good news is that the stall is consistent, so although it will take me a while, I can narrow down which URL was causing it.

I don't think the other patterns in the default regex-urlfilter file will cause any problems because they are not greedy, but I would suggest that we either change this regular expression or remove it altogether from the default install.
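For anyone who wants to keep the loop filter while the offending URL is tracked down, here is a minimal sketch of one common workaround: wrap the URL in a CharSequence that gives up after a time budget, so a backtracking-heavy match throws instead of spinning at 100% CPU. This is not what Nutch does today. The names LoopFilterSketch, TimedSequence and looksLikeLoop are made up for illustration, and using find() with a millisecond budget is an assumption of the sketch.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: applies the loop-detection
// pattern from this thread with a time budget on the match.
final class LoopFilterSketch {

    // The pattern from the post, without the leading '-' that marks an
    // exclude rule in the filter file.
    static final Pattern LOOP = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    // CharSequence wrapper that throws once the deadline has passed, so the
    // regex engine is interrupted the next time it reads a character.
    static final class TimedSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        TimedSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // Subtraction form stays correct even if nanoTime wraps around.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new IllegalStateException("regex time budget exceeded");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new TimedSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    // Returns true if the URL contains a slash-delimited piece that repeats
    // three times. Throws IllegalStateException if the match runs past
    // budgetMillis.
    static boolean looksLikeLoop(String url, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        return LOOP.matcher(new TimedSequence(url, deadline)).find();
    }
}
```

A caller would catch the IllegalStateException and decide whether to reject the URL or let it through. Logging the URL at that point would also identify the input that triggers the stall.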
Dennis

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> The thread dumps pointed me to the Regex URL Filter and greedy
>> pattern matching. It seems that there is a standing "error" in the
>> JVM where the "wrong" regular expression will cause the program to
>> hang and the cpu to go to 100%. Basically the behaviors that we are
>> seeing. And this would make sense as this error wouldn't appear
>> unless the "right" url came up. See this link for a complete
>> explanation.
>
> Ah, that would explain why I don't see this behavior - one of the
> first changes I do in my installations is to remove regex-urlfilter
> and replace it with a suitable combination of prefix/suffix-urlfilter,
> or a custom one ... Of course, we should solve this issue in our code,
> if possible, but using different urlfilters is a quick workaround.
