Hi,

This reminds me of a saying:
If you have a problem and you use a regular expression, then you have
two problems...

I am no regex guru, but I think those two expressions are different
because of the first question mark (lazy repetition). Depending on the
given string, they should behave differently.
Try to give some examples of the strings you want to check.
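
For instance, here is a minimal sketch of such a check. It rests on a few
assumptions of mine: it uses Python's re module as a stand-in for
java.util.regex (as far as I know, lazy quantifiers and \1 backreferences
behave the same way in both), it drops the leading "-" (which I take to be
the filter file's exclude marker, not part of the regex), it uses search()
on the guess that the filter looks for the pattern anywhere in the URL, and
the sample URLs and the helper name repeated_segment are made up for
illustration.

```python
import re

# The two patterns from the quoted mail, without the leading "-".
ORIGINAL = re.compile(r".*(/.+?)/.*?\1/.*?\1/")   # greedy prefix
CHANGED = re.compile(r".*?(/.+?)/.*?\1/.*?\1/")   # lazy prefix


def repeated_segment(pattern, url):
    """Return what group 1 captured, or None if the pattern is not found."""
    match = pattern.search(url)
    return match.group(1) if match else None


# Hypothetical sample URLs, chosen only to exercise the patterns.
SAMPLES = [
    "http://example.com/a/b/c/",           # no segment repeats three times
    "http://example.com/a/b/a/c/a/",       # "/a" repeats three times
    "http://x.org/a/1/a/2/a/b/1/b/2/b/",   # both "/a" and "/b" repeat
]

if __name__ == "__main__":
    for url in SAMPLES:
        print(url)
        print("  original:", repeated_segment(ORIGINAL, url))
        print("  changed: ", repeated_segment(CHANGED, url))
```

On these samples both patterns agree on whether a URL matches at all; they
differ only in which repeated segment group 1 picks up when there is more
than one candidate (the greedy prefix ends up at the rightmost one, the
lazy prefix at the leftmost). This says nothing about the hang itself,
which would have to be checked on the JVM with the real URLs.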

Regards,
Lukas

On 6/10/06, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> The thread dumps pointed me to the Regex URL Filter and greedy pattern
> matching.  It seems that there is a standing "error" in the JVM where
> the "wrong" regular expression will cause the program to hang and the
> cpu to go to 100%.  Basically the behaviors that we are seeing.  And
> this would make sense as this error wouldn't appear unless the "right"
> url came up.  See this link for a complete explanation.
>
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6393051
>
> After reviewing the regular expressions in the regex-urlfilter.txt file,
> here is what I think needs to be changed.
>
> this: -.*(/.+?)/.*?\1/.*?\1/
> changed to: -.*?(/.+?)/.*?\1/.*?\1/
>
> I am currently testing this to see if it runs correctly without stalling
> as before.  Problem is that I am not a regular expressions expert.  Will
> changing this regex affect this expression in a negative way?
>
> Dennis
>
> Dennis Kubes wrote:
> > I will start taking a look at some thread dumps.  It is not the
> > sorting phase.  It gets past the sort and gets through part of the
> > reduce phase (and always the same percentage; when the job is restarted
> > on the same machine it gets to the same part again before stalling
> > again).  And this is happening on multiple machines so I do not think
> > it is a machine problem.  Again I need to spend some time looking through
> > thread dumps.
> >
> > Dennis
> >
> > Andrzej Bialecki wrote:
> >> Dennis Kubes wrote:
> >>> Do you think it is the parsing that is causing it?
> >>
> >> Just checking ... probably not. You could figure out from a thread
> >> dump where it's spending time.
> >>
> >>
> >>> I was looking at a smaller fetching run and the cpu gets pushed to
> >>> 100% as well but the reports keep happening.  This only seems to
> >>> happen when I run very large fetches (> 500K pages).  I just ran a
> >>> 100K fetch and it worked just fine.  Should I have some special
> >>> settings for larger fetches?
> >>
> >> You could try tweaking the io.sort values, if it times out during the
> >> sorting phase.
> >>
>


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
