Thanks! I guess you mean this rule in conf/regex-urlfilter.txt:

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

Am I wrong? The DomContentUtils code under /nutch/trunk/src/java/org/apache/nutch/parse/*.java is a bit confusing to me, and I cannot see the recursion "protection" code there.

Thanks!

On Mon, Jun 30, 2008 at 12:21 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> There are some regexes in the url normalizers and there is some code in
> DomContentUtils for recursion.
>
> Dennis
>
> brainstorm wrote:
>>
>> Hi!
>>
>> I guess it is implemented, but cannot find it by myself on nutch API
>> docs nor wiki :-/ ... Is there any mechanism implemented in nutch to
>> detect spider traps[1] ?
>>
>> Thanks,
>> Roman
>>
>> [1] http://en.wikipedia.org/wiki/Spider_trap
>
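
To check my reading of the loop rule quoted at the top of this message, here is a minimal Python sketch of what it seems to match. It assumes the filter applies the pattern as an unanchored search and that the leading "-" only marks the rule as an exclusion; I have not verified either against the Nutch source. The function name looks_like_loop and the example.com URLs are made up for illustration.

```python
import re

# Pattern copied from the rule, without the leading "-" (assumed to mean "exclude").
# (/[^/]+) captures one path segment; each \1 requires that same segment to
# reappear, with exactly one arbitrary segment in between.
LOOP_PATTERN = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def looks_like_loop(url: str) -> bool:
    """Return True if the URL matches the loop pattern (hypothetical helper)."""
    return LOOP_PATTERN.search(url) is not None


if __name__ == "__main__":
    samples = [
        "http://example.com/foo/bar/foo/bar/foo/bar/",  # /foo appears three times
        "http://example.com/foo/bar/foo/bar/",          # /foo appears only twice
        "http://example.com/docs/guide/index.html",     # no repeated segment
    ]
    for url in samples:
        verdict = "skip" if looks_like_loop(url) else "keep"
        print(f"{verdict}  {url}")
```

If this reading is right, the rule only catches one kind of trap: a path segment that comes back three times with a single segment between each occurrence. Traps that generate ever-changing segments or query strings would pass through it.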
