Re: Nutch spider trap detection

brainstorm Thu, 03 Jul 2008 07:58:50 -0700

Thanks! I guess you mean this rule:

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/


That is in conf/regex-urlfilter.txt, or am I wrong?
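
To see what that rule catches, here is a minimal Python sketch of the same pattern. Nutch applies the rule through Java regexes, so this only approximates its behaviour. The helper name looks_like_trap and the example.com URLs are made up for illustration, and the sketch assumes the pattern is searched for anywhere in the URL.

```python
import re

# Pattern copied from the rule above, minus the leading "-", which in
# regex-urlfilter.txt marks the rule as an exclusion.
TRAP_PATTERN = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def looks_like_trap(url: str) -> bool:
    """Hypothetical helper: True if the pattern occurs anywhere in the URL."""
    return TRAP_PATTERN.search(url) is not None


if __name__ == "__main__":
    # Made-up sample URLs: one looping path, one ordinary path.
    for url in (
        "http://example.com/a/b/a/b/a/",
        "http://example.com/a/b/c/d/",
    ):
        print(url, looks_like_trap(url))
```

Since the pattern ends with "/", a URL whose third repeat is the last segment with no trailing slash would not be caught by it.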

The DomContentUtils code under
/nutch/trunk/src/java/org/apache/nutch/parse/*.java is a bit confusing
to me, and I cannot find the recursion "protection" code in it.

Thanks!

On Mon, Jun 30, 2008 at 12:21 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> There are some regexes in the url normalizers and there is some code in
> DomContentUtils for recursion.
>
> Dennis
>
> brainstorm wrote:
>>
>> Hi!
>>
>> I guess it is implemented, but cannot find it by myself on nutch API
>> docs nor wiki :-/ ... Is there any mechanism implemented in nutch to
>> detect spider traps[1] ?
>>
>> Thanks,
>> Roman
>>
>> [1] http://en.wikipedia.org/wiki/Spider_trap
>
