Adding this rule to your conf/regex-normalize.xml should strip the anchor
(the hash sign and everything after it) from the URLs:

<regex>
  <pattern>\#(.*)</pattern>
  <substitution></substitution>
</regex>
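
If you want to see what the rule does before recrawling, here is a rough
sketch in Python that applies the same pattern and empty substitution to a
couple of URLs. Nutch itself runs the pattern through Java's regex engine, so
this is only an illustration; the function name strip_anchor and the
intranet.example URLs are made up for the example.

```python
import re

# Same pattern and (empty) substitution as the <regex> rule above.
ANCHOR = re.compile(r"\#(.*)")


def strip_anchor(url: str) -> str:
    """Remove the first '#' and everything that follows it."""
    return ANCHOR.sub("", url)


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for url in (
        "http://intranet.example/page.html#section2",
        "http://intranet.example/page.html",
    ):
        print(strip_anchor(url))
```

URLs that differ only in the anchor come out identical, so the crawler should
treat them as one page instead of fetching each variant.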

Regards,
Siddhartha

On Jan 26, 2008 1:41 PM, Per Andreas Buer <[EMAIL PROTECTED]> wrote:

> Hi.
>
> I'm indexing an intranet and I see some pages are fetched twenty times.
> There are a lot of anchors used so there are a lot of links like the
> ones in the subject.
>
> Is there some way I can instruct the crawler to discard the part of the
> URL that comes after the hash sign? I'm using Nutch from trunk, checked
> out a few months ago.
>
> TIA,
>
>
> Per.
>
