Re: crawler fetching both http://foo/bar#quux and http://foo/bar#zoo

Prafulla Sat, 26 Jan 2008 00:36:44 -0800

Hi,

The crawl-urlfilter.txt file in the conf directory lets you provide regular
expressions that control which URLs are crawled. However, this will only help
you skip URLs containing # altogether. I don't think you can ask the crawler
to ignore just the part of the URL after the hash sign by configuring
properties; you may have to write some code to achieve that.
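
As a rough idea of what that code might look like, here is a minimal sketch
in plain Java. The class and method names (FragmentStripper, stripFragment)
are made up for illustration and are not part of Nutch; how you hook it into
the crawler is left open and would depend on your Nutch version.

```java
public class FragmentStripper {

    // Hypothetical helper, not a Nutch API: returns the URL without
    // its fragment, i.e. everything from the first '#' onwards is dropped.
    public static String stripFragment(String url) {
        int hash = url.indexOf('#');
        // No '#' present: the URL is returned unchanged.
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        // Both links from the subject line reduce to the same page URL.
        System.out.println(stripFragment("http://foo/bar#quux"));
        System.out.println(stripFragment("http://foo/bar#zoo"));
    }
}
```

With something like this applied before URLs are queued, both links from the
subject would collapse to http://foo/bar, so the page should only be fetched
once.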


Regards,
Prafulla

On Jan 26, 2008 1:41 PM, Per Andreas Buer <[EMAIL PROTECTED]> wrote:

> Hi.
>
> I'm indexing an intranet and I see some pages are fetched twenty times.
> There are a lot of anchors used so there are a lot of links like the
> ones in the subject.
>
> Is there some way I can instruct the crawler to discard the part of the
> url which is after the hash sign? I'm using nutch from trunk a few
> months back in time.
>
> TIA,
>
>
> Per.
>

Re: crawler fetching both http://foo/bar#quux and http://foo/bar#zoo