Marcin Okraszewski Sat, 26 Jan 2008 06:31:45 -0800

There is a regex-normalize.xml file in the conf directory that lets you
rewrite URLs (e.g. remove everything after the '#'). Remember to include
urlnormalizer-regex in the plugins.include option (nutch-site.xml).
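
A rule along these lines in regex-normalize.xml should do it. This is only a
sketch: the pattern is my own guess at a minimal rule, and I am assuming the
usual regex-normalize / regex / pattern / substitution layout, so compare it
with the file shipped in your version before relying on it.

```xml
<regex-normalize>
  <!-- Assumed rule: drop the fragment, i.e. everything from '#' to the end
       of the URL, so http://foo/bar#quux and http://foo/bar#zoo both
       become http://foo/bar -->
  <regex>
    <pattern>#.*$</pattern>
    <substitution></substitution>
  </regex>
</regex-normalize>
```

The rule only takes effect if urlnormalizer-regex is matched by the
plugins.include value in nutch-site.xml.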


Marcin


On 26 January 2008 at 9:36, Prafulla <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> The crawl-urlfilter.txt file in the conf directory takes regular
> expressions that control which URLs are crawled. However, that will only
> let you ignore URLs containing '#' altogether. I don't think you can make
> the crawler drop just the part of the URL after the hash sign by
> configuring properties; you may have to write some code to achieve that.
> 
> Regards,
> Prafulla
> 
> On Jan 26, 2008 1:41 PM, Per Andreas Buer wrote:
> 
> > Hi.
> >
> > I'm indexing an intranet and I see some pages are fetched twenty times.
> > There are a lot of anchors used so there are a lot of links like the
> > ones in the subject.
> >
> > Is there some way I can instruct the crawler to discard the part of the
> > URL that comes after the hash sign? I'm using Nutch from trunk as of a
> > few months ago.
> >
> > TIA,
> >
> >
> > Per.
> >
>

Re: crawler fetching both http://foo/bar#quux and http://foo/bar#zoo
