There is a regex-normalize.xml file in the conf directory that lets you manipulate URLs (e.g. remove the string after '#'). Remember to include urlnormalizer-regex in the plugins.include option (nutch-site.xml).
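
To illustrate the kind of rule I mean, here is a minimal Java sketch of a pattern/substitution pair that drops everything from the hash sign onwards. The class name, the method name and the pattern `#.*$` are my own assumptions for illustration, not something taken from the Nutch sources. A rule in regex-normalize.xml would pair a pattern like this with an empty substitution, but check it against your own crawl before relying on it.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: shows the effect of a
// "strip the fragment" normalization rule on a few sample URLs.
final class FragmentStripper {

    // Assumed rule: match from the first '#' to the end of the URL.
    private static final Pattern FRAGMENT = Pattern.compile("#.*$");

    // Replace the matched fragment with an empty string.
    static String stripFragment(String url) {
        return FRAGMENT.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up intranet URLs, for demonstration only.
        String[] urls = {
            "http://intranet.example/page.html#section1",
            "http://intranet.example/page.html#section2",
            "http://intranet.example/page.html"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + stripFragment(url));
        }
    }
}
```

With a rule like this, all three sample URLs normalize to the same string, so the page would be treated as a single URL rather than one per anchor.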

Marcin

On 26 January 2008 at 9:36, Prafulla <[EMAIL PROTECTED]> wrote:
> Hi,
>
> The crawl-urlfilter.txt in conf directory can be used to provide regular
> expressions to control the urls that are crawled. However this will help you
> to ignore urls containing #. I don't think you can ask the crawler to just
> ignore the part of the url after the hash sign by configuring properties,
> you may have to write some code to achieve that
>
> Regards,
> Prafulla
>
> On Jan 26, 2008 1:41 PM, Per Andreas Buer wrote:
>
> > Hi.
> >
> > I'm indexing an intranet and I see some pages are fetched twenty times.
> > There are a lot of anchors used so there are a lot of links like the
> > ones in the subject.
> >
> > Is there some way I can instruct the crawler to discard the part of the
> > url which is after the hash sign? I'm using nutch from trunk a few
> > months back in time.
> >
> > TIA,
> >
> > Per.
