Adding this to your conf/regex-normalize.xml should remove the anchor from the URLs:

<regex>
  <pattern>\#(.*)</pattern>
  <substitution></substitution>
</regex>

Regards,
Siddhartha

On Jan 26, 2008 1:41 PM, Per Andreas Buer <[EMAIL PROTECTED]> wrote:
> Hi.
>
> I'm indexing an intranet and I see some pages are fetched twenty times.
> There are a lot of anchors used so there are a lot of links like the
> ones in the subject.
>
> Is there some way I can instruct the crawler to discard the part of the
> url which is after the hash sign? I'm using nutch from trunk a few
> months back in time.
>
> TIA,
>
> Per.
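
P.S. If you want to check what the rule does before re-crawling, here is a small sketch. It assumes the normalizer applies the pattern with ordinary java.util.regex semantics and replaces every match with the (empty) substitution. The class name, method name, and example URLs are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: it only mimics the effect of the
// <regex> rule by applying the same pattern with an empty substitution.
class AnchorStripDemo {
    // Same pattern as in the rule: a '#' followed by the rest of the URL.
    private static final Pattern ANCHOR = Pattern.compile("\\#(.*)");

    // Returns the URL with the '#' and everything after it removed.
    static String stripAnchor(String url) {
        return ANCHOR.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // Example URLs are invented for this sketch.
        System.out.println(stripAnchor("http://intranet.example/page.html#section2"));
        System.out.println(stripAnchor("http://intranet.example/page.html"));
    }
}
```

Both calls should print the same URL, which is why the page would then be fetched only once instead of once per anchor.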
