On Dec 19, 2007, at 10:31 AM, Bolle, Jeffrey F. wrote:
All,
Is there a way to have Nutch (sorry for not being more specific in
terms of the crawler, indexer, parser, etc.) ignore anchor links
internal to the page (but not ignore pages internal to the site)? I
have some pages being indexed, archives of mailing lists, that have a
whole ton of anchors and Nutch re-fetches and re-parses the same page
countless times, each time on the different anchor link. I know there
is the property to ignore internal links, but I want other pages on the
same host to be included, just not self-referencing links within a
page.
In your urlnormalizer regex conf file (regex-normalize.xml) you can
remove the # symbol and everything after it like so:
<!-- remove anchors, who needs em -->
<regex>
  <pattern>\#(.*)</pattern>
  <substitution></substitution>
</regex>
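
To see what that rule does to a URL, here is a small stand-alone sketch in Python. It is not Nutch code: it assumes the normalizer applies the pattern as a plain replace-all substitution, and the example.com URLs and the strip_anchor helper are made up for illustration.

```python
import re

# Same pattern as in the regex-normalize.xml rule above.
ANCHOR_PATTERN = re.compile(r"\#(.*)")


def strip_anchor(url: str) -> str:
    """Drop the '#' and everything after it (hypothetical helper)."""
    return ANCHOR_PATTERN.sub("", url)


# Made-up URLs: two anchors into the same archive page, plus another page.
urls = [
    "http://example.com/archive/msg00001.html#top",
    "http://example.com/archive/msg00001.html#reply-3",
    "http://example.com/archive/msg00002.html",
]

# After normalization the two anchor variants collapse into one URL,
# so the page would only need to be fetched and parsed once.
normalized = {strip_anchor(u) for u in urls}
print(sorted(normalized))
```

URLs without a # pass through unchanged, so other pages on the same host are still crawled as before.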