Hi,
Based on the suggestions/code from
https://issues.apache.org/jira/browse/NUTCH-585, I have created a plugin
to blacklist or whitelist HTML elements. This was based on the need for
not indexing headers, footers and navigation, so that users get only
relevant results; a page should not match just because the search term
shows up in its navigation.
The elements to be included or excluded are defined with CSS-like
selectors. A new field called "strippedContent" is added to the index
and can be used for searching. Links are still extracted from the
"content" field and crawled, so all pages continue to be parsed. The
full documentation is in the README.txt in the patch.
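
To give a rough idea of the blacklisting part, here is a minimal standalone sketch. It is not the code from the patch: the class and method names (BlacklistSketch, matches, collectText, strippedContent) are made up for illustration, it only understands the simplest selector forms ("tag", "tag#id", "tag.class"), and it assumes well-formed XHTML input parsed with the JDK's DOM parser rather than Nutch's own parse pipeline. See the README.txt for the selector syntax the plugin really supports.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class BlacklistSketch {

    // Hypothetical matcher for a small subset of CSS-like selectors:
    // "tag", "tag#id" and "tag.class" (the tag part may be empty).
    static boolean matches(Element el, String selector) {
        String tag = selector;
        String id = null;
        String cls = null;
        int hash = selector.indexOf('#');
        int dot = selector.indexOf('.');
        if (hash >= 0) {
            tag = selector.substring(0, hash);
            id = selector.substring(hash + 1);
        } else if (dot >= 0) {
            tag = selector.substring(0, dot);
            cls = selector.substring(dot + 1);
        }
        if (!tag.isEmpty() && !tag.equalsIgnoreCase(el.getTagName())) {
            return false;
        }
        if (id != null && !id.equals(el.getAttribute("id"))) {
            return false;
        }
        if (cls != null) {
            // The class attribute may hold several space-separated names.
            List<String> classes = Arrays.asList(el.getAttribute("class").split("\\s+"));
            return classes.contains(cls);
        }
        return true;
    }

    // Walks the DOM depth-first and appends text nodes to 'out',
    // skipping every subtree whose root matches a blacklisted selector.
    static void collectText(Node node, List<String> blacklist, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            for (String selector : blacklist) {
                if (matches(el, selector)) {
                    return; // blacklisted: drop this element and its children
                }
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, blacklist, out);
        }
    }

    // Assumption: the input is well-formed XHTML; real-world HTML would
    // need a lenient HTML parser in front of this step.
    static String strippedContent(String xhtml, List<String> blacklist) throws Exception {
        Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        StringBuilder out = new StringBuilder();
        collectText(root, blacklist, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div id=\"nav\">Home Products</div>"
                + "<div class=\"main\"><p>Apache Nutch crawls the web.</p></div>"
                + "<div id=\"footer\">Contact</div>"
                + "</body></html>";
        // Only the text outside the blacklisted elements is printed.
        System.out.println(strippedContent(page, Arrays.asList("div#nav", "div#footer")));
    }
}
```

In this sketch the navigation and footer text never reach the returned string, which is the kind of text that would go into a field like "strippedContent", while the full page would still be available for link extraction. Whitelisting would be the inverse: only collect text below elements that match a selector.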
The patch can be found at:
http://www.scintillation.at/files/nutwe03mnyzwb/blacklist_whitelist_plugin.patch
Maybe it is of help to someone :)
Best,
Elisabeth