Hi,

Based on the suggestions/code from https://issues.apache.org/jira/browse/NUTCH-585, I have created a plugin to blacklist or whitelist HTML elements. It grew out of the need to keep header/footer/navigation out of the index, so that the user gets only relevant results and a page does not match merely because the search term shows up in its navigation.

The elements to be included or excluded are defined with CSS-like selectors. The plugin adds a new index field called "strippedContent", which can be used for searching. Links are still crawled and parsed from the "content" field, so all pages continue to be reached and parsed. The full documentation is in the README.txt in the patch.
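
To illustrate the blacklist idea only, here is a minimal sketch in Python. It is not code from the patch, and the selector syntax it accepts ("tag", "#id", ".class", "tag#id", "tag.class") is an assumption made for this example; the README.txt in the patch describes the selectors the plugin really supports. The names matches, BlacklistStripper and stripped_content are hypothetical, and the sketch assumes reasonably well-nested HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def matches(selector, tag, attrs):
    """Hypothetical mini-selector: 'tag', '#id', '.class', 'tag#id' or 'tag.class'."""
    for sep, key in (("#", "id"), (".", "class")):
        if sep in selector:
            name, value = selector.split(sep, 1)
            if name and name != tag:
                return False
            return value in (attrs.get(key) or "").split()
    return selector == tag


class BlacklistStripper(HTMLParser):
    """Collects text, skipping anything inside a blacklisted element."""

    def __init__(self, blacklist):
        super().__init__()
        self.blacklist = blacklist
        self.stack = []   # one flag per open element: True if it is being skipped
        self.chunks = []  # text that survives the blacklist

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        inherited = bool(self.stack) and self.stack[-1]
        own = any(matches(s, tag, attrs) for s in self.blacklist)
        self.stack.append(inherited or own)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not (self.stack and self.stack[-1]):
            self.chunks.append(data)


def stripped_content(html, blacklist):
    """Return the page text with blacklisted elements removed (assumed behaviour)."""
    parser = BlacklistStripper(blacklist)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


page = ('<html><body><div id="header">Site Nav</div>'
        '<p>Real article text</p>'
        '<div class="footer">Copyright</div></body></html>')
print(stripped_content(page, ["div#header", ".footer"]))
```

A whitelist would work the other way round: keep text only while inside a matching element. In either mode the untouched page would still go into "content", so link extraction is unaffected.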

The patch can be found at: http://www.scintillation.at/files/nutwe03mnyzwb/blacklist_whitelist_plugin.patch

Maybe it is of help to someone:)
Best,
Elisabeth
