Extension of NUTCH-585 - blacklist whitelist plugin

Elisabeth Adler Wed, 21 Sep 2011 08:48:19 -0700

Hi,

Based on the suggestions/code from https://issues.apache.org/jira/browse/NUTCH-585, I have created a plugin to blacklist or whitelist HTML elements. The motivation was to keep headers, footers and navigation out of the index, so that the user gets only relevant results: a page should not match a search term merely because the term shows up in its navigation.

The elements to be parsed (or not) can be defined using CSS-like selectors. A new field called "strippedContent" is available in the index and can be used for searching. Links are still crawled and parsed from the "content" field, allowing all pages to be parsed. The full documentation is in the README.txt in the patch.
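
To illustrate the idea (this is not the plugin's code, which is Java and is documented in the README.txt of the patch), here is a minimal Python sketch of blacklisting elements with simplified CSS-like selectors. The selector forms supported here ("tag", "#id", ".class", "tag#id", "tag.class"), the function names and the sample HTML are all assumptions made for the example, and the sketch assumes well-formed HTML with every non-void tag closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def matches(selector, tag, attrs):
    """Match one simplified selector (hypothetical syntax, see lead-in)."""
    if "#" in selector:
        name, _, ident = selector.partition("#")
        return (not name or name == tag) and attrs.get("id") == ident
    if "." in selector:
        name, _, cls = selector.partition(".")
        classes = (attrs.get("class") or "").split()
        return (not name or name == tag) and cls in classes
    return selector == tag


class BlacklistStripper(HTMLParser):
    """Collects text, skipping everything inside blacklisted elements."""

    def __init__(self, blacklist):
        super().__init__(convert_charrefs=True)
        self.blacklist = blacklist
        self.skip_depth = 0   # > 0 while inside a blacklisted element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1          # nested element inside a skipped one
        elif any(matches(s, tag, dict(attrs)) for s in self.blacklist):
            self.skip_depth = 1           # start skipping at this element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def stripped_content(html, blacklist):
    """Return the page text with blacklisted elements removed."""
    parser = BlacklistStripper(blacklist)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ('<html><body><div id="header">Site Nav</div>'
        '<div class="main content"><p>Real <b>text</b></p></div>'
        '<div id="footer">Copyright<br>2011</div></body></html>')
print(stripped_content(page, ["div#header", "#footer"]))  # prints "Real text"
```

A whitelist would work the other way round: collect text only while inside a matching element. In either mode the untouched "content" field would still be used for link extraction, as described above.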

The patch can be found at: http://www.scintillation.at/files/nutwe03mnyzwb/blacklist_whitelist_plugin.patch


Maybe it is of help to someone :)
Best,
Elisabeth
