Elisabeth,

Great. Could you attach your patch to the original issue in JIRA instead and check the box "Grant license to ASF for inclusion in ASF works"?

Julien

On 21 September 2011 16:47, Elisabeth Adler <[email protected]> wrote:
> Hi,
>
> Based on the suggestions/code from
> https://issues.apache.org/jira/browse/NUTCH-585, I have created a plugin
> to blacklist or whitelist html elements. This was based on the need for
> not indexing header/footer/navigation, so the user gets really only
> relevant results, e.g. even if the term shows up in the navigation.
>
> The elements to be parsed (or not) can be defined by using CSS-like
> selectors. A new field called "strippedContent" is available in the index
> which can be used for searching. Links are still crawled and parsed from
> the "content" field, allowing all pages to be parsed. The full
> documentation is in the README.txt in the patch.
>
> The patch can be found on:
> http://www.scintillation.at/files/nutwe03mnyzwb/blacklist_whitelist_plugin.patch
>
> Maybe it is of help to someone :)
> Best,
> Elisabeth

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

