Done. The patch is now attached to https://issues.apache.org/jira/browse/NUTCH-585
Best,
Elisabeth
On 21.09.2011 17:51, Julien Nioche wrote:
Elisabeth,
Great. Could you attach your patch to the original issue in JIRA
instead, and check the box "Grant license to ASF for inclusion in ASF
works"?
Julien
On 21 September 2011 16:47, Elisabeth Adler wrote:
Hi,
Based on the suggestions/code from
https://issues.apache.org/jira/browse/NUTCH-585, I have created a
plugin to blacklist or whitelist HTML elements. It came out of the
need to keep headers, footers and navigation out of the index, so
that the user gets only relevant results and a page is not returned
just because the search term shows up in its navigation.
The elements to be parsed (or not) can be defined by using
CSS-like selectors. A new field called "strippedContent" is
available in the index which can be used for searching. Links are
still crawled and parsed from the "content" field, allowing all
pages to be parsed. The full documentation is in the README.txt in
the patch.
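
To illustrate the blacklist idea, here is a minimal sketch in Python.
It is not the plugin's code (the plugin is in the patch, and its
selector syntax is documented in the README.txt). The selector list,
the helper names and the restriction to plain tag, #id and .class
selectors are all assumptions made for this example, and it expects
well-formed HTML.

```python
from html.parser import HTMLParser

# Hypothetical blacklist of simple CSS-like selectors: tag, #id or .class.
BLACKLIST = ["#header", "#footer", ".navigation"]

# Elements without an end tag; they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def matches(selector, tag, attrs):
    """Return True if a start tag matches one simple selector."""
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    if selector.startswith("."):
        return selector[1:] in (attrs.get("class") or "").split()
    return tag == selector


class StrippingParser(HTMLParser):
    """Collects all text, plus the text outside blacklisted elements."""

    def __init__(self, blacklist):
        super().__init__()
        self.blacklist = blacklist
        self.skip_depth = 0  # > 0 while inside a blacklisted element
        self.content = []    # every text node
        self.stripped = []   # text nodes outside blacklisted elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.skip_depth > 0:
            self.skip_depth += 1  # nested inside a blacklisted element
        elif any(matches(s, tag, attrs) for s in self.blacklist):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.content.append(text)
        if self.skip_depth == 0:
            self.stripped.append(text)


def index_fields(html, blacklist=BLACKLIST):
    """Return the full text and the stripped text as two index fields."""
    parser = StrippingParser(blacklist)
    parser.feed(html)
    parser.close()
    return {
        "content": " ".join(parser.content),
        "strippedContent": " ".join(parser.stripped),
    }
```

Searching would then use "strippedContent", while "content" keeps the
complete text of the page. A whitelist would invert the test and keep
only the text inside matching elements.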
The patch can be found at:
http://www.scintillation.at/files/nutwe03mnyzwb/blacklist_whitelist_plugin.patch
Maybe it is of help to someone :)
Best,
Elisabeth
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com