Done. The patch is now attached to https://issues.apache.org/jira/browse/NUTCH-585
Best,
Elisabeth

On 21.09.2011 17:51, Julien Nioche wrote:
Elisabeth,

Great. Could you attach your patch to the original issue in JIRA instead, and check the box "Grant license to ASF for inclusion in ASF works"?

Julien

On 21 September 2011 16:47, Elisabeth Adler <[email protected]> wrote:

    Hi,

    Based on the suggestions/code from
    https://issues.apache.org/jira/browse/NUTCH-585, I have created a
    plugin to blacklist or whitelist HTML elements. It grew out of the
    need to keep headers, footers and navigation out of the index, so
    that users get only relevant results and a page no longer matches
    just because the search term shows up in its navigation.

    The elements to be parsed (or not) can be defined by using
    CSS-like selectors. A new field called "strippedContent" is
    available in the index which can be used for searching. Links are
    still crawled and parsed from the "content" field, allowing all
    pages to be parsed. The full documentation is in the README.txt in
    the patch.
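
For readers who want a feel for the idea, here is a minimal sketch of blacklist-style stripping. It is not code from the patch: it is written in Python rather than as a Nutch plugin, the names `extract`, `matches` and `BlacklistTextExtractor` are hypothetical, the selector syntax (`tag`, `#id`, `.class`, `tag#id`, `tag.class`) is only an assumption about what "CSS-like" might cover, and it assumes the HTML has balanced tags. It returns the full text alongside the stripped text, loosely mirroring the "content" and "strippedContent" fields described above.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def matches(selector, tag, attrs):
    """Match one simple selector: 'tag', '#id', '.class', 'tag#id' or 'tag.class'."""
    for sep, attr in (("#", "id"), (".", "class")):
        if sep in selector:
            sel_tag, _, value = selector.partition(sep)
            if sel_tag and sel_tag != tag:
                return False
            # 'class' may hold several space-separated names; 'id' holds one.
            return value in (attrs.get(attr) or "").split()
    return selector == tag


class BlacklistTextExtractor(HTMLParser):
    """Collects all text and, separately, text outside blacklisted elements."""

    def __init__(self, blacklist):
        super().__init__()
        self.blacklist = blacklist
        self.skip_depth = 0   # > 0 while inside a blacklisted element
        self.content = []     # every text node
        self.stripped = []    # text nodes outside blacklisted elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # Once inside a blacklisted element, every nested element deepens the skip.
        if self.skip_depth or any(matches(s, tag, attrs) for s in self.blacklist):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.content.append(text)
        if not self.skip_depth:
            self.stripped.append(text)


def extract(html, blacklist):
    """Return (full_text, stripped_text) for the given HTML and selector blacklist."""
    parser = BlacklistTextExtractor(blacklist)
    parser.feed(html)
    parser.close()
    return " ".join(parser.content), " ".join(parser.stripped)
```

Calling `extract(page_html, ["div#header", ".nav", "#footer"])` would give the complete page text plus a version without the header, navigation and footer. A whitelist mode would invert the test and keep only text inside matching elements.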

    The patch can be found at:
    http://www.scintillation.at/files/nutwe03mnyzwb/blacklist_whitelist_plugin.patch

    Maybe it is of help to someone :)
    Best,
    Elisabeth




--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
