[
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107294#comment-13107294
]
Rui Araújo commented on NUTCH-585:
----------------------------------
I can also confirm that the patch works on Nutch 1.3.
However, it didn't work for my use case, as I need to filter a diverse set of
tags based on different attributes. I also needed the links from the filtered
area to still be extracted, which did not happen.
So I altered Hira's patch and I am publishing my work here.
This is the new, changed property:
{code:xml}
<property>
<name>parser.html.NodesToExclude</name>
<value>table;summary;header|div;id;navigation</value>
<description>
A list of nodes whose content will not be indexed, separated by "|". Use this
to tell the HTML parser to ignore, for example, site navigation text.
Each node has three elements separated by ";": the first is the tag name, the
second the attribute name, the third the value of the attribute.
Note that nodes with these attributes, and their children, will be silently
ignored by the parser, so verify the indexed content with Luke to confirm
results.
</description>
</property>
{code}
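To make the value format concrete, here is a rough sketch of how such a value could be split into tag/attribute/value rules and checked against an element. This is not the code from the attached patch: the class `NodeExclusionSketch`, the `Rule` type, and the `parse` and `isExcluded` methods are hypothetical names used only for illustration, and the case-insensitive tag match is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names and matching rules are assumptions,
// not taken from the NUTCH-585 patch.
class NodeExclusionSketch {

    /** One exclusion rule: tag name, attribute name, attribute value. */
    static final class Rule {
        final String tag;
        final String attribute;
        final String value;

        Rule(String tag, String attribute, String value) {
            this.tag = tag;
            this.attribute = attribute;
            this.value = value;
        }

        /** True if an element with this tag and these attributes matches the rule. */
        boolean matches(String elementTag, Map<String, String> attributes) {
            // Assumption: tag names compare case-insensitively, values exactly.
            return tag.equalsIgnoreCase(elementTag)
                && value.equals(attributes.get(attribute));
        }
    }

    /** Splits "tag;attr;value|tag;attr;value" into a list of rules. */
    static List<Rule> parse(String propertyValue) {
        List<Rule> rules = new ArrayList<>();
        if (propertyValue == null || propertyValue.trim().isEmpty()) {
            return rules; // nothing configured, nothing excluded
        }
        for (String entry : propertyValue.split("\\|")) {
            String[] parts = entry.trim().split(";");
            if (parts.length != 3) {
                throw new IllegalArgumentException(
                    "Expected tag;attribute;value but got: " + entry);
            }
            rules.add(new Rule(parts[0].trim(), parts[1].trim(), parts[2].trim()));
        }
        return rules;
    }

    /** True if any rule matches the given element. */
    static boolean isExcluded(List<Rule> rules, String tag, Map<String, String> attributes) {
        for (Rule rule : rules) {
            if (rule.matches(tag, attributes)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = parse("table;summary;header|div;id;navigation");
        System.out.println(isExcluded(rules, "div", Map.of("id", "navigation"))); // prints true
        System.out.println(isExcluded(rules, "div", Map.of("id", "content")));    // prints false
    }
}
```

With the example value above, a `<div id="navigation">` and a `<table summary="header">` would be skipped, while a `<div id="content">` would still be indexed.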
I really think this should be present in Nutch. I am available to improve the
patch until it is ready for inclusion. I would also welcome comments on how I
implemented my improvements.
Thanks,
Rui
> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> -----------------------------------------------------------------------
>
> Key: NUTCH-585
> URL: https://issues.apache.org/jira/browse/NUTCH-585
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Environment: All operating systems
> Reporter: Andrea Spinelli
> Priority: Minor
> Attachments: nutch-585-excludeNodes.patch,
> nutch-585-jostens-excludeDIVs.patch
>
>
> We are using nutch to index our own web sites; we would like not to index
> certain parts of our pages, because we know they are not relevant (for
> instance, there are several links to change the background color) and
> generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML
> comments, like
> <!-- START-IGNORE -->
> ... ignored part ...
> <!-- STOP-IGNORE -->
> We feel this might be useful to someone else, maybe factorizing the comment
> strings as constants in the configuration files (say parser.html.ignore.start
> and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. Looking forward to any
> expression of interest, or to an explanation of why what we are doing is
> plain wrong!