[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

JIRA Sat, 17 Sep 2011 17:00:46 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107294#comment-13107294 ]


Rui Araújo commented on NUTCH-585:
----------------------------------

I can also confirm that the patch works on Nutch 1.3.

However, it didn't work for my use case, as I need to filter a diverse set of
tags based on different attributes. I also needed the links inside the filtered
area to still be extracted, which the patch did not do.

So I altered Hira's patch and I am publishing my work here.

This is the changed property:
{code:xml} 
<property>
  <name>parser.html.NodesToExclude</name>
  <value>table;summary;header|div;id;navigation</value>
  <description>
  A list of nodes whose content will not be indexed, separated by "|". Use this
  to tell the HTML parser to ignore, for example, site navigation text.
  Each node has three elements separated by ";": the first is the tag name, the
  second the attribute name, and the third the value of the attribute.
  Note that nodes with these attributes, and their children, will be silently
  ignored by the parser, so verify the indexed content with Luke to confirm results.
  </description>
</property>
{code} 
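
To make the value format concrete, here is a minimal sketch of how such a property value could be split into tag/attribute/value rules. It is not taken from the attached patch: the class name NodeExcludeRules, the nested Rule type, the case-insensitive comparison of tag and attribute names, and the choice to skip malformed entries are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: parses a value such as "table;summary;header|div;id;navigation". */
final class NodeExcludeRules {

    /** One exclusion rule: tag name, attribute name, attribute value. */
    static final class Rule {
        final String tag;
        final String attribute;
        final String value;

        Rule(String tag, String attribute, String value) {
            this.tag = tag;
            this.attribute = attribute;
            this.value = value;
        }

        /**
         * True when the given element matches this rule. Tag and attribute
         * names are compared case-insensitively (an assumption); the
         * attribute value must match exactly.
         */
        boolean matches(String tagName, String attrName, String attrValue) {
            return tag.equalsIgnoreCase(tagName)
                    && attribute.equalsIgnoreCase(attrName)
                    && value.equals(attrValue);
        }
    }

    /** Splits the property value on "|" into entries and each entry on ";" into three fields. */
    static List<Rule> parse(String propertyValue) {
        List<Rule> rules = new ArrayList<>();
        if (propertyValue == null || propertyValue.trim().isEmpty()) {
            return rules;
        }
        for (String entry : propertyValue.split("\\|")) {
            String[] parts = entry.trim().split(";");
            if (parts.length != 3) {
                continue; // assumption: entries without exactly three fields are skipped
            }
            rules.add(new Rule(parts[0].trim(), parts[1].trim(), parts[2].trim()));
        }
        return rules;
    }
}
```

With the example value above, parse should yield two rules: one for table elements whose summary attribute is "header", and one for div elements whose id attribute is "navigation". A parser would then call matches for each element it visits and skip the text of matching nodes and their children.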

I really think this should be present in Nutch. I am available to improve the
patch until it is ready for inclusion. I would also welcome comments on how I
implemented my improvements.

Thanks,
Rui

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-585
>                 URL: https://issues.apache.org/jira/browse/NUTCH-585
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>         Environment: All operating systems
>            Reporter: Andrea Spinelli
>            Priority: Minor
>         Attachments: nutch-585-excludeNodes.patch, 
> nutch-585-jostens-excludeDIVs.patch
>
>
> We are using nutch to index our own web sites; we would like not to index 
> certain parts of our pages, because we know they are not relevant (for 
> instance, there are several links to change the background color) and 
> generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML 
> comments, like
> <!-- START-IGNORE -->
> ... ignored part ...
> <!-- STOP-IGNORE -->
> We feel this might be useful to someone else, perhaps with the comment
> strings factored out as properties in the configuration files (say
> parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. We look forward to any
> expression of interest, or to an explanation of why what we are doing is
> plain wrong!
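
For comparison, the comment-delimited approach from the quoted issue description can be sketched at the string level as follows. This is only an illustration under stated assumptions: the class IgnoreMarkers and its strip method are hypothetical, it works on the raw HTML text rather than on the parsed DOM, and it is not the reporter's actual code. The start and stop strings are passed in as parameters, in the spirit of the suggested parser.html.ignore.start and parser.html.ignore.stop properties.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: removes everything between a start marker and a stop marker. */
final class IgnoreMarkers {

    /**
     * Returns the HTML with every start...stop region removed, markers included.
     * Matching is non-greedy, so several ignored regions in one page are
     * removed independently. A start marker with no matching stop marker
     * is left untouched.
     */
    static String strip(String html, String start, String stop) {
        Pattern region = Pattern.compile(
                Pattern.quote(start) + ".*?" + Pattern.quote(stop),
                Pattern.DOTALL); // DOTALL lets a region span multiple lines
        return region.matcher(html).replaceAll("");
    }
}
```

Note that removing the region from the raw text also removes any links inside it, which is the behaviour the comment above wanted to avoid for its own use case.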

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
