[ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255497#comment-13255497
 ] 

Roberto Gardenier commented on NUTCH-585:
-----------------------------------------

Hello all,

I came across this ticket while researching how to achieve the stated goal: 
blocking certain HTML parts from being indexed.
I understand that this plugin/patch achieves that, but I 
cannot work out the following:

- Will this feature be included in Nutch 1.5 (according to Julien Nioche - 
28/Sep/11 11:24) or in 1.6 (if that is what Markus Jelsma means in his 
comment of 03/Apr/12 12:08)?

- I found out that there was a vote yesterday concerning Nutch 1.5 rc1: 
http://lucene.472066.n3.nabble.com/VOTE-Apache-Nutch-1-5-release-rc-1-td3913604.html
 Is this a reliable source? If so, what are the prospects for releasing this 
version?

- The reason I ask is that I want to use the given plugin, but I can also 
wait if the Nutch 1.5 release date isn't that far away.

It would be great if someone could advise me.
Many thanks in advance.

With kind regards,
Roberto Gardenier
                
> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-585
>                 URL: https://issues.apache.org/jira/browse/NUTCH-585
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>         Environment: All operating systems
>            Reporter: Andrea Spinelli
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: blacklist_whitelist_plugin.patch, 
> nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>
> We are using nutch to index our own web sites; we would like not to index 
> certain parts of our pages, because we know they are not relevant (for 
> instance, there are several links to change the background color) and 
> generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML 
> comments, like
> <!-- START-IGNORE -->
> ... ignored part ...
> <!-- STOP-IGNORE -->
> We feel this might be useful to someone else, maybe factoring out the comment 
> strings as constants in the configuration files (say parser.html.ignore.start 
> and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. Looking forward to any 
> expression of interest - or to an explanation of why what we are doing is 
> plain wrong!
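
For readers skimming this thread, the marker-comment idea in the description above can be sketched roughly as follows. This is only an illustration and is not taken from any of the attached patches: the class name IgnoreBlocks and the method strip are hypothetical, the marker strings are hard-coded where the issue proposes reading them from parser.html.ignore.start and parser.html.ignore.stop, and it works on the raw HTML string rather than inside the parse-html plugin.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or of the attached patches.
final class IgnoreBlocks {
    // Assumed defaults; the issue suggests making these configurable via
    // parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml.
    static final String START = "<!-- START-IGNORE -->";
    static final String STOP = "<!-- STOP-IGNORE -->";

    /** Removes every start..stop span, markers included, from the raw HTML. */
    static String strip(String html, String start, String stop) {
        // Quote the markers so they are matched literally; DOTALL lets the
        // reluctant ".*?" span line breaks inside an ignored block.
        Pattern block = Pattern.compile(
                Pattern.quote(start) + ".*?" + Pattern.quote(stop),
                Pattern.DOTALL);
        return block.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>keep</p>" + START
                + "<a href=\"#\">change background</a>" + STOP
                + "<p>also keep</p>";
        System.out.println(strip(page, START, STOP));
    }
}
```

With this sketch, a start marker that has no matching stop marker is left untouched rather than swallowing the rest of the page; whether that is the desired behaviour is a design choice the issue does not settle.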

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
