[ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419128#comment-13419128 ]
Markus Jelsma commented on NUTCH-1434:
--------------------------------------
You're right about the current behaviour, but there is no further problem.
With this patch such documents are never passed to the index, but we still
have to send a delete request because the same page may not have had a
NOINDEX meta tag yesterday and may therefore already be in the index. The
same goes for 404s: we have to delete those too because we don't know whether
we added them to the index before (which is possible).
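
To illustrate the reasoning above, here is a minimal sketch of the decision,
not the code from the attached patch. The class, enum and method names are
hypothetical and do not come from the Nutch code base; the only assumption is
that the indexer knows the fetch status and the robots meta tag content for
each page.

```java
import java.util.Locale;

// Hypothetical sketch: these names are illustrative and are NOT Nutch APIs.
final class IndexActionSketch {

    enum Action { ADD, DELETE }

    /**
     * Decide what to send to the index for one page.
     *
     * @param httpStatus status code of the latest fetch
     * @param robotsMeta content of the robots meta tag, or null if absent
     */
    static Action decide(int httpStatus, String robotsMeta) {
        // A 404 page may have been indexed on an earlier crawl, so delete it.
        if (httpStatus == 404) {
            return Action.DELETE;
        }
        // A noindex page may have lacked the tag yesterday, so delete it
        // as well instead of merely skipping it.
        if (robotsMeta != null) {
            for (String token : robotsMeta.split(",")) {
                if (token.trim().toLowerCase(Locale.ROOT).equals("noindex")) {
                    return Action.DELETE;
                }
            }
        }
        return Action.ADD;
    }

    public static void main(String[] args) {
        System.out.println(decide(200, null));
        System.out.println(decide(200, "NOINDEX, nofollow"));
        System.out.println(decide(404, null));
    }
}
```

The point of the sketch is that both cases produce an explicit delete rather
than a silent skip, since the index may still hold a copy from an earlier
crawl.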
Thanks
> Indexer to delete robots noIndex
> --------------------------------
>
> Key: NUTCH-1434
> URL: https://issues.apache.org/jira/browse/NUTCH-1434
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Affects Versions: 1.5.1
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1434-1.6-1.patch
>
>
> Nutch does not treat pages with meta robots="noindex" properly. All it does
> is remove the title and content fields from the parsed data. It does not stop
> those pages from being indexed, nor can it delete existing pages from the
> index if they change.