[
https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13561701#comment-13561701
]
Santiago M. Mola commented on NUTCH-693:
----------------------------------------
This is completely different from a hypothetical "ignore.robots.txt" option.
"robots.txt" is controlled by the site owner, and it tells us explicitly not
to access/index some parts of the website. rel=nofollow is usually controlled
by third parties and is not supposed to restrict crawling. It is just for
preventing the link from counting toward link scoring algorithms (or, as Andrew
put it, non-endorsement).
But what is more important: What happens when your seeds use rel=nofollow?
Then Nutch cannot crawl anything. For example, most MediaWiki setups include
rel=nofollow for all external links. That means that, if you need to use a
MediaWiki-based site as a seed, Nutch will not be able to extract links for
further crawling.
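
To illustrate the seed problem, here is a minimal sketch in Python. It is not
Nutch's parser and not the attached patch: the class OutlinkExtractor, the
function extract_outlinks, the flag name ignore_nofollow and the example.org
URLs are all made up for illustration. It only shows how a page whose links
all carry rel=nofollow yields no outlinks unless the extractor is told to
ignore the attribute.

```python
from html.parser import HTMLParser


class OutlinkExtractor(HTMLParser):
    """Collects href targets of <a> tags (hypothetical helper, not Nutch code)."""

    def __init__(self, ignore_nofollow=False):
        super().__init__()
        # When True, links marked rel=nofollow are still returned for crawling.
        self.ignore_nofollow = ignore_nofollow
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel is a space-separated token list; a valueless rel comes back as None.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens and not self.ignore_nofollow:
            return
        self.outlinks.append(href)


def extract_outlinks(html, ignore_nofollow=False):
    """Return the list of link targets found in an HTML string."""
    parser = OutlinkExtractor(ignore_nofollow)
    parser.feed(html)
    parser.close()
    return parser.outlinks


# A seed page in the style described above: every link is marked nofollow.
SEED_PAGE = (
    '<p><a rel="nofollow" href="http://example.org/a">A</a> '
    '<a rel="nofollow" href="http://example.org/b">B</a></p>'
)

if __name__ == "__main__":
    print(extract_outlinks(SEED_PAGE))
    print(extract_outlinks(SEED_PAGE, ignore_nofollow=True))
```

With the flag off, the crawl frontier for such a seed stays empty. With it on,
both links become candidates, and whether they should also count for scoring
is a separate decision.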
> Add configurable option for treating nofollow behaviour.
> --------------------------------------------------------
>
> Key: NUTCH-693
> URL: https://issues.apache.org/jira/browse/NUTCH-693
> Project: Nutch
> Issue Type: New Feature
> Reporter: Andrew McCall
> Priority: Minor
> Attachments: nutch.nofollow.patch
>
>
> For my purposes I'd like to follow links even if they're marked nofollow.
> Ideally I'd like to follow them, but not pass the link juice between them.
> I've attached a patch that adds a configuration element
> parser.html.outlinks.ignore_nofollow which allows the parser to ignore the
> nofollow elements on a page.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira