[jira] [Commented] (NUTCH-693) Add configurable option for treating nofollow behaviour.

Santiago M. Mola (JIRA) Thu, 24 Jan 2013 08:01:19 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13561701#comment-13561701
 ]


Santiago M. Mola commented on NUTCH-693:
----------------------------------------

This is completely different from a hypothetical "ignore.robots.txt" option. 
"robots.txt" is controlled by the site owner, and it tells us explicitly not 
to access/index some parts of the website. rel=nofollow is usually controlled 
by third parties and is not supposed to restrict crawling. It only keeps the 
link from counting in link scoring algorithms (or, as Andrew put it, 
non-endorsement).

But what is more important: what happens when your seeds use rel=nofollow? 
Then Nutch cannot crawl anything. For example, most MediaWiki setups include 
rel=nofollow for all external links. That means that, if you need to use a 
MediaWiki-based site as a seed, Nutch will not be able to extract links for 
further crawling.
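
To make the seed problem concrete, here is a minimal, self-contained sketch. 
It is not Nutch's parser and not the attached patch: the class 
NofollowSketch, the method extractOutlinks and the ignoreNofollow flag are 
hypothetical names, and the regex-based parsing is a simplification that only 
handles double-quoted attributes. The flag stands in for the proposed 
parser.html.outlinks.ignore_nofollow setting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NofollowSketch {
    // Opening <a ...> tag; group 1 is the raw attribute string.
    private static final Pattern ANCHOR =
        Pattern.compile("<a\\s+([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Double-quoted href and rel attributes (simplification for this sketch).
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL =
        Pattern.compile("rel\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the href of every anchor in the page. Anchors carrying
    // rel="nofollow" are skipped unless ignoreNofollow is true.
    static List<String> extractOutlinks(String html, boolean ignoreNofollow) {
        List<String> links = new ArrayList<>();
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            String attrs = anchor.group(1);
            Matcher href = HREF.matcher(attrs);
            if (!href.find()) {
                continue; // anchor without a target
            }
            boolean nofollow = false;
            Matcher rel = REL.matcher(attrs);
            if (rel.find()) {
                // rel holds a space-separated list of tokens
                for (String token : rel.group(1).trim().split("\\s+")) {
                    if (token.equalsIgnoreCase("nofollow")) {
                        nofollow = true;
                    }
                }
            }
            if (nofollow && !ignoreNofollow) {
                continue; // current behaviour: the link is dropped
            }
            links.add(href.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // A seed page where every external link is marked nofollow,
        // as in the MediaWiki case.
        String seed =
            "<a rel=\"nofollow\" href=\"http://example.org/a\">A</a>"
          + "<a class=\"external\" rel=\"nofollow\" href=\"http://example.org/b\">B</a>";
        System.out.println(extractOutlinks(seed, false)); // prints []
        System.out.println(extractOutlinks(seed, true));  // prints [http://example.org/a, http://example.org/b]
    }
}
```

With the flag off, the seed yields no outlinks, so the crawl stops at the 
seed. With it on, both links are returned. How they should count for scoring 
is a separate question that this sketch does not address.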
                
> Add configurable option for treating nofollow behaviour.
> --------------------------------------------------------
>
>                 Key: NUTCH-693
>                 URL: https://issues.apache.org/jira/browse/NUTCH-693
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Andrew McCall
>            Priority: Minor
>         Attachments: nutch.nofollow.patch
>
>
> For my purposes I'd like to follow links even if they're marked nofollow. 
> Ideally I'd like to follow them, but not pass the link juice between them. 
> I've attached a patch that adds a configuration element 
> parser.html.outlinks.ignore_nofollow which allows the parser to ignore the 
> nofollow elements on a page. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
