[ 
https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925886#action_12925886
 ] 

Praveen Jayaraman commented on NUTCH-710:
-----------------------------------------

Hello -

I am having two problems with Nutch and am hoping that you can help me out.

a) Crawling does not use link rel="canonical" to index the links.

b) Crawling ignores robots.txt.

I am currently using Nutch 1.1 to crawl my local company site.

I have tried various settings suggested on the web forums but have been 
unable to resolve the above issues.

Can you tell me how to enable these while crawling?

I would appreciate your answer.

Thanks in advance,
Regards
Praveen.



      


> Support for rel="canonical" attribute
> -------------------------------------
>
>                 Key: NUTCH-710
>                 URL: https://issues.apache.org/jira/browse/NUTCH-710
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.1
>            Reporter: Frank McCown
>            Priority: Minor
>
> There is a new rel="canonical" attribute which is
> now supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of 
> URLs crawled and indexed and reduce duplicate page content.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
