[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649026#comment-13649026 ]

Tejas Patil commented on NUTCH-1513:
------------------------------------

One thing that I forgot to mention: the change picks up the agent names from 
http.agent.name and http.robots.agents. I could have added new configs like 
ftp.agent.name, but I don't see a point in doing that because both these 
configs would generally carry the same values, so creating new ones would just 
add to the whole nest of already existing configs. What say?
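
For illustration, a minimal sketch of how the agent names could be read from 
those existing configs (the helper class and method names here are 
hypothetical, not taken from the attached patch):

{noformat}
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper: reuse the existing HTTP agent configs for FTP
// robots parsing instead of introducing ftp.agent.name etc.
public class FtpRobotsAgents {

  // Combines http.agent.name with any extra names from http.robots.agents.
  public static String getAgentNames(Configuration conf) {
    String agentName = conf.get("http.agent.name");            // primary agent name
    String robotsAgents = conf.get("http.robots.agents", "");  // optional extra names
    return robotsAgents.isEmpty() ? agentName
                                  : agentName + "," + robotsAgents;
  }
}
{noformat}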
                
> Support Robots.txt for Ftp urls
> -------------------------------
>
>                 Key: NUTCH-1513
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1513
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.7, 2.2
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>              Labels: robots.txt
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1513.trunk.patch
>
>
> As per [0], an FTP website can have a robots.txt like [1]. In the Nutch code, 
> the Ftp plugin does not parse the robots file and accepts all URLs.
> In "_src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java_"
> {noformat}
>   public RobotRules getRobotRules(Text url, CrawlDatum datum) {
>     return EmptyRobotRules.RULES;
>   }
> {noformat}
> It's not clear if this was part of the design or if it's a bug (a possible 
> robots-aware implementation is sketched below).
> [0] : 
> https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
> [1] : ftp://example.com/robots.txt
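
For illustration, a minimal sketch of what a robots-aware getRobotRules could 
look like, assuming the crawler-commons parser is available to the plugin and 
reusing http.agent.name as discussed in the comment above (the fetchBytes 
helper is hypothetical):

{noformat}
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public BaseRobotRules getRobotRules(Text url, CrawlDatum datum) {
  try {
    URL u = new URL(url.toString());
    // robots.txt lives at the root of the FTP server, e.g. ftp://example.com/robots.txt
    URL robotsUrl = new URL("ftp", u.getHost(), u.getPort(), "/robots.txt");
    byte[] content = fetchBytes(robotsUrl);  // hypothetical helper: fetch the file over FTP
    return new SimpleRobotRulesParser().parseContent(
        robotsUrl.toString(), content, "text/plain",
        getConf().get("http.agent.name"));   // reuse the HTTP agent name
  } catch (Exception e) {
    // On any failure fall back to allow-all, matching the old behavior.
    return new SimpleRobotRules(SimpleRobotRules.RobotRulesMode.ALLOW_ALL);
  }
}
{noformat}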

