[ 
https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1718:
-------------------------------

    Attachment: NUTCH-1718-trunk.v1.patch

Thanks [~wastl-nagel] for bringing this up. I should have updated the 
documentation with NUTCH-1715 but lost track of it.

In addition to updating the documentation, I am proposing this: 
instead of requiring users to put 'http.agent.name' first in 
'http.robots.agents', make the program do that automatically. Users would 
then use 'http.robots.agents' only to specify additional agents apart from 
'http.agent.name'. Here is a patch for that.
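
To illustrate the idea only (this is not the code in the attached patch), here 
is a minimal sketch of how the agent list could be assembled. The class name 
RobotsAgents and the method merge are hypothetical, and the sketch assumes 
plain comma-separated values and exact, case-sensitive duplicate detection:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RobotsAgents {

    /**
     * Hypothetical helper: builds the list of agent names to look for in
     * robots.txt, always putting the value of http.agent.name first and
     * appending any additional names from http.robots.agents.
     */
    public static List<String> merge(String agentName, String robotsAgents) {
        // LinkedHashSet keeps insertion order and silently drops duplicates,
        // so a name repeated in http.robots.agents is not listed twice.
        Set<String> ordered = new LinkedHashSet<>();
        if (agentName != null && !agentName.trim().isEmpty()) {
            ordered.add(agentName.trim());
        }
        if (robotsAgents != null) {
            for (String agent : robotsAgents.split(",")) {
                String trimmed = agent.trim();
                if (!trimmed.isEmpty()) {
                    ordered.add(trimmed);
                }
            }
        }
        return new ArrayList<>(ordered);
    }

    public static void main(String[] args) {
        // http.agent.name = Blurfl, http.robots.agents = "BlurflDev, Blurfl"
        System.out.println(merge("Blurfl", "BlurflDev, Blurfl"));
    }
}
```

With this approach a user who sets only 'http.agent.name' gets a one-element 
list, and a user who repeats the agent name in 'http.robots.agents' does not 
end up with a duplicate entry.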

> update description of property http.robots.agent
> ------------------------------------------------
>
>                 Key: NUTCH-1718
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1718
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.7, 2.2, 2.2.1
>            Reporter: Sebastian Nagel
>            Priority: Trivial
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1718-trunk.v1.patch
>
>
> The description of property http.robots.agents in nutch-default.xml recommends 
> adding a '*' to the list of agent names. This will cause the same problem as 
> described in NUTCH-1715. The description should be updated, also regarding the 
> "order of precedence", which since NUTCH-1031 is dictated only by the ordering 
> of user agents in robots.txt.
> {code:xml}
> <property>
>   <name>http.robots.agents</name>
>   <value>*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
