[ 
https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367791#comment-14367791
 ] 

Lewis John McGibbney commented on NUTCH-1941:
---------------------------------------------

Hi Markus,
bq. I understand the usefulness regarding (academic) research purposes and/or 
doing potential clandestine crawls but i do, again, want to raise a point here 
about whether this is what we want to have in our distribution. So i am +0 for 
this feature.

Sounds like you've identified two good use cases. I originally suggested this 
issue because, even if one abides by very conservative crawl delays, 
server-side policy sometimes dictates that anything identified as a bot will 
be blocked. This issue is intended to mitigate that. I understand, however, 
that it could be used in an adverse, potentially malicious manner... there 
are a number of features in Nutch which can be used in this way, though.

bq. Regarding the feature itself, is rotating per time interval the ideal 
choice for avoiding either clandestine crawl detection or automated systems 
detecting bots? 

I am kinda unsure about the answer to this question. I don't really have a 
preference on this. I would, however, state that the idea was to completely 
change the http.agent.name, not to change it by appending a timestamp.
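
To make the "completely change the name" idea concrete, here is a minimal sketch, assuming a fixed list of agent names and a fixed time window. The class RollingAgentNames and its methods are hypothetical; they are not taken from the attached patches or from the Nutch code base, and how the chosen name would be fed into the fetcher is left open.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical helper (not part of Nutch): selects a whole agent name
 * per time window instead of appending a timestamp to a single name.
 */
final class RollingAgentNames {
    private final List<String> names;
    private final long intervalMillis;

    RollingAgentNames(List<String> names, long intervalMillis) {
        if (names == null || names.isEmpty()) {
            throw new IllegalArgumentException("at least one agent name is required");
        }
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        // Defensive copy so later changes to the caller's list have no effect.
        this.names = Collections.unmodifiableList(new ArrayList<>(names));
        this.intervalMillis = intervalMillis;
    }

    /** Returns the agent name for the window that contains the given timestamp. */
    String nameAt(long epochMillis) {
        long window = Math.floorDiv(epochMillis, intervalMillis);
        int index = (int) Math.floorMod(window, (long) names.size());
        return names.get(index);
    }

    /** Returns the agent name for the current wall-clock time. */
    String current() {
        return nameAt(System.currentTimeMillis());
    }
}
```

With a 5 second window (5000 ms) and three names, requests made in consecutive windows would get the first, second and third name in turn, then wrap around to the first. Picking names at random rather than in order would be an equally plausible variant.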

bq. Do any of you have access to such detection systems or have the know-how on 
how they operate? My gut tells me a very irregular fetch interval and much more 
sophisticated generator (hopefully not more than a FetchSchedule impl.) would 
get us further, of course, having a rotating UserAgent and probably IP rotation.

I'm kinda confused again here Markus. Are you talking about detecting bots 
crawling your server? Can you please clarify?

bq. Lewis, the hyperlink you reference is a very static approach for blocking 
bots that actually identify themselves. Their solution is easily mitigated by 
announcing one's crawler as a regular web browser.

Yes, this is part of the issue. There is nothing currently stopping anyone 
from crawling with a browser user agent name as their Nutch user agent name. 
However, this again would be a static http.agent.name. I am talking about a 
revolving set of names.

bq. Regarding the patch, it contains a lot of clutter about class paths which i 
am unfamiliar with. It doesn't look like a trunk patch and i don't remember 2x 
having these files. Do we need them?

Hopefully the next iteration will address this.
Thanks Markus



> Optional rolling http.agent.name's
> ----------------------------------
>
>                 Key: NUTCH-1941
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1941
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, protocol
>            Reporter: Lewis John McGibbney
>            Priority: Trivial
>         Attachments: NUTCH-1941-ver1.patch, agent.names.txt, nutch.patch
>
>
> In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins 
> can block your fetcher based merely on your crawler name. 
> I propose the ability to implement rolling http.agent.name's which could be 
> substituted every 5 seconds for example. This would mean that successive 
> requests to the same domain would be sent with different http.agent.name. 
> This behavior should be off by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
