[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367791#comment-14367791 ]
Lewis John McGibbney commented on NUTCH-1941:
---------------------------------------------

Hi Markus,

bq. I understand the usefulness regarding (academic) research purposes and/or doing potential clandestine crawls but i do, again, want to raise a point here about whether this is what we want to have in our distribution. So i am +0 for this feature.

Sounds like you've identified two good use cases. I originally suggested this issue because, even if one abides by very conservative crawl delays, server-side policy sometimes dictates that anything identified as a bot will be blocked. This issue is intended to mitigate that. I understand, however, that it could be used in an adverse, potentially malicious manner... there are a number of features in Nutch which can be used in this way, though.

bq. Regarding the feature itself, is rotating per time interval the ideal choice for avoiding either clandestine crawl detection or automated systems detecting bots? I am kinda unsure about the answer to this question.

I don't really have a preference on this. I would, however, state that the idea was to completely change the http.agent.name, not to change it by appending a timestamp.

bq. Do any of you have access to such detection systems or have the know-how on how they operate? My gut tells me a very irregular fetch interval and much more sophisticated generator (hopefully not more than a FetchSchedule impl.) would get us further, of course, having a rotating UserAgent and probably IP rotation.

I'm kinda confused again here, Markus. Are you talking about detecting bots crawling your server? Can you please clarify?

bq. Lewis, the hyperlink you reference is a very static approach for blocking bots that actually identify themselves. Their solution is easily mitigated by announcing one's crawler as a regular web browser.

Yes, this is part of the issue.
There is nothing currently stopping anyone from crawling with a browser user agent name as the Nutch user agent name. However, this again would be a static http.agent.name. I am talking about a revolving set of names.

bq. Regarding the patch, it contains a lot of clutter about class paths which i am unfamiliar with. It doesn't look like a trunk patch and i don't remember 2x having these files. Do we need them?

Hopefully the next iteration will address this.

Thanks Markus

> Optional rolling http.agent.name's
> ----------------------------------
>
>                 Key: NUTCH-1941
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1941
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, protocol
>            Reporter: Lewis John McGibbney
>            Priority: Trivial
>         Attachments: NUTCH-1941-ver1.patch, agent.names.txt, nutch.patch
>
>
> In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins
> can block your fetcher based merely on your crawler name.
> I propose the ability to implement rolling http.agent.name's which could be
> substituted every 5 seconds for example. This would mean that successive
> requests to the same domain would be sent with different http.agent.name.
> This behavior should be off by default.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
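
For illustration only, the idea in the issue description (replacing the whole http.agent.name on a fixed interval, rather than appending a timestamp) could be sketched roughly as below. This is not taken from any of the attached patches. The class name RollingAgentName, the method nameAt, and the round-robin selection are all assumptions made for the example; the thread does not settle how a name should be chosen, and the example names are placeholders.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical helper, not part of Nutch or of the attached patches.
 * Chooses one agent name from a fixed list according to the time window
 * that the given timestamp falls into.
 */
public class RollingAgentName {

    // Candidate agent names (assumption: supplied by the caller, e.g. read from a file).
    private final List<String> names;
    // How long each name stays active, in milliseconds.
    private final long intervalMillis;

    public RollingAgentName(List<String> names, long intervalMillis) {
        if (names == null || names.isEmpty()) {
            throw new IllegalArgumentException("at least one agent name is required");
        }
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.names = names;
        this.intervalMillis = intervalMillis;
    }

    /** Returns the name active at the given time (milliseconds since the epoch). */
    public String nameAt(long nowMillis) {
        // Number of whole intervals elapsed; floorDiv keeps negative inputs well-defined.
        long window = Math.floorDiv(nowMillis, intervalMillis);
        // Round-robin over the list (an assumption; random choice would also fit the proposal).
        int index = (int) Math.floorMod(window, (long) names.size());
        return names.get(index);
    }

    public static void main(String[] args) {
        // Placeholder names; 5000 ms matches the "every 5 seconds" example in the issue.
        RollingAgentName rolling = new RollingAgentName(
                Arrays.asList("agent-one", "agent-two", "agent-three"), 5000L);
        System.out.println(rolling.nameAt(System.currentTimeMillis()));
    }
}
```

Deriving the name from the clock, rather than from a counter, means every fetcher thread sees the same name within a window without shared state. Whether that is desirable for the use cases discussed above is an open question in this thread.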