[
https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367791#comment-14367791
]
Lewis John McGibbney commented on NUTCH-1941:
---------------------------------------------
Hi Markus,
bq. I understand the usefulness regarding (academic) research purposes and/or
doing potential clandestine crawls but I do, again, want to raise a point here
about whether this is what we want to have in our distribution. So I am +0 for
this feature.
Sounds like you've identified two good use cases. I originally suggested this
issue because, even if one abides by very conservative crawl delays, server
side policy sometimes dictates that anything identified as a bot will be
blocked. This issue is intended to mitigate that. I understand however that it
could possibly be used in an adverse, potentially malicious manner... there are
a number of features in Nutch which can be used in this way though.
bq. Regarding the feature itself, is rotating per time interval the ideal
choice for avoiding either clandestine crawl detection or automated systems
detecting bots?
I am kinda unsure about the answer to this question. I don't really have a
preference on this. I would however state that the idea was to completely
change the http.agent.name, not to modify it by appending a timestamp.
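To make the distinction concrete, here is a minimal sketch of time-interval
rotation that swaps in a completely different name for each interval rather
than appending anything to a fixed one. The class RollingAgentNames and its
methods are hypothetical and are not part of Nutch or of the attached patch;
the names are passed in directly, and reading them from a file such as
agent.names.txt is left out. The cyclic order is only one possible choice.

```java
import java.util.List;

/**
 * Hypothetical helper (not a Nutch class): picks one agent name per time
 * slot from a fixed list, cycling through the list in order.
 */
final class RollingAgentNames {
    private final List<String> names;
    private final long intervalMillis;

    RollingAgentNames(List<String> names, long intervalMillis) {
        if (names.isEmpty() || intervalMillis <= 0) {
            throw new IllegalArgumentException(
                "need at least one name and a positive interval");
        }
        this.names = List.copyOf(names);
        this.intervalMillis = intervalMillis;
    }

    /** Returns the name assigned to the time slot containing epochMillis. */
    String nameAt(long epochMillis) {
        long slot = Math.floorDiv(epochMillis, intervalMillis);
        int index = (int) Math.floorMod(slot, (long) names.size());
        return names.get(index);
    }

    /** Returns the name for the current wall-clock time. */
    String currentName() {
        return nameAt(System.currentTimeMillis());
    }
}
```

With a 5 second interval (5000 ms), every request made within the same 5 second
window gets the same name, and the next window gets the next name in the list.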
bq. Do any of you have access to such detection systems or have the know-how on
how they operate? My gut tells me a very irregular fetch interval and a much
more sophisticated generator (hopefully not more than a FetchSchedule impl.)
would get us further, along with, of course, a rotating UserAgent and probably
IP rotation.
I'm kinda confused again here Markus. Are you talking about detecting bots
crawling your server? Can you please clarify?
bq. Lewis, the hyperlink you reference is a very static approach for blocking
bots that actually identify themselves. Their solution is easily mitigated by
announcing one's crawler as a regular web browser.
Yes, this is part of the issue. There is nothing currently stopping anyone from
crawling with a browser user agent name as the Nutch user agent name. However,
this again would be a static http.agent.name. I am talking about a revolving
set of names.
bq. Regarding the patch, it contains a lot of clutter about class paths which I
am unfamiliar with. It doesn't look like a trunk patch and I don't remember 2x
having these files. Do we need them?
Hopefully the next iteration will address this.
Thanks Markus
> Optional rolling http.agent.name's
> ----------------------------------
>
> Key: NUTCH-1941
> URL: https://issues.apache.org/jira/browse/NUTCH-1941
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, protocol
> Reporter: Lewis John McGibbney
> Priority: Trivial
> Attachments: NUTCH-1941-ver1.patch, agent.names.txt, nutch.patch
>
>
> In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins
> can block your fetcher based merely on your crawler name.
> I propose the ability to implement rolling http.agent.name's which could be
> substituted every 5 seconds for example. This would mean that successive
> requests to the same domain would be sent with different http.agent.name values.
> This behavior should be off by default.