[
https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378994#comment-14378994
]
Asitang Mishra commented on NUTCH-1941:
---------------------------------------
Hi Sebastian,
Solution 1 is what the patch does right now: it uses a single agent name a
random number of times before switching to another name. The names are
currently chosen one after another by index, though; we could also pick them
randomly. So we can:
1. choose a random interval during which a particular agent name is kept
(already in the patch);
2. choose the next agent name randomly as well (easy to implement).
We then need just one counter that counts how many times the current agent
name has been used. Once the name has been used the required random number of
times, we generate another random number and restart the counter from it.
Basically I am suggesting this:
{code}
private void rotateAgentName() {
  // agents file was blank: nothing to rotate, keep the current agent name
  if (useragentnames.isEmpty()) {
    return;
  }
  if (urlCount <= 0) { // counter has run out
    // set a new user agent name from a random index;
    // nextInt(n) already excludes n, so no "- 1" is needed
    userAgent = useragentnames.get(random.nextInt(useragentnames.size()));
    // generate a random number between 1 and rotationInterval (read from
    // the nutch properties) and restart the counter from it
    urlCount = random.nextInt(rotationInterval) + 1;
  } else {
    // decrement the counter
    urlCount--;
  }
}
{code}
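For anyone who wants to try the idea outside of Nutch, here is a minimal
self-contained sketch of the same scheme. The class name AgentNameRotator, the
method next() and the constructor parameters are hypothetical and not part of
the patch. Unlike the snippet above, it decrements on every call, so each
drawn interval lasts exactly that many requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of the patch: keeps one agent name for a
// random number of requests, then picks the next name at random.
class AgentNameRotator {
  private final List<String> names;
  private final int maxInterval; // plays the role of rotationInterval
  private final Random random;
  private String current;
  private int remaining = 0;     // plays the role of urlCount

  AgentNameRotator(List<String> names, int maxInterval, Random random) {
    if (names.isEmpty()) {
      throw new IllegalArgumentException("no agent names configured");
    }
    if (maxInterval < 1) {
      throw new IllegalArgumentException("maxInterval must be >= 1");
    }
    this.names = new ArrayList<>(names);
    this.maxInterval = maxInterval;
    this.random = random;
  }

  // Returns the agent name to use for the next request.
  String next() {
    if (remaining <= 0) {
      // pick a name at a random index in [0, size)
      current = names.get(random.nextInt(names.size()));
      // keep it for a random number of requests in [1, maxInterval]
      remaining = random.nextInt(maxInterval) + 1;
    }
    remaining--;
    return current;
  }
}
```

Note that two consecutive draws may land on the same name, so a name can
appear for longer than maxInterval requests in a row.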
I don't think this code needs synchronization (as you suggested in your first
comment). Concurrent threads may occasionally mess up the counter value, but
that won't cause anything serious: at worst a name is kept for a few requests
more or fewer than intended.
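To illustrate that point, here is a rough sketch (not taken from the patch)
that calls an unsynchronized rotation method from several threads and counts
how often a name outside the list is observed. The class RacyRotationDemo and
the method countInvalid are made up for this example. It relies on
java.util.Random being thread-safe and on reference writes being atomic in
Java, so the race can only distort the counter, never the name itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of the patch.
class RacyRotationDemo {
  private final List<String> names;
  private final int rotationInterval;
  private final Random random = new Random(); // java.util.Random is thread-safe
  private String userAgent;  // deliberately not volatile, as in the proposal
  private int urlCount = 0;  // deliberately unsynchronized

  RacyRotationDemo(List<String> names, int rotationInterval) {
    this.names = names;
    this.rotationInterval = rotationInterval;
    this.userAgent = names.get(0);
  }

  // Same logic as the proposed rotateAgentName(), returning the current name.
  String rotateAndGet() {
    if (urlCount <= 0) {
      userAgent = names.get(random.nextInt(names.size()));
      urlCount = random.nextInt(rotationInterval) + 1;
    } else {
      urlCount--; // updates may be lost under contention; that is tolerated
    }
    return userAgent;
  }

  // Runs the rotation from several threads; returns how many times a name
  // that is not in the list was observed.
  static int countInvalid(int threads, int callsPerThread) {
    List<String> names = Arrays.asList("agent-a", "agent-b", "agent-c");
    RacyRotationDemo demo = new RacyRotationDemo(names, 5);
    AtomicInteger invalid = new AtomicInteger();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        for (int j = 0; j < callsPerThread; j++) {
          if (!names.contains(demo.rotateAndGet())) {
            invalid.incrementAndGet();
          }
        }
      });
      workers[i].start();
    }
    for (Thread t : workers) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting", e);
      }
    }
    return invalid.get();
  }
}
```

If exact interval lengths ever mattered, declaring the rotation method
synchronized would be the simplest change.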
What do you think?
> Optional rolling http.agent.name's
> ----------------------------------
>
> Key: NUTCH-1941
> URL: https://issues.apache.org/jira/browse/NUTCH-1941
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, protocol
> Reporter: Lewis John McGibbney
> Priority: Trivial
> Attachments: NUTCH-1941-ITR2.patch, NUTCH-1941-itr3.patch,
> NUTCH-1941-ver1.patch, agent.names.txt, nutch.patch
>
>
> In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins
> can block your fetcher based merely on your crawler name.
> I propose the ability to implement rolling http.agent.name's which could be
> substituted every 5 seconds for example. This would mean that successive
> requests to the same domain would be sent with different http.agent.name.
> This behavior should be off by default.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)