[ 
https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378994#comment-14378994
 ] 

Asitang Mishra commented on NUTCH-1941:
---------------------------------------

Hi Sebastian,

Solution 1 is what the patch does right now: it uses a single agent name a 
random number of times before switching to another name. The names are 
currently picked one after another by index, though. We could also pick them 
randomly. So we can:

1. choose a random interval for which to keep using a particular agent 
name (already in the patch);
2. choose the next agent name randomly as well (easy to implement).

We then need just one counter that counts how many times the current agent 
name has been used. Once the name has been used the required random number of 
times, we pick a new name, generate another random number, and restart the 
counter with it.

Basically I am suggesting this:

{code}
private void rotateAgentName() {
    if (useragentnames.isEmpty()) {
        // the agents file was blank: keep the current agent name
        return;
    }
    if (urlCount <= 0) { // the counter has run out
        // set a new user agent name from a random index; nextInt(n) already
        // excludes n, so the bound is size(), not size() - 1
        userAgent = useragentnames.get(random.nextInt(useragentnames.size()));

        // generate a random number between 1 and rotationInterval (read from
        // the nutch properties) and set the counter
        urlCount = random.nextInt(rotationInterval) + 1;
    } else {
        // decrement the counter
        urlCount--;
    }
}
{code}
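
For anyone who wants to try the idea outside Nutch, here is a minimal 
stand-alone sketch of the same counter logic. The class name AgentRotator, its 
constructor and the next() method are made up for illustration and are not 
part of the patch. It also decrements the counter on the call that picks a 
name, so each draw keeps a name for between 1 and rotationInterval requests. 
That is a small deviation from the snippet above, not something the patch is 
known to do.

```java
import java.util.List;
import java.util.Random;

/** Hypothetical stand-alone version of the rotation logic (not Nutch code). */
final class AgentRotator {
    private final List<String> names;    // stands in for useragentnames
    private final int rotationInterval;  // upper bound for the random interval
    private final Random random;
    private String userAgent;
    private int urlCount = 0;            // 0 forces a pick on the first call

    AgentRotator(List<String> names, int rotationInterval, Random random) {
        if (names.isEmpty() || rotationInterval < 1) {
            throw new IllegalArgumentException(
                "need at least one agent name and an interval >= 1");
        }
        this.names = names;
        this.rotationInterval = rotationInterval;
        this.random = random;
    }

    /** Returns the agent name to use for the next request. */
    String next() {
        if (urlCount <= 0) {
            // pick a name at a random index; the bound size() is exclusive
            userAgent = names.get(random.nextInt(names.size()));
            // keep it for a random number of requests in [1, rotationInterval]
            urlCount = random.nextInt(rotationInterval) + 1;
        }
        urlCount--;
        return userAgent;
    }
}
```

Calling next() once per request returns the name to send. Because the index is 
drawn independently each time, the same name can be picked twice in a row.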

I don't think this code needs synchronization (as you suggested in your first 
comment). Concurrent threads may occasionally mess up the counter value, but 
that won't cause anything serious.
What do you think?




> Optional rolling http.agent.name's
> ----------------------------------
>
>                 Key: NUTCH-1941
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1941
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, protocol
>            Reporter: Lewis John McGibbney
>            Priority: Trivial
>         Attachments: NUTCH-1941-ITR2.patch, NUTCH-1941-itr3.patch, 
> NUTCH-1941-ver1.patch, agent.names.txt, nutch.patch
>
>
> In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins 
> can block your fetcher based merely on your crawler name. 
> I propose the ability to implement rolling http.agent.name's which could be 
> substituted every 5 seconds for example. This would mean that successive 
> requests to the same domain would be sent with different http.agent.name. 
> This behavior should be off by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
