[ 
https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380959#comment-14380959
 ] 

Asitang Mishra commented on NUTCH-1941:
---------------------------------------

I see the problem now. Basically, the code should not keep any shared state, 
because each thread may be working from its own cached copy of that state, and 
that copy is not guaranteed to change when another thread changes the state.
So you are saying synchronization would be too messy. (But it would also solve 
the visibility problem, right?)
But if we use a simple implementation, won't it be too predictable, and won't 
it have to rotate the agent every time?
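
To make the "simple implementation" concrete, here is a minimal sketch of what 
I understand by it. It keeps no mutable state at all, so there is nothing for 
threads to see inconsistently, but it also picks a new agent on every call. 
The class name StatelessAgentPicker and its method are hypothetical and are 
not taken from any of the attached patches.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper (not part of Nutch): picks an agent name without any shared mutable state. */
final class StatelessAgentPicker {
    // Immutable copy held in a final field, so it is safe to share across fetcher threads.
    private final List<String> agentNames;

    StatelessAgentPicker(List<String> agentNames) {
        if (agentNames.isEmpty()) {
            throw new IllegalArgumentException("need at least one agent name");
        }
        this.agentNames = List.copyOf(agentNames);
    }

    /** Returns a randomly chosen name; each call is independent of every earlier call. */
    String next() {
        // ThreadLocalRandom gives each thread its own generator, so no locking is needed.
        int index = ThreadLocalRandom.current().nextInt(agentNames.size());
        return agentNames.get(index);
    }
}
```

Calling next() once per request gives a different (random) agent each time, 
which is exactly the "rotate every time" behaviour I am worried about.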

I see two solutions right now:

1. Along with choosing an index, also decide with some probability whether we 
want to change the agent right now or not.

2. Use volatile variables. (Do you think they are an answer to the visibility 
problem?)
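
The two ideas can be combined; below is a rough sketch under my own 
assumptions. The class RollingAgent and the parameter changeProbability are 
hypothetical names, not from the patches. As far as I know, a volatile field 
guarantees that a write by one thread is visible to later reads by other 
threads, but it does not make compound read-modify-write updates atomic. The 
sketch avoids that case by only ever writing a freshly chosen index.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch (not part of Nutch): keeps one shared agent and changes it only occasionally. */
final class RollingAgent {
    private final List<String> agentNames;
    private final double changeProbability;

    // volatile: a new index written by one thread is visible to later reads in other threads.
    private volatile int currentIndex = 0;

    RollingAgent(List<String> agentNames, double changeProbability) {
        if (agentNames.isEmpty()) {
            throw new IllegalArgumentException("need at least one agent name");
        }
        if (changeProbability < 0.0 || changeProbability > 1.0) {
            throw new IllegalArgumentException("changeProbability must be in [0, 1]");
        }
        this.agentNames = List.copyOf(agentNames);
        this.changeProbability = changeProbability;
    }

    /** Returns the agent to use now, switching to a random one with the configured probability. */
    String current() {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        if (rnd.nextDouble() < changeProbability) {
            // A plain write of a new value: no read-modify-write, so volatile is sufficient here.
            currentIndex = rnd.nextInt(agentNames.size());
        }
        return agentNames.get(currentIndex);
    }
}
```

With changeProbability set to 0.0 the agent never changes, and with 1.0 it is 
re-picked on every call, so the value in between controls how predictable the 
rotation is.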


> Optional rolling http.agent.name's
> ----------------------------------
>
>                 Key: NUTCH-1941
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1941
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, protocol
>            Reporter: Lewis John McGibbney
>            Priority: Trivial
>         Attachments: NUTCH-1941-ITR2.patch, NUTCH-1941-itr3.patch, 
> NUTCH-1941-ver1.patch, agent.names.txt, nutch.patch
>
>
> In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins 
> can block your fetcher based merely on your crawler name. 
> I propose the ability to implement rolling http.agent.name's which could be 
> substituted every 5 seconds for example. This would mean that successive 
> requests to the same domain would be sent with different http.agent.name. 
> This behavior should be off by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
