[ 
https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382974#comment-14382974
 ] 

Asitang Mishra commented on NUTCH-1941:
---------------------------------------

I looked into why it is not working for protocol-httpclient.

Unlike protocol-http (whose classes are in the package 
org.apache.nutch.protocol.http), protocol-httpclient uses the following two 
classes, which live in a different package:

package org.apache.nutch.protocol.httpclient; --> Http.java
package org.apache.nutch.protocol.httpclient; --> HttpResponse.java

A single shared HttpClient is used. It is given all of its headers (including 
"User-Agent") once, when it is configured, and they are not looked up again 
afterwards. So the headers do not change from request to request.



class Http:

{code}

/* 1. */ private static HttpClient client = new HttpClient(connectionManager);
// the client is initialized once, as a static field


/***********************/

/* 2. In configureClient(): the headers are set on the client's host
      configuration */

    HostConfiguration hostConf = client.getHostConfiguration();

    ArrayList<Header> headers = new ArrayList<Header>();
    // Set the User Agent in the header
    headers.add(new Header("User-Agent", getUserAgent()));
    hostConf.getParams().setParameter("http.default-headers", headers);

/******************/

/* 3. The HttpResponse object calls this function to get the client and
      execute the request with it. */
  static synchronized HttpClient getClient() {
    return client;
  }


 {code}

What do you think is the best way to rotate the agent here? One way I see is to 
read the headers already set on the client, change the User-Agent, and set them 
again. Do you see anything more direct?
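
To make the question concrete, here is a minimal sketch of the selection part 
only: choosing which agent name applies at a given moment. RollingAgent is a 
hypothetical helper, not an existing Nutch class, and the time-slot approach is 
just one assumption about how "rolling" could work. It uses only the JDK and 
does not touch HttpClient.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Nutch): picks an agent name per time slot. */
final class RollingAgent {
  private final List<String> names;
  private final long intervalMillis;

  RollingAgent(List<String> names, long intervalMillis) {
    if (names == null || names.isEmpty()) {
      throw new IllegalArgumentException("at least one agent name is required");
    }
    if (intervalMillis <= 0) {
      throw new IllegalArgumentException("interval must be positive");
    }
    // Defensive copy so later changes to the caller's list have no effect.
    this.names = new ArrayList<String>(names);
    this.intervalMillis = intervalMillis;
  }

  /** Returns the agent name for the time slot that contains nowMillis. */
  String current(long nowMillis) {
    long slot = Math.floorDiv(nowMillis, intervalMillis);
    int index = (int) Math.floorMod(slot, (long) names.size());
    return names.get(index);
  }
}
```

With something like this, the "refill" idea would amount to building a fresh 
header list with new Header("User-Agent", 
rolling.current(System.currentTimeMillis())) and setting 
"http.default-headers" again before a request. I have not tested whether that 
is safe with the shared static client when several fetcher threads run at once.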









> Optional rolling http.agent.name's
> ----------------------------------
>
>                 Key: NUTCH-1941
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1941
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, protocol
>            Reporter: Lewis John McGibbney
>            Priority: Trivial
>         Attachments: NUTCH-1941-ITR2.patch, NUTCH-1941-itr3.patch, 
> NUTCH-1941-itr4.patch, NUTCH-1941-v5.patch, NUTCH-1941-ver1.patch, 
> agent.names.txt, nutch.patch
>
>
> In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins 
> can block your fetcher based merely on your crawler name. 
> I propose the ability to implement rolling http.agent.name's which could be 
> substituted every 5 seconds for example. This would mean that successive 
> requests to the same domain would be sent with different http.agent.name. 
> This behavior should be off by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
