[
https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382974#comment-14382974
]
Asitang Mishra commented on NUTCH-1941:
---------------------------------------
I looked into why it's not working for protocol-httpclient.
Unlike the normal http protocol (which is in the package
org.apache.nutch.protocol.http), protocol-httpclient uses the following two
classes in a different package:
package org.apache.nutch.protocol.httpclient; --> Http.java
package org.apache.nutch.protocol.httpclient; --> HttpResponse.java
A single shared HttpClient is used. It is filled with all the headers once,
up front (including "User-Agent"), and never asks for them again. So the
headers are not changed with each request.
class Http:
{code}
// 1. The client is initialized once, as a static field.
private static HttpClient client = new HttpClient(connectionManager);

// 2. In configureClient(), the headers are set into the client configuration.
HostConfiguration hostConf = client.getHostConfiguration();
ArrayList<Header> headers = new ArrayList<Header>();
// Set the User Agent in the header
headers.add(new Header("User-Agent", getUserAgent()));
hostConf.getParams().setParameter("http.default-headers", headers);

// 3. The HttpResponse object uses this function to get the client and
//    execute the request.
static synchronized HttpClient getClient() {
  return client;
}
{code}
What do you think is the best way to rotate the agent here? One way I see is to
read the headers already set on the client, change the User-Agent, and set them
again. Do you see anything more direct?
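
To make the "read, change, refill" idea concrete, here is a rough sketch in
plain Java. It is not taken from any of the attached patches. The Header class
below is a minimal stand-in for the HttpClient Header type, and AgentRotator and
withUserAgent are hypothetical names used only for illustration. The sketch
assumes the agent names are already loaded into a list and that round-robin
order is acceptable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class AgentRotationSketch {

  /** Minimal stand-in for a header name/value pair (hypothetical, not the HttpClient class). */
  static final class Header {
    final String name;
    final String value;

    Header(String name, String value) {
      this.name = name;
      this.value = value;
    }
  }

  /** Hands out agent names in round-robin order; the counter is atomic so fetcher threads can share it. */
  static final class AgentRotator {
    private final List<String> agents;
    private final AtomicInteger next = new AtomicInteger();

    AgentRotator(List<String> agents) {
      if (agents.isEmpty()) {
        throw new IllegalArgumentException("need at least one agent name");
      }
      this.agents = new ArrayList<>(agents);
    }

    String nextAgent() {
      // floorMod keeps the index valid even after the counter overflows.
      int i = Math.floorMod(next.getAndIncrement(), agents.size());
      return agents.get(i);
    }
  }

  /**
   * Returns a new list: every header except User-Agent is copied over,
   * then a single User-Agent header with the given value is appended.
   * The input list is left unmodified.
   */
  static List<Header> withUserAgent(List<Header> current, String agent) {
    List<Header> refreshed = new ArrayList<>();
    for (Header h : current) {
      if (!"User-Agent".equalsIgnoreCase(h.name)) {
        refreshed.add(h);
      }
    }
    refreshed.add(new Header("User-Agent", agent));
    return refreshed;
  }

  public static void main(String[] args) {
    // Hypothetical agent names; in Nutch these would come from configuration.
    AgentRotator rotator = new AgentRotator(Arrays.asList("agent-one", "agent-two"));

    List<Header> headers = new ArrayList<>();
    headers.add(new Header("Accept-Language", "en"));
    headers.add(new Header("User-Agent", "initial-agent"));

    // Simulate three requests, refreshing the header list before each one.
    for (int request = 0; request < 3; request++) {
      headers = withUserAgent(headers, rotator.nextAgent());
      for (Header h : headers) {
        System.out.println(h.name + ": " + h.value);
      }
      System.out.println();
    }
  }
}
```

In Http.java the refreshed list would presumably be written back with the same
setParameter("http.default-headers", ...) call shown above. Since the client is
static and shared, that write would need to be coordinated between fetcher
threads, which is why something more direct may still be preferable.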
> Optional rolling http.agent.name's
> ----------------------------------
>
> Key: NUTCH-1941
> URL: https://issues.apache.org/jira/browse/NUTCH-1941
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, protocol
> Reporter: Lewis John McGibbney
> Priority: Trivial
> Attachments: NUTCH-1941-ITR2.patch, NUTCH-1941-itr3.patch,
> NUTCH-1941-itr4.patch, NUTCH-1941-v5.patch, NUTCH-1941-ver1.patch,
> agent.names.txt, nutch.patch
>
>
> In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins
> can block your fetcher based merely on your crawler name.
> I propose the ability to implement rolling http.agent.name's which could be
> substituted every 5 seconds for example. This would mean that successive
> requests to the same domain would be sent with different http.agent.name.
> This behavior should be off by default.