Adrian Newby created NUTCH-1836:
-----------------------------------
Summary: Timeouts in protocol-httpclient when crawling same host
with >2 threads; NUTCH-1613 is not a complete solution
Key: NUTCH-1836
URL: https://issues.apache.org/jira/browse/NUTCH-1836
Project: Nutch
Issue Type: Improvement
Components: protocol
Affects Versions: 1.9
Reporter: Adrian Newby
Priority: Minor
NUTCH-1613 fixed the hardcoded limit of 2 threads per host in
protocol-httpclient. However, simply extending that limit to the hardwired
maximum of 10 threads and allocating them all to a single host is only a
partial solution. It is still possible to exhaust the thread pool and observe
timeouts, depending on the settings of:
- fetcher.threads.per.host (nutch-site.xml)
- mapred.tasktracker.map.tasks.maximum (mapred-site.xml)
It would perhaps be more robust to derive the size of the httpclient
connection pool from these two configuration parameters, as below:
{code}
params.setMaxTotalConnections(maxThreadsTotal);

// Add the following lines ...
// --------------------------------------------------------------------------------
// Modification to increase the number of available connections for
// multi-threaded crawls.
// --------------------------------------------------------------------------------
connectionManager.setMaxConnectionsPerHost(
    conf.getInt("fetcher.threads.per.host", 10));
connectionManager.setMaxTotalConnections(
    conf.getInt("mapred.tasktracker.map.tasks.maximum", 5)
        * conf.getInt("fetcher.threads.per.host", 10));
LOG.debug("setMaxConnectionsPerHost: "
    + connectionManager.getMaxConnectionsPerHost());
LOG.debug("setMaxTotalConnections : "
    + connectionManager.getMaxTotalConnections());
// --------------------------------------------------------------------------------
{code}
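
To make the sizing arithmetic in the snippet above easier to check in isolation, here is a minimal standalone sketch. It is not the Nutch code: the class name ConnectionPoolSizing and its helper methods are hypothetical, java.util.Properties merely stands in for the Hadoop Configuration object, and clamping values to at least 1 as well as falling back to the default on an unparsable value are assumptions added for illustration. Only the two property names and the defaults (10 and 5) come from the proposal.

```java
import java.util.Properties;

// Hypothetical illustration only; not part of protocol-httpclient.
public class ConnectionPoolSizing {

    // Stand-in for conf.getInt(name, default). Assumption: unset or
    // unparsable values fall back to the supplied default.
    public static int getInt(Properties conf, String key, int defaultValue) {
        String raw = conf.getProperty(key);
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    // Per-host limit: one connection per fetcher thread allowed on a host.
    // Clamping to at least 1 is an added safeguard, not in the proposal.
    public static int maxConnectionsPerHost(Properties conf) {
        return Math.max(1, getInt(conf, "fetcher.threads.per.host", 10));
    }

    // Total limit: map task slots multiplied by the per-host limit,
    // mirroring the product used in the proposed patch.
    public static int maxTotalConnections(Properties conf) {
        int mapTasks = Math.max(1,
            getInt(conf, "mapred.tasktracker.map.tasks.maximum", 5));
        return Math.multiplyExact(mapTasks, maxConnectionsPerHost(conf));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("fetcher.threads.per.host", "20");
        conf.setProperty("mapred.tasktracker.map.tasks.maximum", "4");
        System.out.println("per host: " + maxConnectionsPerHost(conf));
        System.out.println("total   : " + maxTotalConnections(conf));
    }
}
```

With neither property set, the sketch yields 10 connections per host and 5 * 10 = 50 in total, which are the same defaults the proposed change would apply.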
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)