Adrian Newby created NUTCH-1836:
-----------------------------------

             Summary: Timeouts in protocol-httpclient when crawling same host with >2 threads; NUTCH-1613 is not a complete solution
                 Key: NUTCH-1836
                 URL: https://issues.apache.org/jira/browse/NUTCH-1836
             Project: Nutch
          Issue Type: Improvement
          Components: protocol
    Affects Versions: 1.9
            Reporter: Adrian Newby
            Priority: Minor


NUTCH-1613 provided a fix for the hardcoded limit of 2 threads in 
protocol-httpclient.  However, simply extending the hardwired maximum of 10 
threads and allocating them all to a single host provides only a partial 
solution.  It is still possible to exhaust the thread pool and observe 
timeouts, depending on the settings of:

 - fetcher.threads.per.host (nutch-site.xml)
 - mapred.tasktracker.map.tasks.maximum (mapred-site.xml)

It would perhaps be more robust to derive the size of the httpclient thread 
pool from these two configuration parameters, as below:



{code}
    params.setMaxTotalConnections(maxThreadsTotal);

    // Add the following lines ...

    // --------------------------------------------------------------------------------
    // Modification to increase the number of available connections for
    // multi-threaded crawls.
    // --------------------------------------------------------------------------------
    connectionManager.setMaxConnectionsPerHost(
        conf.getInt("fetcher.threads.per.host", 10));
    connectionManager.setMaxTotalConnections(
        conf.getInt("mapred.tasktracker.map.tasks.maximum", 5)
            * conf.getInt("fetcher.threads.per.host", 10));
    LOG.debug("setMaxConnectionsPerHost: "
        + connectionManager.getMaxConnectionsPerHost());
    LOG.debug("setMaxTotalConnections  : "
        + connectionManager.getMaxTotalConnections());
    // --------------------------------------------------------------------------------
{code}
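
To make the arithmetic behind the proposed change easier to check, here is a 
minimal, standalone sketch of the same sizing rule.  It is not part of the 
patch: the class name ConnectionPoolSizing and its helper methods are 
hypothetical, and a plain Map stands in for the Hadoop/Nutch Configuration 
object.  Only the two property names and the fallback values (10 and 5) are 
taken from the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the proposed sizing rule.
// A Map<String, String> stands in for the real configuration object.
class ConnectionPoolSizing {

    // Fallback values copied from the proposed patch.
    static final int DEFAULT_THREADS_PER_HOST = 10;
    static final int DEFAULT_MAP_TASKS = 5;

    // Mimics conf.getInt(key, default): returns the default when the key is unset.
    static int getInt(Map<String, String> conf, String key, int defaultValue) {
        String value = conf.get(key);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }

    // Per-host limit follows fetcher.threads.per.host.
    static int maxConnectionsPerHost(Map<String, String> conf) {
        return getInt(conf, "fetcher.threads.per.host", DEFAULT_THREADS_PER_HOST);
    }

    // Total limit = concurrent map tasks * fetcher threads per host.
    static int maxTotalConnections(Map<String, String> conf) {
        int mapTasks = getInt(conf, "mapred.tasktracker.map.tasks.maximum",
                DEFAULT_MAP_TASKS);
        return mapTasks * maxConnectionsPerHost(conf);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("fetcher.threads.per.host", "20");
        conf.put("mapred.tasktracker.map.tasks.maximum", "4");
        System.out.println("per host: " + maxConnectionsPerHost(conf));
        System.out.println("total   : " + maxTotalConnections(conf));
    }
}
```

With neither property set, the sketch falls back to 10 connections per host 
and 5 * 10 = 50 connections in total, which matches the defaults in the 
snippet above.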



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
