On Jan 5, 2013, at 3:36pm, Oleg Kalnichevski wrote:

> On Sat, 2013-01-05 at 22:11 +0000, sebb wrote:
>> On 5 January 2013 21:33, vigna <[email protected]> wrote:
>>>> But why would you want a web crawler to have 10-20K simultaneously
>>>> opened connections in the first place?
>>>
>>> (I thought I answered this, but it's not on the archive. Boh.)
>>>
>>> Having a few thousand connections open is the only way to retrieve data
>>> while respecting politeness (e.g., not banging the same site too often).
>>
>> Huh?
>> There are surely other ways to achieve that goal.
>>
> I could not agree more. I personally think that closing idle connections
> and letting the server reclaim the resources associated with them
> (potentially enabling the server to serve other clients) would be more
> 'polite'. It is cheaper for both the client and the server to close
> connections more frequently than keeping them alive just in case.
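For concreteness, idle-connection eviction of the kind Oleg describes might look like the sketch below, patterned on the idle-connection monitor example in the HttpClient documentation. The IdleConnectionEvictor class name and the 5-second sweep / 30-second idle timeout are illustrative assumptions, not anything specified in this thread.

import java.util.concurrent.TimeUnit;

import org.apache.http.conn.ClientConnectionManager;
import org.apache.http.impl.conn.PoolingClientConnectionManager;

// Background daemon that periodically evicts expired and long-idle
// connections from the pool, letting the server reclaim their resources.
public class IdleConnectionEvictor extends Thread {

    private final ClientConnectionManager connMgr;
    private volatile boolean shutdown;

    public IdleConnectionEvictor(ClientConnectionManager connMgr) {
        this.connMgr = connMgr;
        setDaemon(true);
    }

    @Override
    public void run() {
        try {
            while (!shutdown) {
                synchronized (this) {
                    wait(5000); // sweep every 5 seconds (illustrative)
                    // Close connections whose keep-alive timeout has expired.
                    connMgr.closeExpiredConnections();
                    // Close connections that have sat idle for 30+ seconds.
                    connMgr.closeIdleConnections(30, TimeUnit.SECONDS);
                }
            }
        } catch (InterruptedException ex) {
            // Terminate quietly.
        }
    }

    public void shutdown() {
        shutdown = true;
        synchronized (this) {
            notifyAll();
        }
    }

    public static void main(String[] args) {
        PoolingClientConnectionManager cm = new PoolingClientConnectionManager();
        cm.setMaxTotal(200);         // pool-wide cap (illustrative)
        cm.setDefaultMaxPerRoute(5); // per-host cap (illustrative)
        new IdleConnectionEvictor(cm).start();
    }
}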
Just to clarify, for our web crawl we were using a connection pool and letting
idle connections be reclaimed. But we were also doing small batches of URLs
(e.g. 5 at a time) when hitting the same server, keeping the connection open.
This was an attempt to balance the cost to the target server of establishing a
new connection against being polite.

For typical web sites this feels like a win, but low-traffic sites that have
complex pages generated by JSP code (for example) could be unhappy.

I know that Heritrix uses a strategy of varying its crawl delay based on the
response time of the server, which could be a better approach to constraining
the number of keep-alive requests; a rough sketch of that idea is below.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
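Following up on that Heritrix point: the adaptive delay is straightforward to
sketch. The AdaptiveDelay helper below is hypothetical (it mirrors the spirit
of Heritrix's delay-factor / min-delay / max-delay politeness settings, but is
not Heritrix code), and the constants in the main() method are illustrative.

public class AdaptiveDelay {

    private final double delayFactor; // wait this multiple of the fetch time
    private final long minDelayMs;    // politeness floor
    private final long maxDelayMs;    // ceiling, so slow hosts aren't starved

    public AdaptiveDelay(double delayFactor, long minDelayMs, long maxDelayMs) {
        this.delayFactor = delayFactor;
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    // Compute how long to wait before the next request to the same host.
    // A slow (busy, or JSP-heavy) server yields a longer delay; a fast
    // server gets re-polled sooner, with no per-site configuration needed.
    public long nextDelayMs(long lastFetchDurationMs) {
        long delay = (long) (lastFetchDurationMs * delayFactor);
        return Math.min(maxDelayMs, Math.max(minDelayMs, delay));
    }

    public static void main(String[] args) {
        // Illustrative settings: wait 5x the last fetch time, clamped
        // to between 2 and 30 seconds.
        AdaptiveDelay politeness = new AdaptiveDelay(5.0, 2000, 30000);
        System.out.println(politeness.nextDelayMs(2000)); // slow page -> 10000
        System.out.println(politeness.nextDelayMs(100));  // fast page -> 2000
    }
}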
