On Jan 5, 2013, at 3:36pm, Oleg Kalnichevski wrote:

> On Sat, 2013-01-05 at 22:11 +0000, sebb wrote:
>> On 5 January 2013 21:33, vigna <[email protected]> wrote:
>>>> But why would you want a web crawler to have 10-20K simultaneously
>>>> opened connections in the first place?
>>>
>>> (I thought I answered this, but it's not on the archive. Boh.)
>>>
>>> Having a few thousand connections open is the only way to retrieve data
>>> while respecting politeness (e.g., not banging the same site too often).
>>
>> Huh?
>> There are surely other ways to achieve that goal.
>>
> I could not agree more. I personally think that closing idle connections
> and letting the server reclaim the resources associated with them
> (potentially enabling the server to serve other clients) would be more
> 'polite'. It is cheaper for both the client and the server to close
> connections more frequently than keeping them alive just in case.
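For concreteness, idle-connection eviction of the kind Oleg describes might look like the sketch below, patterned on the idle-connection monitor example in the HttpClient documentation. The IdleConnectionEvictor class name and the 5-second sweep / 30-second idle timeout are illustrative assumptions, not anything specified in this thread.

import java.util.concurrent.TimeUnit;

import org.apache.http.conn.ClientConnectionManager;
import org.apache.http.impl.conn.PoolingClientConnectionManager;

// Background daemon that periodically evicts expired and long-idle
// connections from the pool, letting the server reclaim their resources.
public class IdleConnectionEvictor extends Thread {

    private final ClientConnectionManager connMgr;
    private volatile boolean shutdown;

    public IdleConnectionEvictor(ClientConnectionManager connMgr) {
        this.connMgr = connMgr;
        setDaemon(true);
    }

    @Override
    public void run() {
        try {
            while (!shutdown) {
                synchronized (this) {
                    wait(5000); // sweep every 5 seconds (illustrative)
                    // Close connections whose keep-alive timeout has expired.
                    connMgr.closeExpiredConnections();
                    // Close connections that have sat idle for 30+ seconds.
                    connMgr.closeIdleConnections(30, TimeUnit.SECONDS);
                }
            }
        } catch (InterruptedException ex) {
            // Terminate quietly.
        }
    }

    public void shutdown() {
        shutdown = true;
        synchronized (this) {
            notifyAll();
        }
    }

    public static void main(String[] args) {
        PoolingClientConnectionManager cm = new PoolingClientConnectionManager();
        cm.setMaxTotal(200);         // pool-wide cap (illustrative)
        cm.setDefaultMaxPerRoute(5); // per-host cap (illustrative)
        new IdleConnectionEvictor(cm).start();
    }
}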
Just to clarify, for our web crawl we were using a connection pool and letting
idle connections be reclaimed. But we were also doing small batches of URLs
(e.g. 5 at a time) when hitting the same server, keeping the connection open.
This was an attempt to balance the cost to the target server of establishing a
new connection against being polite.

For typical web sites this feels like a win, but low-traffic sites that have
complex pages generated by JSP code (for example) could be unhappy.

I know that Heritrix uses a strategy of varying its crawl delay based on the
response time of the server, which could be a better approach to constraining
the number of keep-alive requests; a rough sketch of that idea is below.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
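Following up on that Heritrix point: the adaptive delay is straightforward to
sketch. The AdaptiveDelay helper below is hypothetical (it mirrors the spirit
of Heritrix's delay-factor / min-delay / max-delay politeness settings, but is
not Heritrix code), and the constants in the main() method are illustrative.

public class AdaptiveDelay {

    private final double delayFactor; // wait this multiple of the fetch time
    private final long minDelayMs;    // politeness floor
    private final long maxDelayMs;    // ceiling, so slow hosts aren't starved

    public AdaptiveDelay(double delayFactor, long minDelayMs, long maxDelayMs) {
        this.delayFactor = delayFactor;
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    // Compute how long to wait before the next request to the same host.
    // A slow (busy, or JSP-heavy) server yields a longer delay; a fast
    // server gets re-polled sooner, with no per-site configuration needed.
    public long nextDelayMs(long lastFetchDurationMs) {
        long delay = (long) (lastFetchDurationMs * delayFactor);
        return Math.min(maxDelayMs, Math.max(minDelayMs, delay));
    }

    public static void main(String[] args) {
        // Illustrative settings: wait 5x the last fetch time, clamped
        // to between 2 and 30 seconds.
        AdaptiveDelay politeness = new AdaptiveDelay(5.0, 2000, 30000);
        System.out.println(politeness.nextDelayMs(2000)); // slow page -> 10000
        System.out.println(politeness.nextDelayMs(100));  // fast page -> 2000
    }
}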
