On Jan 5, 2013, at 3:31pm, vigna wrote:

> On 5 Jan 2013, at 3:10 PM, Ken Krugler <[email protected]> wrote:
> 
>> So on a large box (e.g. 24 or more powerful cores) I could see using upward
>> of 10K threads being the optimal number.
> 
> We are working to make 20-30K connections work on 64 cores.
> 
>> Just FYI, about two years ago we were using big servers with lots of threads
>> during a large-scale web crawl, and we did run into interesting bottlenecks
>> in HttpClient 4.0.1 (?) with lots of simultaneous threads. I haven't had to
>> revisit those issues with a recent release, so maybe those have been resolved.
> 
> 
> Can you elaborate on that? I guess it would be priceless knowledge :).

1. CookieStore access

> For example, during a Bixo crawl with 300 threads, I was doing regular thread 
> dumps and inspecting the results. A very high percentage (typically > 1/3) 
> were blocked while waiting to get access to the cookie store. By default 
> there's only one of these per HttpClient.
> 
> This one was fairly easy to work around, by creating a cookie store in the 
> local context for each request:
> 
>            CookieStore cookieStore = new BasicCookieStore();
>            localContext.setAttribute(ClientContext.COOKIE_STORE, cookieStore);
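
For reference, here's roughly how that looks in the fetch loop (a minimal sketch
assuming the HttpClient 4.1-era API; the helper name and the httpClient/url
parameters are just illustrative):

    import java.io.IOException;

    import org.apache.http.HttpResponse;
    import org.apache.http.client.CookieStore;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.client.protocol.ClientContext;
    import org.apache.http.impl.client.BasicCookieStore;
    import org.apache.http.protocol.BasicHttpContext;
    import org.apache.http.protocol.HttpContext;

    // Fetch one URL with its own cookie store, so fetcher threads never
    // contend for the single client-wide cookie store.
    public HttpResponse fetchWithLocalCookies(HttpClient httpClient, String url)
            throws IOException {
        HttpContext localContext = new BasicHttpContext();
        CookieStore cookieStore = new BasicCookieStore();
        localContext.setAttribute(ClientContext.COOKIE_STORE, cookieStore);
        return httpClient.execute(new HttpGet(url), localContext);
    }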

2. Scheme registry

> But I've run into a few other synchronized method/data bottlenecks, which I'm
> still working through. For example, at irregular intervals the bulk of my
> fetcher threads are blocked on getting the scheme registry.

I believe this one has been fixed via the patch for
https://issues.apache.org/jira/browse/HTTPCLIENT-903, and the fix is in the
current release of HttpClient.

3. Global lock on connection pool

Oleg had written:

> Yes, your observation is correct. The problem is that the connection
> pool is guarded by a global lock. Naturally if you have 400 threads
> trying to obtain a connection at about the same time all of them end up
> contending for one lock. The problem is that I can't think of a
> different way to ensure the max limits (per route and total) are
> guaranteed not to be exceeded. If anyone can think of a better algorithm
> please do let me know. One possibility might be a more lenient
> implementation, less prone to lock contention, that may under stress
> occasionally allocate a few more connections than the max limits.

I don't know if this has been resolved. My work-around from a few years ago was
to rely on having multiple Hadoop reducers running on the server (each in its
own JVM), where I could then limit each JVM to at most 300 connections.
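
For the single-JVM case, the 300-connection cap itself is just the standard
connection manager limits (again a sketch, assuming the HttpClient 4.1-era
ThreadSafeClientConnManager; the per-route value of 50 is illustrative):

    import org.apache.http.client.HttpClient;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager;

    // One pool per JVM, capped so that pool-lock contention (and load on
    // any single target host) stays bounded.
    ThreadSafeClientConnManager connManager = new ThreadSafeClientConnManager();
    connManager.setMaxTotal(300);           // total connections across all routes
    connManager.setDefaultMaxPerRoute(50);  // connections per target host

    HttpClient httpClient = new DefaultHttpClient(connManager);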

HTH,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr