On Jan 5, 2013, at 3:31pm, vigna wrote:

> On 5 Jan 2013, at 3:10 PM, Ken Krugler <[email protected]> wrote:
>
>> So on a large box (e.g. 24 more powerful cores) I could see using
>> upward of 10K threads being the optimal number.
>
> We are working to make 20-30K connections work on 64 cores.
>
>> Just FYI, about two years ago we were using big servers with lots of
>> threads during a large-scale web crawl, and we did run into interesting
>> bottlenecks in HttpClient 4.0.1 (?) with lots of simultaneous threads.
>> I haven't had to revisit those issues with a recent release, so maybe
>> those have been resolved.
>
> Can you elaborate on that? I guess it would be priceless knowledge :).
1. CookieStore access

> For example, during a Bixo crawl with 300 threads, I was doing regular
> thread dumps and inspecting the results. A very high percentage
> (typically > 1/3) were blocked while waiting to get access to the cookie
> store. By default there's only one of these per HttpClient.
>
> This one was fairly easy to work around, by creating a cookie store in
> the local context for each request:
>
>     CookieStore cookieStore = new BasicCookieStore();
>     localContext.setAttribute(ClientContext.COOKIE_STORE, cookieStore);

2. Scheme registry

> But I've run into a few other synchronized method/data bottlenecks,
> which I'm still working through. For example, at irregular intervals
> the bulk of my fetcher threads are blocked on getting the scheme
> registry.

I believe this one has been fixed via the patch for
https://issues.apache.org/jira/browse/HTTPCLIENT-903, and is in the
current release of HttpClient.

3. Global lock on connection pool

Oleg had written:

> Yes, your observation is correct. The problem is that the connection
> pool is guarded by a global lock. Naturally, if you have 400 threads
> trying to obtain a connection at about the same time, all of them end
> up contending for one lock. The problem is that I can't think of a
> different way to ensure the max limits (per route and total) are
> guaranteed not to be exceeded. If anyone can think of a better
> algorithm, please do let me know. One possibility might be a more
> lenient implementation, less prone to lock contention, that may under
> stress occasionally allocate a few more connections than the max
> limits.

I don't know if this has been resolved. My work-around from a few years
ago was to rely on having multiple Hadoop reducers running on the server
(each in its own JVM), where I could then limit each JVM to at most 300
connections.
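For what it's worth, the "lenient limits" idea Oleg describes can be sketched without any HttpClient internals: replace the pool-wide lock with an atomic counter that admits a lease optimistically and rolls back when the cap was crossed. The raw counter can transiently run past the max under contention, which is exactly the trade-off he mentions. This is purely an illustrative sketch; the `LenientLimiter` class and its method names are mine, not anything in HttpClient.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Lenient connection limiter: admits requests via a single atomic
 * increment instead of a pool-wide lock. A losing thread decrements
 * and reports failure, so the raw counter may transiently exceed
 * maxConnections under contention, but successful leases never do.
 */
public class LenientLimiter {
    private final AtomicInteger inUse = new AtomicInteger(0);
    private final int maxConnections;

    public LenientLimiter(int maxConnections) {
        this.maxConnections = maxConnections;
    }

    /** Try to lease a connection slot; false if the cap is reached. */
    public boolean tryAcquire() {
        int n = inUse.incrementAndGet();   // optimistic: no lock taken
        if (n > maxConnections) {
            inUse.decrementAndGet();       // roll back the overshoot
            return false;
        }
        return true;
    }

    /** Return a previously leased slot. */
    public void release() {
        inUse.decrementAndGet();
    }

    /** Current number of successfully leased slots. */
    public int leased() {
        return inUse.get();
    }
}
```

A per-route limit could be layered on top with one such counter per route, keyed off the target host, without ever serializing all 400 fetcher threads through one monitor.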
HTH,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
