On Sun, 2013-01-06 at 15:48 -0800, Ken Krugler wrote:
> Hi Oleg,
>
> [snip]
>
> > Ken,
> >
> > You might want to have a look at the latest code in SVN trunk (to be
> > released as 4.3). Several classes such as the scheme registry that
> > previously had to be synchronized in order to ensure thread safety have
> > been replaced with immutable equivalents. There is also now a way to
> > create HttpClient in a minimal configuration without authentication,
> > state management (cookies), proxy support and other non-essential
> > functions.
>
> That sounds interesting - any hints as to how to create this minimal
> HttpClient?
>
The new API is not yet final and not properly documented. Presently this
can be done with HttpClients#createMinimal

> > These functions are not merely disabled but physically
> > removed from the processing pipeline, which should result in somewhat
> > better performance in high thread contention scenarios, as the only
> > synchronization point involved in request execution would be the lock of
> > the connection pool. Minimal HttpClient may be particularly useful for
> > anonymous web crawling when authentication and state management are not
> > required.
> >
> >
> >> 3. Global lock on connection pool
> >>
> >> Oleg had written:
> >>
> >>> Yes, your observation is correct. The problem is that the connection
> >>> pool is guarded by a global lock. Naturally if you have 400 threads
> >>> trying to obtain a connection at about the same time all of them end up
> >>> contending for one lock. The problem is that I can't think of a
> >>> different way to ensure the max limits (per route and total) are
> >>> guaranteed not to be exceeded. If anyone can think of a better algorithm
> >>> please do let me know. What might be a possibility is creating a more
> >>> lenient and less prone to lock contention issues implementation that may
> >>> under stress occasionally allocate a few more connections than the max
> >>> limits.
> >>
> >> I don't know if this has been resolved. My work-around from a few years
> >> ago was to rely on having multiple Hadoop reducers running on the server
> >> (each in their own JVM), where I could then limit each JVM to at most 300
> >> connections.
> >>
> >
> > I experimented with the idea of lock-less (unlimited) connection manager
> > but in my tests it did not perform any better than the standard
> > connection manager.
>
> Previously I'd asked:
>
> > Would it work to go for finer-grained locking, by using atomic counters to
> > track & enforce limits on per route/total connections?
>
> Any thoughts on that approach? E.g.
> have a map from route to atomic counter,
> and a single atomic counter for total connections?
>

This may be worthwhile to try. However, in theory this should not
perform any better than the approach I took with my experiments. The
main problem is, though, that I do not have a good test framework that
emulates an environment a web crawler is expected to operate in (and
have no justification for building one in my spare time). So, this kind
of effort ideally should be led by an external contributor.

Oleg
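
For anyone who wants to experiment with the atomic counter idea discussed above, here is a minimal sketch of the bookkeeping only: a map from route to atomic counter plus a single atomic counter for the total. It is not HttpClient code and not how the existing connection pool works. The class name RouteLimiter, its methods, and the use of a plain String as the route key are all hypothetical, chosen for illustration. It only counts leases; it does not pool, reuse, or wait for connections.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration: enforces per-route and total limits with
// compare-and-set loops instead of one global lock.
final class RouteLimiter {
    private final int maxTotal;
    private final int maxPerRoute;
    private final AtomicInteger total = new AtomicInteger();
    private final ConcurrentHashMap<String, AtomicInteger> perRoute =
            new ConcurrentHashMap<String, AtomicInteger>();

    RouteLimiter(int maxTotal, int maxPerRoute) {
        this.maxTotal = maxTotal;
        this.maxPerRoute = maxPerRoute;
    }

    // Atomically increments the counter unless it has already reached max.
    private static boolean incrementIfBelow(AtomicInteger counter, int max) {
        while (true) {
            int current = counter.get();
            if (current >= max) {
                return false;
            }
            if (counter.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Returns true if a slot was reserved for the route, false if either
    // the per-route or the total limit is currently reached.
    boolean tryAcquire(String route) {
        AtomicInteger routeCount = perRoute.get(route);
        if (routeCount == null) {
            AtomicInteger fresh = new AtomicInteger();
            AtomicInteger existing = perRoute.putIfAbsent(route, fresh);
            routeCount = (existing != null) ? existing : fresh;
        }
        if (!incrementIfBelow(routeCount, maxPerRoute)) {
            return false;
        }
        if (!incrementIfBelow(total, maxTotal)) {
            // Total limit reached: give back the per-route slot.
            routeCount.decrementAndGet();
            return false;
        }
        return true;
    }

    // Must be called exactly once for each successful tryAcquire.
    void release(String route) {
        AtomicInteger routeCount = perRoute.get(route);
        if (routeCount == null) {
            throw new IllegalStateException("unknown route: " + route);
        }
        routeCount.decrementAndGet();
        total.decrementAndGet();
    }

    int totalInUse() {
        return total.get();
    }
}
```

With this ordering neither limit is ever exceeded, but a thread may occasionally be refused a route slot that another thread holds only briefly before rolling back. Whether this performs any better than the global lock under crawler-like load is exactly the open question in this thread.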
