On Tue, 2012-01-24 at 12:28 -0800, Dvora wrote:
> Hmm, any idea why?
>
> Anyway, if I may use this thread, can you suggest an optimal architecture
> for crawling using httpclient?

I am not really qualified to make such recommendations, as I have
personally never used HttpClient for web crawling. However, as far as I
know, there are several open-source web crawler implementations based on
HttpClient that you might consider using instead of writing your own
from scratch.

> What is the best way (beside using lots of
> worker threads, which I do now) to download maximum web pages in minimum
> time, and better utilizing the bandwidth (now it's never crossing the
> 2Mb/sec) ?
>

How exactly did you measure that? When running against a local web
service, HttpClient can generate the highest requests-per-second rate of
all HTTP clients benchmarked [1]. That makes me doubt that HttpClient is
the bottleneck.

Oleg

[1] http://wiki.apache.org/HttpComponents/HttpClient3vsHttpClient4vsHttpCore

> Thanks.
>
>
> olegk wrote:
> >
> > On Mon, 2012-01-23 at 11:36 -0800, Dvora wrote:
> >> Hi,
> >>
> >> I would like to code a high performance web crawler using httpclient
> >> 4.1.2.
> >> In order to bring the machine to highest throughput, each crawling thread
> >> creates a DefaultHttpClient with a pool configured as follows (based on
> >> one of the examples):
> >>
> >> static
> >> {
> >>     cm = new ThreadSafeClientConnManager();
> >>     cm.setMaxTotal( 50000 );
> >>     cm.setDefaultMaxPerRoute( Integer.MAX_VALUE );
> >>
> >>     HttpClient client = new DefaultHttpClient();
> >>
> >>     params = client.getParams();
> >>
> >>     HttpClientParams.setRedirecting( params, false );
> >>     HttpClientParams.setAuthenticating( params, true );
> >>
> >>     HttpConnectionParams.setSoTimeout( params, 30000 );
> >>     HttpConnectionParams.setConnectionTimeout( params, 30000 );
> >>
> >>     IdleConnectionEvictor connEvictor = new IdleConnectionEvictor( cm );
> >>
> >>     connEvictor.start();
> >> }
> >>
> >> When running the application with lots of crawling threads, netstat shows
> >> only 2k tcp connections in status ESTABLISHED. Is this expected
> >> considering maxTotal = 50000? Are there other bottlenecks (OS level,
> >> etc.) blocking the application from reaching more than 2k tcp
> >> connections?
> >>
> >> Thanks.
> >>
> >
> > I personally think this is to be expected. When running performance
> > stress tests with 200 threads and 200 max connections limit I frequently
> > observe HttpClient utilizing significantly fewer connections (~100)
> > never ever reaching the max limit.
> >
> > Oleg
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
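
On the question of how the "2Mb/sec" figure was measured: one possible
approach is to count the bytes consumed by all worker threads and divide
by wall-clock time. The following is only a minimal sketch using the JDK
alone. ThroughputMeter and its methods are hypothetical names and are not
part of HttpClient, and the 25,000-byte "download" is a stand-in for
reading a real response body.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper (not an HttpClient class): aggregates the bytes
// downloaded by all worker threads so that throughput is computed for
// the crawler as a whole rather than per thread.
class ThroughputMeter {
    private final AtomicLong totalBytes = new AtomicLong();

    // Called by a worker after it has fully consumed a response body.
    void record(long bytes) {
        totalBytes.addAndGet(bytes);
    }

    long totalBytes() {
        return totalBytes.get();
    }

    // Megabits per second (Mb/s): bytes * 8 / 1e6, divided by seconds.
    static double megabitsPerSecond(long bytes, long elapsedNanos) {
        return (bytes * 8.0 / 1_000_000.0) / (elapsedNanos / 1_000_000_000.0);
    }

    // Megabytes per second (MB/s): one eighth of the Mb/s figure.
    static double megabytesPerSecond(long bytes, long elapsedNanos) {
        return megabitsPerSecond(bytes, elapsedNanos) / 8.0;
    }

    public static void main(String[] args) throws InterruptedException {
        ThroughputMeter meter = new ThroughputMeter();
        ExecutorService workers = Executors.newFixedThreadPool(8);
        long start = System.nanoTime();
        for (int i = 0; i < 1_000; i++) {
            // Assumption: a real worker would record the number of bytes
            // actually read from the response instead of a constant.
            workers.submit(() -> meter.record(25_000));
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.MINUTES);
        long elapsed = System.nanoTime() - start;
        System.out.printf("%d bytes, %.2f Mb/s, %.2f MB/s%n",
                meter.totalBytes(),
                megabitsPerSecond(meter.totalBytes(), elapsed),
                megabytesPerSecond(meter.totalBytes(), elapsed));
    }
}
```

Reporting both units may help here, since "2Mb/sec" could be read as
either megabits or megabytes per second, and the two differ by a factor
of eight.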
