On Tue, 2012-01-24 at 12:28 -0800, Dvora wrote:
> Hmm, any idea why?
> 
> Anyway, if I may use this thread, can you suggest an optimal architecture
> for crawling using HttpClient?

I am not really qualified to make such recommendations, as I have
personally never used HttpClient for web crawling. However, as far as I
know, there are several open-source web crawler implementations based on
HttpClient which you might consider using instead of writing your own
from scratch.

> What is the best way (besides using lots of
> worker threads, which I do now) to download the maximum number of web
> pages in the minimum time and make better use of the bandwidth (right
> now it never exceeds 2Mb/sec)?
> 

How exactly did you measure that?

When running against a local web service, HttpClient can generate the
highest requests-per-second rate of all the HTTP clients benchmarked
[1]. That makes me doubt that HttpClient is the bottleneck.
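
One way to pin the number down is to count the bytes the worker threads
actually read and divide by wall-clock time. The following is only a
minimal sketch using plain JDK classes; ThroughputMeter and its methods
are hypothetical names for illustration and are not part of HttpClient.
It also assumes that "Mb" means megabits (1,000,000 bits).

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not part of HttpClient): totals bytes read by all worker threads. */
final class ThroughputMeter {
    private final AtomicLong bytes = new AtomicLong();
    private final long startNanos = System.nanoTime();

    /** Call from any worker thread after reading n bytes of a response body. */
    void record(long n) {
        bytes.addAndGet(n);
    }

    /** Total number of bytes recorded so far. */
    long totalBytes() {
        return bytes.get();
    }

    /** Aggregate rate in megabits per second since this meter was created. */
    double currentMegabitsPerSecond() {
        return megabitsPerSecond(bytes.get(), System.nanoTime() - startNanos);
    }

    /** Converts a byte count over an elapsed time into megabits per second. */
    static double megabitsPerSecond(long byteCount, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            return 0.0; // no time has elapsed, so no meaningful rate
        }
        double seconds = elapsedNanos / 1_000_000_000.0;
        return (byteCount * 8.0) / 1_000_000.0 / seconds;
    }
}
```

With this conversion, 250,000 bytes read in one second comes out as
2 Mbit/s, which is only 0.25 MB/s, so it matters which unit a figure
such as "2Mb/sec" refers to and where it was measured.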

Oleg

[1]
http://wiki.apache.org/HttpComponents/HttpClient3vsHttpClient4vsHttpCore

> Thanks.
> 
> 
> 
> olegk wrote:
> > 
> > On Mon, 2012-01-23 at 11:36 -0800, Dvora wrote:
> >> Hi,
> >> 
> >> I would like to code a high-performance web crawler using HttpClient
> >> 4.1.2.
> >> In order to bring the machine to its highest throughput, each crawling
> >> thread creates a DefaultHttpClient with a pool configured as follows
> >> (based on one of the examples):
> >> 
> >> static
> >>    {
> >>            cm = new ThreadSafeClientConnManager();
> >>            cm.setMaxTotal( 50000 );
> >>            cm.setDefaultMaxPerRoute( Integer.MAX_VALUE );
> >> 
> >>            HttpClient client = new DefaultHttpClient();
> >> 
> >>            params = client.getParams();
> >> 
> >>            HttpClientParams.setRedirecting( params, false );
> >>            HttpClientParams.setAuthenticating( params, true );
> >> 
> >>            HttpConnectionParams.setSoTimeout( params, 30000 );
> >>            HttpConnectionParams.setConnectionTimeout( params, 30000 );
> >> 
> >>            IdleConnectionEvictor connEvictor = new IdleConnectionEvictor( cm );
> >> 
> >>            connEvictor.start();
> >>    }
> >> 
> >> When running the application with lots of crawling threads, netstat shows
> >> only 2k TCP connections in the ESTABLISHED state. Is this expected,
> >> considering maxTotal = 50000? Are there other bottlenecks (OS level, etc.)
> >> preventing the application from reaching more than 2k TCP connections?
> >> 
> >> Thanks.
> >> 
> >> 
> > 
> > I personally think this is to be expected. When running performance
> > stress tests with 200 threads and a limit of 200 max connections, I
> > frequently observe HttpClient utilizing significantly fewer connections
> > (~100), never reaching the max limit.
> > 
> > Oleg   
> > 
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> > 
> > 
> > 
> 


