In case anyone else is using HttpClient for a multi-threaded crawler,
here is the solution that seems to solve all the problems in this
discussion:

Don't use the MultiThreadedHttpConnectionManager.  A crawler needs to
bail out when a response body reaches a limit you define (mine is
100k), and the only way to break the connection is to call
HttpMethod.abort.  Unfortunately, an aborted HttpConnection cannot be
safely returned to the connection manager's pool.  Instead, I got
pretty good performance by creating a new HttpClient (simple
constructor: new HttpClient()) for each thread and using it for 1,000
requests, at which point I discard it and create a new one.  I'm sure
this doesn't perform as well as the multi-threaded manager, but it ran
all night for me with no exceptions and no memory leaks, and pulled
down 2 million sites in about 12 hours (running 100 threads).  Not bad.
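
For anyone who wants to try the same thing, here is a rough sketch of the
two pieces described above, written against the plain JDK so it runs
without HttpClient on the classpath. All names in it (CrawlerSketch,
readCapped, CappedBody, Recycler) are made up for illustration and are not
part of HttpClient. The onLimit callback is where a real crawler would
presumably call method.abort(), and the Recycler factory would be
HttpClient::new. This is an assumption-laden outline, not the exact code I
ran.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Supplier;

// Hypothetical helper class; none of these names come from HttpClient.
final class CrawlerSketch {

    /** Result of a capped read: the bytes kept, and whether the cap was hit. */
    static final class CappedBody {
        final byte[] bytes;
        final boolean truncated;

        CappedBody(byte[] bytes, boolean truncated) {
            this.bytes = bytes;
            this.truncated = truncated;
        }
    }

    /**
     * Reads at most limit bytes from in. If the stream still has data once
     * the cap is reached, runs onLimit (assumed to be method::abort in a
     * real crawler) before returning, so the abort happens before the
     * caller closes the stream.
     */
    static CappedBody readCapped(InputStream in, int limit, Runnable onLimit)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (out.size() < limit) {
            int n = in.read(buf, 0, Math.min(buf.length, limit - out.size()));
            if (n == -1) {
                // Body ended before the cap: nothing to abort.
                return new CappedBody(out.toByteArray(), false);
            }
            out.write(buf, 0, n);
        }
        // Cap reached: probe one more byte to see whether the body goes on.
        boolean more = in.read() != -1;
        if (more) {
            onLimit.run();
        }
        return new CappedBody(out.toByteArray(), more);
    }

    /**
     * Hands out the same instance for maxUses calls, then replaces it with
     * a fresh one from the factory. Not thread-safe: keep one per thread.
     */
    static final class Recycler<T> {
        private final Supplier<T> factory;
        private final int maxUses;
        private T current;
        private int uses;

        Recycler(Supplier<T> factory, int maxUses) {
            this.factory = factory;
            this.maxUses = maxUses;
        }

        T get() {
            if (current == null || uses >= maxUses) {
                current = factory.get(); // the old instance is simply dropped
                uses = 0;
            }
            uses++;
            return current;
        }
    }

    // One recycler per crawler thread. Object stands in for HttpClient here;
    // with HttpClient available the factory would be HttpClient::new.
    static final ThreadLocal<Recycler<Object>> CLIENTS =
            ThreadLocal.withInitial(() -> new Recycler<>(Object::new, 1000));

    public static void main(String[] args) throws IOException {
        byte[] oversized = new byte[300_000]; // stands in for a runaway response
        boolean[] aborted = {false};
        CappedBody body = readCapped(
                new ByteArrayInputStream(oversized), 100 * 1024, () -> aborted[0] = true);
        System.out.println(body.bytes.length + " bytes kept, truncated="
                + body.truncated + ", aborted=" + aborted[0]);
    }
}
```

Each worker thread would call CLIENTS.get().get() before every request, so
no client is ever shared between threads and each one is replaced after
1,000 uses. The one-byte probe after the cap is a design choice of this
sketch: it avoids aborting a response that happens to be exactly at the
limit.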

On 7/21/05, Tony Spencer <[EMAIL PROTECTED]> wrote:
> Ok, I hope you aren't getting sick of this problem. :)
> 
> HttpMethod.abort does solve the problem of sites that send an infinite
> response.  However, it seems that by calling abort we cannot properly
> release the connection.  I've tried calling method.releaseConnection
> right after abort.
> 
> My usage for HttpClient is a multi-threaded crawler so I've followed
> the suggestions on the threading page
> http://jakarta.apache.org/commons/httpclient/threading.html (nice
> documentation by the way).  So I use the
> MultiThreadedHttpConnectionManager as suggested and reuse the same
> HttpClient over and over as suggested.  After a certain number of
> calls to HttpMethod.abort my HttpClient goes bad (hangs).
> 
> So it appears that abort is too harsh and doesn't allow clean return
> of the client to the pool.  Any more suggestions?
> 
> On 7/21/05, Tony Spencer <[EMAIL PROTECTED]> wrote:
> > Disregard my last message.  Your suggestion did work, Oleg.  Originally
> > I put the abort statement after attempting to close the input stream.
> > Once I moved it in front of the stream close statement it worked fine.
> > Thank you very much.
> >
> > On 7/21/05, Oleg Kalnichevski <[EMAIL PROTECTED]> wrote:
> > > Just call HttpMethod#abort to close the underlying connection
> > >
> > > Oleg
> > >
> > >
> > > On Thu, 2005-07-21 at 16:34 -0400, Tony Spencer wrote:
> > > > Ok, I managed to limit the response to 8k in the following code
> > > > but it doesn't help with what I'm really trying to accomplish.
> > > > Sometimes there is a site that will spew a never-ending response.  This
> > > > causes HttpClient to hang indefinitely.  My code below does not solve
> > > > the problem.  Here is an example of a nasty site that never stops
> > > > sending its response: http://www.tfc-charts.w2d.com/chart/dw/w (beware:
> > > > it may crash your browser if you visit it)
> > > >
> > > >                 InputStream is = method.getResponseBodyAsStream();
> > > >                 BufferedInputStream bis = new BufferedInputStream(is);
> > > >                 byte[] bytes = new byte[ 8192 ];
> > > >                 // read() may return fewer than 8192 bytes, or -1 at end of stream
> > > >                 int n = bis.read(bytes);
> > > >                 bis.close();  // also closes the wrapped stream
> > > >                 ret = new String(bytes, 0, Math.max(n, 0));
> > > >
> > > >
> > > > On 7/21/05, Tony Spencer <[EMAIL PROTECTED]> wrote:
> > > > > I'd like to limit the size of the response but don't know how.  For
> > > > > instance, if the response body is greater than 100k I would like to
> > > > > close the connection to the site.  How can I go about doing this?  I
> > > > > see the available method param : BUFFER_WARN_TRIGGER_LIMIT but it only
> > > > > seems to control warning logging.
> > > > >
> > > > > Currently I receive the response body like so:
> > > > > byte[] bytes = method.getResponseBody();
> > > > >
> > > > > Any help greatly appreciated.
> > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
