On Sep 25, 2013, at 3:57am, Sebastiano Vigna wrote:

> On 25 Sep 2013, at 12:21 PM, Oleg Kalnichevski <[email protected]> wrote:
>
>> There are probably two most possibilities here: (1) there is a bug in
>> HttpClient and the socket value does not correctly apply (I have tested
>> such scenario on a number of occasions, so I do not find it likely), (2)
>> the target servers keep on sending data either infinitely or at a very
>> slow rate (in both cases the connection never reaches the level of
>> inactivity for socket timeout to fire).
>
> No, infinitely is impossible because we truncate after 20M.
>
>     public Void handleResponse( HttpResponse response ) throws ClientProtocolException, IOException {
>         FetchData.this.response = response;
>         final HttpEntity entity = response.getEntity();
>
>         if ( entity == null ) LOGGER.warn( "Null entity for URL " + url );
>         else {
>             wrappedEntity.setEntity( entity );
>             truncated = wrappedEntity.copyContent( maxResponseBodyLength );
>             if ( truncated ) httpGet.abort();
>         }
>         return null;
>     }} );
>
> wrappedEntity simply copies maxResponseBodyLength bytes and then exits.
>
> It could be infinitely slow rate, but frankly netstat does not report *any*
> open connection.
>
> Nonetheless, after about four hours, 41 out of 42 connections have exited.
>
> Any suggestion to patch this behaviour? One thing we can do is to track the
> URLs that have caused such stalling connections.

During large-scale web crawling we'd often run into servers that trickled data back, which would create effectively "hung" connections.

Our solution is to loop, reading available data from the InputStream we get via HttpEntity.getContent(). If the read rate drops below a threshold we abort the request.

It would be cleaner to have this built into an instrumented HttpClient - I haven't looked at http://metrics.codahale.com/manual/httpclient/, but it seems interesting.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
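
P.S. For illustration, here is a minimal sketch of the read-rate check described above. It is not our actual crawler code. The class name MinRateReader, the exception type, and the parameters (maxBytes, minBytesPerSec, graceMillis) are all made up for this example, and it only uses plain java.io so it can be tried without HttpClient. In real use the stream would be the one returned by HttpEntity.getContent().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: copies at most maxBytes from a stream and gives up
// if the average transfer rate stays below a floor.
class MinRateReader {

    /** Thrown when the average read rate falls below the configured floor. */
    public static class SlowResponseException extends IOException {
        public SlowResponseException(String message) {
            super(message);
        }
    }

    /**
     * Reads up to maxBytes from in. After graceMillis have elapsed, the
     * average rate since the start is checked after every read; if it is
     * below minBytesPerSec, a SlowResponseException is thrown.
     */
    public static byte[] read(InputStream in, int maxBytes,
                              long minBytesPerSec, long graceMillis)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        long start = System.nanoTime();

        while (out.size() < maxBytes) {
            int wanted = Math.min(buf.length, maxBytes - out.size());
            int n = in.read(buf, 0, wanted);
            if (n < 0) {
                break; // end of stream
            }
            out.write(buf, 0, n);

            long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
            if (elapsedMillis > graceMillis) {
                // Average bytes per second so far; guard against divide-by-zero.
                long rate = out.size() * 1000L / Math.max(1L, elapsedMillis);
                if (rate < minBytesPerSec) {
                    throw new SlowResponseException(
                        "Read rate " + rate + " B/s is below " + minBytesPerSec + " B/s");
                }
            }
        }
        return out.toByteArray();
    }
}
```

The caller would catch SlowResponseException and call httpGet.abort(), the same way the truncation case is handled in the quoted code. Note that the check only runs when a read returns, so a server that sends nothing at all still has to be caught by the socket timeout; this loop covers the trickling case, where each read returns a few bytes and the socket timeout never fires.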
