[
https://issues.apache.org/jira/browse/HTTPCORE-195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461288#comment-13461288
]
Ian Blavins commented on HTTPCORE-195:
--------------------------------------
G'day,

I was experiencing the same problem as the originator of this issue. Since I
was running later code, I was actually getting the new TruncatedChunkException
that was added as a result of this issue.

It turned out I was experiencing the problem because I was closing the
connection to the web server while I was still reading the chunk. I suspect
that is why the originator was having his problems. I suggest the reason "...
we only crawled 20 websites before we started running into this problem." was
that the first 19 didn't use chunked output, and the reason "We are frequently
encountering this issue" was that there are plenty of sites that do chunk.

Note that a chunked site could still be processed without error if the
connection close happened to come after the chunk read(s) had completed. So the
fact that some chunked sites were processed without error wouldn't necessarily
disprove the suggestion. I would expect some chunked sites to give the problem
reliably and others to give it only some of the time.
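
To illustrate the ordering that avoids the race described above, here is a
minimal sketch in plain Java. It deliberately uses only java.io so that it runs
without HttpCore on the classpath. The class DrainBeforeClose and the method
readFullyThenClose are hypothetical names, and the Closeable stands in for
whatever owns the underlying connection. The only point is that the body is
read to end of stream before the connection is closed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpCore or HttpClient.
class DrainBeforeClose {

    // Reads the body until end of stream, then closes the connection.
    // Closing happens in finally, so it always comes after the last read.
    static byte[] readFullyThenClose(InputStream body, Closeable connection)
            throws IOException {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = body.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            connection.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for a chunked response body.
        InputStream body = new ByteArrayInputStream(
                "chunked body".getBytes(StandardCharsets.US_ASCII));
        Closeable connection = () -> System.out.println("connection closed");
        byte[] content = readFullyThenClose(body, connection);
        System.out.println(content.length + " bytes read");
    }
}
```

With a real client, the same ordering would mean consuming the response entity
completely before releasing or shutting down the connection.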

That being said, I didn't find the TruncatedChunkException to be much help
because I was working at the HttpResponse and HttpClient level. By the time the
exception reached that level, it was far too late to do anything useful about
it. For the exception to be useful at that level, there would need to be a
parameter in CoreConnectionPNames. Callers of ChunkedInputStream would use it to
decide whether to treat TruncatedChunkException as fatal or as end of file.
There is already a parameter that deals with buffering of small chunks, so
users of ChunkedInputStream appear to have access to the parameters.
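
As a rough illustration of the fatal-versus-end-of-file choice suggested above,
here is a minimal sketch in plain Java. It is not HttpCore code and does not
show an existing parameter. The TruncatedChunkException class below is a local
stand-in for org.apache.http.TruncatedChunkException so the example compiles on
its own. TruncationTolerantInputStream, its lenient flag, TruncatedSource and
TruncatedChunkDemo are all hypothetical names. The lenient flag plays the role
that the proposed CoreConnectionPNames parameter would play.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Local stand-in for org.apache.http.TruncatedChunkException (assumption:
// used here only so the sketch compiles without HttpCore).
class TruncatedChunkException extends IOException {
    TruncatedChunkException(String message) { super(message); }
}

// Hypothetical wrapper: when lenient, a truncated chunk is reported as end of
// stream instead of an error, so content read so far is not thrown away.
class TruncationTolerantInputStream extends FilterInputStream {
    private final boolean lenient;
    private boolean truncated;

    TruncationTolerantInputStream(InputStream in, boolean lenient) {
        super(in);
        this.lenient = lenient;
    }

    @Override
    public int read() throws IOException {
        if (truncated) return -1;
        try {
            return super.read();
        } catch (TruncatedChunkException e) {
            if (!lenient) throw e;   // strict mode: truncation stays fatal
            truncated = true;        // lenient mode: remember and signal EOF
            return -1;
        }
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (truncated) return -1;
        try {
            return super.read(b, off, len);
        } catch (TruncatedChunkException e) {
            if (!lenient) throw e;
            truncated = true;
            return -1;
        }
    }

    boolean wasTruncated() { return truncated; }
}

// Simulated body: delivers its bytes, then fails as a truncated chunk would.
class TruncatedSource extends InputStream {
    private final byte[] data;
    private int pos;

    TruncatedSource(byte[] data) { this.data = data; }

    @Override
    public int read() throws IOException {
        if (pos >= data.length) throw new TruncatedChunkException("Truncated chunk");
        return data[pos++] & 0xff;
    }
}

class TruncatedChunkDemo {
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8];
        int n;
        while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] partial = "partial body".getBytes(StandardCharsets.US_ASCII);
        TruncationTolerantInputStream in =
                new TruncationTolerantInputStream(new TruncatedSource(partial), true);
        byte[] got = readAll(in);
        System.out.println(new String(got, StandardCharsets.US_ASCII)
                + " truncated=" + in.wasTruncated());
    }
}
```

The wasTruncated() flag lets a caller at the HttpResponse level keep the partial
content and still log that the stream ended early.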
> Make it possible to tolerate truncated chunk streams
> ----------------------------------------------------
>
> Key: HTTPCORE-195
> URL: https://issues.apache.org/jira/browse/HTTPCORE-195
> Project: HttpComponents HttpCore
> Issue Type: Improvement
> Components: HttpCore NIO
> Affects Versions: 4.0
> Reporter: Patrick Moore
> Priority: Minor
> Fix For: 4.1-alpha1
>
> Attachments: chunkValidationDecoupling.patch, HTTPCORE-195.patch
>
>
> Our server is webcrawling.
> We are frequently encountering this issue. We think this might be related to
> something on the server that we are scanning. But that doesn't matter. We
> need to handle such cases without exceptions. (From my perspective, such
> things should generate a debug message -- certainly not an exception that
> ends processing and throws away the retrieved content! )
> http://stuftpizza.com/ seems to reliably result in this problem
> May be TransferEncoding?
> http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6
> Either way we need to be able to deal with issues on the other servers.
> {{{
> Date Mon, 20 Apr 2009 03:56:45 GMT
> Server Apache/2.2.3 (Red Hat)
> Accept-Ranges bytes
> Connection close
> Transfer-Encoding chunked
> Content-Type text/html
> '''Request Headers'''
> Host stuftpizza.com
> User-Agent Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.8) Gecko/2009032608 Firefox/3.0.8
> Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language en-us,en;q=0.5
> Accept-Encoding gzip,deflate
> Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
> Keep-Alive 300
> Connection keep-alive
> Cookie __utma=47358053.1237981682.1240199754.1240199754.1240199754.1; __utmb=47358053; __utmc=47358053; __utmz=47358053.1240199754.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none)
> Cache-Control max-age=0
> }}}
> {{{
> 20:51:08,768 INFO [nioEventListener] Request http://stuftpizza.com/ failed with exception.
> org.apache.http.MalformedChunkCodingException: Truncated chunk
> at org.apache.http.impl.nio.codecs.ChunkDecoder.read(ChunkDecoder.java:203)
> at org.apache.http.nio.util.SimpleInputBuffer.consumeContent(SimpleInputBuffer.java:60)
> at org.apache.http.nio.entity.BufferingNHttpEntity.consumeContent(BufferingNHttpEntity.java:72)
> at org.apache.http.nio.protocol.AsyncNHttpClientHandler.inputReady(AsyncNHttpClientHandler.java:236)
> at org.apache.http.nio.protocol.BufferingHttpClientHandler.inputReady(BufferingHttpClientHandler.java:118)
> at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:178)
> at org.apache.http.impl.nio.DefaultClientIOEventDispatch.inputReady(DefaultClientIOEventDispatch.java:146)
> at com.amplafi.iomanagement.http.UniversalIOEventDispatch.inputReady(UniversalIOEventDispatch.java:133)
> at $IOEventDispatch_120c19cd1c7.inputReady($IOEventDispatch_120c19cd1c7.java)
> at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:153)
> at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:314)
> at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:294)
> at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:256)
> at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:96)
> at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:556)
> at java.lang.Thread.run(Thread.java:637)
> }}}