[
https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517417#comment-16517417
]
Steve Loughran commented on HADOOP-15541:
-----------------------------------------
Stream draining means the HTTP/1.1 connection can be returned to the pool and so
save setup costs, which is why we like to do it on close().
But here, if we can conclude that the connection is in trouble, should we
return it at all?
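Here is a minimal sketch of that trade-off, assuming a hypothetical wrapper
class (this is not the actual S3AInputStream code; names like markSuspect are
illustrative):

{code:java}
import java.io.IOException;
import java.io.InputStream;

class PooledHttpStream {
    private final InputStream wrapped;
    private boolean suspect = false;  // set when a read has already failed

    PooledHttpStream(InputStream wrapped) {
        this.wrapped = wrapped;
    }

    void markSuspect() {
        suspect = true;
    }

    void close() throws IOException {
        if (suspect) {
            abort();  // don't hand a bad connection back to the pool
            return;
        }
        // Drain any remaining bytes so the HTTP/1.1 connection can be
        // returned to the pool and reused, saving connection setup costs.
        byte[] buf = new byte[8192];
        while (wrapped.read(buf) >= 0) {
            // discard
        }
        wrapped.close();
    }

    private void abort() {
        // In the real SDK this would be S3ObjectInputStream.abort(), which
        // tears the connection down instead of draining it.
        try {
            wrapped.close();
        } catch (IOException ignored) {
            // best-effort teardown
        }
    }
}
{code}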
No objection to doing the abort for IOEs and SDK exceptions. I was suggesting
the arg because the reopen code already takes that param...requesting a forced
abort after an exception on read() would be good.
Though: are you suggesting that for any IOE/SDK exception we don't try to
reopen the call, and instead force the abort() before rethrowing the exception?
If so, yes, that also makes sense. We don't want a failing HTTP connection to
be recycled.
Make sure any metrics on forced aborts are incremented, though.
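Roughly what I have in mind, as a hedged sketch (onReadFailure, reopen and the
counter are illustrative names, not the actual Hadoop code):

{code:java}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

class ReadFailureHandler {
    private final AtomicLong forcedAborts = new AtomicLong();

    void onReadFailure(IOException cause, boolean forceAbort) throws IOException {
        if (forceAbort) {
            forcedAborts.incrementAndGet();  // keep the forced-abort metric accurate
            abortStream();                   // never recycle the failing connection
            throw cause;                     // rethrow to the caller
        }
        reopen();  // normal path: re-establish the stream and retry
    }

    private void abortStream() {
        // abort the underlying HTTP connection without draining it
    }

    private void reopen() throws IOException {
        // re-GET the object at the current position
    }
}
{code}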
> AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
> -------------------------------------------------------------------------
>
> Key: HADOOP-15541
> URL: https://issues.apache.org/jira/browse/HADOOP-15541
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 2.9.1, 2.8.4, 3.0.2, 3.1.1
> Reporter: Sean Mackrory
> Assignee: Sean Mackrory
> Priority: Major
>
> I've gotten a few reports of read timeouts not being handled properly in some
> Impala workloads. What happens is the following sequence of events (credit to
> Sailesh Mukil for figuring this out):
> * S3AInputStream.read() gets a SocketTimeoutException when it calls
> wrappedStream.read()
> * This is handled by onReadFailure -> reopen -> closeStream. When we try to
> drain the stream, SdkFilterInputStream.read() in the AWS SDK fails because of
> checkLength. The underlying Apache Commons stream returns -1 both in the case
> of a timeout and at EOF.
> * The SDK assumes the -1 signifies an EOF, so it assumes the bytes read must
> equal the expected bytes, and because they don't (because it's a timeout and
> not an EOF) it throws an SdkClientException (see the sketch after this list).
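> A minimal sketch of that ambiguity (hypothetical names, not the SDK source):
> the length check cannot tell a timeout's -1 from a genuine EOF's -1.
>
> {code:java}
> class LengthCheckSketch {
>     private long bytesRead;
>     private final long expectedLength;
>
>     LengthCheckSketch(long expectedLength) {
>         this.expectedLength = expectedLength;
>     }
>
>     // Called with the result of each wrapped read(); mirrors the idea of
>     // the SDK's checkLength without depending on SDK internals.
>     int onRead(int result) {
>         if (result >= 0) {
>             bytesRead += result;
>             return result;
>         }
>         // result == -1: either a real EOF, or the underlying HTTP stream
>         // timing out. The two are indistinguishable here.
>         if (bytesRead != expectedLength) {
>             // Stands in for SdkClientException: on a timeout this is the
>             // spurious failure described above.
>             throw new RuntimeException("Data read (" + bytesRead
>                 + ") does not match expected length (" + expectedLength + ")");
>         }
>         return -1;  // genuine EOF with all bytes accounted for
>     }
> }
> {code}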
> This is tricky to test for without a ton of mocking of AWS SDK internals,
> because you have to get into this conflicting state where the SDK has only
> read a subset of the expected bytes and gets a -1.
> closeStream will abort the stream in the event of an IOException when
> draining. We could simply also abort in the event of an SdkClientException,
> as sketched below.
> I'm testing that this results in correct functionality in the workloads that
> seem to hit these timeouts a lot; meanwhile, all the s3a tests continue to
> pass with that change. I'm going to open an issue on the AWS SDK GitHub as
> well, but I'm not sure what the ideal outcome would be, unless there's a good
> way to distinguish between a stream that has timed out and a stream that has
> read all the data, without huge rewrites.
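> A hedged sketch of that change (closeStream here is illustrative, not the
> exact Hadoop method): treat a runtime exception while draining, such as the
> SDK's SdkClientException, the same way as an IOException and abort.
>
> {code:java}
> import java.io.IOException;
> import java.io.InputStream;
>
> class CloseStreamSketch {
>     void closeStream(InputStream wrapped) {
>         try {
>             // Drain so the pooled connection can be reused.
>             byte[] buf = new byte[8192];
>             while (wrapped.read(buf) >= 0) {
>                 // discard
>             }
>             wrapped.close();
>         } catch (IOException | RuntimeException e) {
>             // RuntimeException covers SdkClientException without adding an
>             // SDK dependency here; the real patch would catch it explicitly.
>             abort(wrapped);
>         }
>     }
>
>     private void abort(InputStream wrapped) {
>         // Tear the connection down rather than returning it to the pool.
>         try {
>             wrapped.close();
>         } catch (IOException ignored) {
>             // best-effort teardown
>         }
>     }
> }
> {code}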
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]