Sean Mackrory created HADOOP-15541:
--------------------------------------
Summary: AWS SDK can mistake stream timeouts for EOF and throw
SdkClientExceptions
Key: HADOOP-15541
URL: https://issues.apache.org/jira/browse/HADOOP-15541
Project: Hadoop Common
Issue Type: Bug
Reporter: Sean Mackrory
Assignee: Sean Mackrory
I've gotten a few reports of read timeouts not being handled properly in some
Impala workloads. What happens is the following sequence of events (credit to
Sailesh Mukil for figuring this out):
* S3AInputStream.read() gets a SocketTimeoutException when it calls
wrappedStream.read()
* This is handled by onReadFailure -> reopen -> closeStream. When we try to
drain the stream, SdkFilterInputStream.read() in the AWS SDK fails because of
checkLength. The underlying Apache Commons stream returns -1 both on a timeout
and on EOF.
* The SDK assumes the -1 signifies an EOF, so it assumes the number of bytes
read must equal the expected number. Because they don't match (it's a timeout,
not an EOF), it throws an SdkClientException.
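The sequence above can be reproduced in miniature without the AWS SDK. This is
only a sketch under assumptions: LengthCheckingStream is a hypothetical stand-in
for the SDK's length check (not the real SDK class), a short ByteArrayInputStream
plays the part of a stream that returns -1 after a timeout, and
IllegalStateException stands in for SdkClientException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for the SDK's length check; not the real SDK class. */
class LengthCheckingStream extends InputStream {
    private final InputStream in;
    private final long expected;
    private long seen;

    LengthCheckingStream(InputStream in, long expected) {
        this.in = in;
        this.expected = expected;
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            seen++;
        } else if (seen != expected) {
            // -1 is taken to mean EOF, so a short count is reported as an error,
            // even if the real cause was a timeout further down the stack.
            throw new IllegalStateException(
                "read " + seen + " bytes but expected " + expected);
        }
        return b;
    }
}

final class TimeoutVsEofDemo {
    /** Drains a stream that delivers `delivered` of `expected` bytes; true if the check fails. */
    static boolean drainFails(int delivered, int expected) {
        InputStream in = new LengthCheckingStream(
            new ByteArrayInputStream(new byte[delivered]), expected);
        try {
            while (in.read() >= 0) {
                // discard bytes, as a drain would
            }
            return false;
        } catch (IllegalStateException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("full read fails:  " + drainFails(4, 4));
        System.out.println("short read fails: " + drainFails(2, 4));
    }
}
```

The wrapper cannot tell why the -1 arrived early, which is the ambiguity
described in the list above.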
This is tricky to test for without a ton of mocking of AWS SDK internals,
because you have to get into this conflicting state where the SDK has only read
a subset of the expected bytes and gets a -1.
closeStream will abort the stream in the event of an IOException when draining.
We could simply also abort in the event of an SdkClientException. I'm still
testing that this results in correct functionality in the workloads that seem
to hit these timeouts a lot, but all the s3a tests continue to pass with that
change.
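A rough sketch of that change, not the actual S3AInputStream code:
AbortableStream, CloseStreamSketch and failingAfter are hypothetical names, and
StandInSdkClientException is a local substitute for
com.amazonaws.SdkClientException (which is an unchecked exception) so that the
example compiles without the SDK on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;

/** Local stand-in for com.amazonaws.SdkClientException, which is unchecked. */
class StandInSdkClientException extends RuntimeException {
    StandInSdkClientException(String msg) {
        super(msg);
    }
}

/** Hypothetical stream with an abort() hook. */
abstract class AbortableStream extends InputStream {
    boolean aborted;

    void abort() {
        aborted = true;
    }
}

final class CloseStreamSketch {
    /** Drains and closes the stream; aborts it if draining fails. Returns true if aborted. */
    static boolean closeStream(AbortableStream stream) {
        boolean shouldAbort = false;
        try {
            while (stream.read() >= 0) {
                // drain the remaining bytes
            }
            stream.close();
        } catch (IOException | StandInSdkClientException e) {
            // Proposed change: treat the SDK's length-check failure like an
            // IOException and fall back to aborting.
            shouldAbort = true;
        }
        if (shouldAbort) {
            stream.abort();
        }
        return shouldAbort;
    }

    /** Hypothetical helper: yields `good` bytes, then throws `failure`, or ends cleanly if it is null. */
    static AbortableStream failingAfter(int good, RuntimeException failure) {
        return new AbortableStream() {
            private int left = good;

            @Override
            public int read() {
                if (left-- > 0) {
                    return 0;
                }
                if (failure != null) {
                    throw failure;
                }
                return -1;
            }
        };
    }

    public static void main(String[] args) {
        AbortableStream broken =
            failingAfter(3, new StandInSdkClientException("short read"));
        System.out.println("aborted: " + closeStream(broken));
    }
}
```

Catching only these two exception types means any other runtime exception still
propagates to the caller.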
I'm going to open an issue on the AWS SDK GitHub as well, but I'm not sure what
the ideal outcome would be unless there's a good way, short of huge rewrites,
to distinguish between a stream that has timed out and a stream that has read
all the data.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)