[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538842#comment-16538842 ]
Sean Mackrory commented on HADOOP-15541:
----------------------------------------

Thanks Steve, committed. I'd like to commit this right now to address the known issue. I want to do a bit of searching around and see if I can find any cases of IOExceptions where it would make sense to reuse the stream before taking it further. I'll file a separate JIRA for that before resolving...

> AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-15541
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15541
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.9.1, 2.8.4, 3.0.2, 3.1.1
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>            Priority: Major
>         Attachments: HADOOP-15541.001.patch
>
>
> I've gotten a few reports of read timeouts not being handled properly in some Impala workloads. What happens is the following sequence of events (credit to Sailesh Mukil for figuring this out):
> * S3AInputStream.read() gets a SocketTimeoutException when it calls wrappedStream.read()
> * This is handled by onReadFailure -> reopen -> closeStream. When we try to drain the stream, SdkFilterInputStream.read() in the AWS SDK fails because of checkLength. The underlying Apache Commons stream returns -1 both in the case of a timeout and at EOF.
> * The SDK assumes the -1 signifies an EOF, so it assumes the bytes read must equal the expected bytes, and because they don't (because it's a timeout, not an EOF) it throws an SdkClientException.
> This is tricky to test for without a ton of mocking of AWS SDK internals, because you have to get into this conflicting state where the SDK has only read a subset of the expected bytes and gets a -1.
> closeStream will abort the stream in the event of an IOException when draining. We could simply also abort in the event of an SdkClientException. I'm testing that this results in correct functionality in the workloads that seem to hit these timeouts a lot, and all the s3a tests continue to work with that change. I'm going to open an issue on the AWS SDK GitHub as well, but I'm not sure what the ideal outcome would be unless there's a good way to distinguish between a stream that has timed out and a stream that has read all the data, without huge rewrites.
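For illustration, here is a minimal sketch of the abort-on-SdkClientException idea described in the quoted issue: if draining the stream during a clean close fails, fall back to aborting the connection, treating SdkClientException the same as IOException. The class and field names below are hypothetical stand-ins, not the actual HADOOP-15541 patch; the real S3AInputStream also tracks readahead, content ranges, and statistics.

    // Hypothetical sketch only -- not the actual HADOOP-15541 patch.
    import java.io.IOException;

    import com.amazonaws.SdkClientException;
    import com.amazonaws.services.s3.model.S3ObjectInputStream;

    class StreamCloser {

      private S3ObjectInputStream wrappedStream;

      void closeStream(boolean forceAbort) {
        if (wrappedStream == null) {
          return;
        }
        boolean shouldAbort = forceAbort;
        if (!shouldAbort) {
          try {
            // A clean close() drains any remaining bytes so the HTTP
            // connection can be reused. On a socket timeout the wrapped
            // stream returns -1 mid-drain; the SDK's checkLength() treats
            // that as a short read and throws SdkClientException.
            wrappedStream.close();
          } catch (IOException | SdkClientException e) {
            // Either way the connection is no longer trustworthy:
            // abort it rather than try to reuse it.
            shouldAbort = true;
          }
        }
        if (shouldAbort) {
          // abort() discards the underlying connection without draining.
          wrappedStream.abort();
        }
        wrappedStream = null;
      }
    }

The multi-catch is the interesting part: SdkClientException is unchecked (it extends RuntimeException via AmazonClientException), so without it the exception would propagate out of close() and past the drain-failure handling entirely.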