[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516306#comment-16516306 ]
Sean Mackrory commented on HADOOP-15541:
----------------------------------------

{quote}Like you say, no real point in not aborting here.{quote}
Help me understand, though: when *do* we get a benefit from draining the stream instead of simply aborting?

{quote}Happy for a patch, I don't think we can test this easily so not expecting any tests in the patch...{quote}
Yeah. This was (at the time, anyway) happening pretty repeatedly with a particular workload. I'm hoping that keeps up so I can be fairly confident that the end result here is correct handling of timeouts.

> AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-15541
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15541
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.9.1, 2.8.4, 3.0.2, 3.1.1
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>            Priority: Major
>
> I've gotten a few reports of read timeouts not being handled properly in some
> Impala workloads. What happens is the following sequence of events (credit to
> Sailesh Mukil for figuring this out):
> * S3AInputStream.read() gets a SocketTimeoutException when it calls
> wrappedStream.read()
> * This is handled by onReadFailure -> reopen -> closeStream. When we try to
> drain the stream, SdkFilterInputStream.read() in the AWS SDK fails because of
> checkLength. The underlying Apache Commons stream returns -1 both on a
> timeout and on EOF.
> * The SDK assumes the -1 signifies an EOF, so it assumes the bytes read must
> equal the expected bytes, and because they don't (it's a timeout, not an EOF)
> it throws an SdkClientException.
> This is tricky to test for without a ton of mocking of AWS SDK internals,
> because you have to get into this conflicting state where the SDK has only
> read a subset of the expected bytes and gets a -1.
> closeStream will abort the stream in the event of an IOException when
> draining. We could simply also abort in the event of an SdkClientException.
> I'm testing that this results in correct functionality in the workloads that
> seem to hit these timeouts a lot, but all the s3a tests continue to work with
> that change. I'm going to open an issue with the AWS SDK GitHub as well, but
> I'm not sure what the ideal outcome would be unless there's a good way to
> distinguish between a stream that has timed out and a stream that read all
> the data without huge rewrites.
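
The fallback proposed in the description can be sketched in isolation, without the AWS SDK on the classpath. This is only an illustration under stated assumptions, not the S3AInputStream code: LengthMismatchException is a hypothetical stand-in for SdkClientException, and TruncatedStream, LengthCheckingStream and this closeStream are invented names that merely mimic the behaviour described above (a -1 arriving before the expected byte count, a length check that then throws, and an abort as the fallback).

```java
import java.io.IOException;
import java.io.InputStream;

public class DrainOrAbortSketch {

    /** Hypothetical stand-in for the SDK's unchecked SdkClientException. */
    static class LengthMismatchException extends RuntimeException {
        LengthMismatchException(String msg) { super(msg); }
    }

    /** Simulates a stream that returns -1 early, as after a timeout. */
    static class TruncatedStream extends InputStream {
        private int remaining;
        boolean aborted = false;

        TruncatedStream(int bytesBeforeMinusOne) { this.remaining = bytesBeforeMinusOne; }

        @Override public int read() {
            if (remaining > 0) {
                remaining--;
                return 0;
            }
            return -1; // indistinguishable from a genuine EOF
        }

        /** Stand-in for aborting the underlying HTTP connection. */
        void abort() { aborted = true; }
    }

    /** Mimics a length check: treats -1 as EOF and verifies the byte count. */
    static class LengthCheckingStream extends InputStream {
        private final TruncatedStream in;
        private final long expected;
        private long seen = 0;

        LengthCheckingStream(TruncatedStream in, long expected) {
            this.in = in;
            this.expected = expected;
        }

        @Override public int read() throws IOException {
            int b = in.read();
            if (b == -1) {
                if (seen != expected) {
                    throw new LengthMismatchException(
                        "read " + seen + " bytes, expected " + expected);
                }
            } else {
                seen++;
            }
            return b;
        }
    }

    /** Try to drain; abort on an IOException or on the unchecked length mismatch. */
    static String closeStream(LengthCheckingStream wrapped, TruncatedStream raw) {
        try {
            while (wrapped.read() >= 0) {
                // discard the remaining bytes so the connection could be reused
            }
            wrapped.close();
            return "drained";
        } catch (IOException | LengthMismatchException e) {
            raw.abort();
            return "aborted";
        }
    }

    public static void main(String[] args) {
        TruncatedStream timedOut = new TruncatedStream(3);
        System.out.println(closeStream(new LengthCheckingStream(timedOut, 10), timedOut));

        TruncatedStream complete = new TruncatedStream(10);
        System.out.println(closeStream(new LengthCheckingStream(complete, 10), complete));
    }
}
```

The only point of the sketch is the catch clause: because the length-mismatch exception is unchecked, a handler that catches IOException alone would let it escape from the drain loop, whereas catching both routes the timeout case to the same abort path.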