[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540126#comment-16540126 ] Steve Loughran commented on HADOOP-15541:
-
OK, cherry-picked to branch-3.1; let's close this one. Thank you for finding a new failure mode :)

> AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
> -
>
> Key: HADOOP-15541
> URL: https://issues.apache.org/jira/browse/HADOOP-15541
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 2.9.1, 2.8.4, 3.0.2, 3.1.1
> Reporter: Sean Mackrory
> Assignee: Sean Mackrory
> Priority: Major
> Fix For: 3.1.1
>
> Attachments: HADOOP-15541.001.patch
>
> I've gotten a few reports of read timeouts not being handled properly in some
> Impala workloads. The following sequence of events occurs (credit to
> Sailesh Mukil for figuring this out):
> * S3AInputStream.read() gets a SocketTimeoutException when it calls
> wrappedStream.read()
> * This is handled by onReadFailure -> reopen -> closeStream. When we try to
> drain the stream, SdkFilterInputStream.read() in the AWS SDK fails because of
> checkLength. The underlying Apache Commons stream returns -1 both in the case
> of a timeout and at EOF.
> * The SDK assumes the -1 signifies an EOF, so it assumes the bytes read must
> equal the expected bytes; because they don't (it's a timeout, not an EOF), it
> throws an SdkClientException.
> This is tricky to test for without a ton of mocking of AWS SDK internals,
> because you have to get into this conflicting state where the SDK has only
> read a subset of the expected bytes and gets a -1.
> closeStream will abort the stream in the event of an IOException when
> draining. We could simply also abort in the event of an SdkClientException.
> I'm testing that this results in correct functionality in the workloads that
> seem to hit these timeouts a lot, and all the s3a tests continue to work with
> that change. I'm going to open an issue on the AWS SDK GitHub as well, but
> I'm not sure what the ideal outcome would be unless there's a good way to
> distinguish between a stream that has timed out and a stream that read all
> the data without huge rewrites.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
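The timeout/EOF confusion described in the report can be sketched in isolation. This is a simplified, hypothetical model, not the real SDK code: `TimeoutStream` and `drainAndCheck` are illustrative stand-ins for the HTTP-layer stream and the length check in `SdkFilterInputStream`.

```java
import java.io.IOException;
import java.io.InputStream;

public class TimeoutVsEofDemo {

    /** A stream whose connection "times out" partway through: past that
     *  point, read() returns -1, exactly like a genuine end-of-stream. */
    static class TimeoutStream extends InputStream {
        private final byte[] data;
        private final int timeoutAfter;
        private int pos;

        TimeoutStream(byte[] data, int timeoutAfter) {
            this.data = data;
            this.timeoutAfter = timeoutAfter;
        }

        @Override
        public int read() {
            if (pos >= timeoutAfter || pos >= data.length) {
                return -1; // timeout and EOF are indistinguishable here
            }
            return data[pos++] & 0xFF;
        }
    }

    /** Mimics the SDK-side length check: a -1 seen before the expected
     *  byte count is treated as truncation and triggers an exception
     *  (an SdkClientException in the real SDK). */
    static void drainAndCheck(InputStream in, long expectedBytes) throws IOException {
        long read = 0;
        while (in.read() != -1) {
            read++;
        }
        if (read != expectedBytes) {
            throw new IOException(
                "Data read (" + read + ") != expected (" + expectedBytes + ")");
        }
    }

    public static void main(String[] args) {
        byte[] payload = new byte[10];
        try {
            // The connection "times out" after 4 of the 10 expected bytes.
            drainAndCheck(new TimeoutStream(payload, 4), payload.length);
        } catch (IOException e) {
            System.out.println("drain failed: " + e.getMessage());
        }
    }
}
```

With a stream that times out after 4 of 10 bytes, the drain sees -1 early and the length check fires, which is the conflicting state the report says is hard to reach through mocking.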
[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1653#comment-1653 ] Hudson commented on HADOOP-15541:
-
SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14550 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/14550/])
HADOOP-15541. [s3a] Shouldn't try to drain stream before aborting (mackrorysd: rev d503f65b6689b19278ec2a0cf9da5a8762539de8)
* (edit) hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538842#comment-16538842 ] Sean Mackrory commented on HADOOP-15541:
Thanks Steve, committed. I'd like to commit this right now to address the known issue. I want to do a bit of searching around and see if I can find any cases of IOExceptions where it would make sense to reuse the stream before taking it further. I'll file a separate JIRA for that before resolving...
[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538498#comment-16538498 ] Steve Loughran commented on HADOOP-15541:
-
Looks good. One issue: should we always force close on a read failure, rather than treat SocketTimeoutException as special? I guess there are some potential failure modes (source was deleted during the read) which could trigger IOEs during the GET (maybe? Do we test this with a large enough file to be sure there's no caching going on? If not, I could imagine adding it to the huge files test). If we say "every IOE -> forced abort", then it's a simpler path on read. What you have here, though, is the core fix: on socket errors, don't try to recycle things. What do you think? If you want this one as is, you've got my +1. I'm just wondering whether the separate catch for SocketTimeoutException is needed.
[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535510#comment-16535510 ] genericqa commented on HADOOP-15541:
-
-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 20s | Docker mode activated. |
|| || Prechecks || || ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || trunk Compile Tests || || ||
| +1 | mvninstall | 27m 47s | trunk passed |
| +1 | compile | 0m 27s | trunk passed |
| +1 | checkstyle | 0m 13s | trunk passed |
| +1 | mvnsite | 0m 30s | trunk passed |
| +1 | shadedclient | 11m 19s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 35s | trunk passed |
| +1 | javadoc | 0m 20s | trunk passed |
|| || Patch Compile Tests || || ||
| +1 | mvninstall | 0m 31s | the patch passed |
| +1 | compile | 0m 23s | the patch passed |
| +1 | javac | 0m 23s | the patch passed |
| +1 | checkstyle | 0m 9s | the patch passed |
| +1 | mvnsite | 0m 27s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 59s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 4s | the patch passed |
| +1 | javadoc | 0m 19s | the patch passed |
|| || Other Tests || || ||
| +1 | unit | 4m 31s | hadoop-aws in the patch passed. |
| +1 | asflicense | 0m 22s | The patch does not generate ASF License warnings. |
| | | 61m 33s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | HADOOP-15541 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12930597/HADOOP-15541.001.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux e6408faeb955 3.13.0-137-generic #186-Ubuntu SMP Mon Dec 4 19:09:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / ba68320 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/14864/testReport/ |
| Max. process+thread count | 302 (vs. ulimit of 1) |
| modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/14864/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535386#comment-16535386 ] Sean Mackrory commented on HADOOP-15541:
Although at first glance, it certainly seems that calling streamStatistics.streamClose takes care of all that, and we're doing that.
[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535383#comment-16535383 ] Sean Mackrory commented on HADOOP-15541:
Thanks for the explanation [~ste...@apache.org]. Attaching a patch that uses the existing force-abort code in the event of a timeout. All tests continue to pass, and the workload that was consistently timing out before stopped doing so as soon as this patch was applied. I just saw your comment about incrementing metrics, though. Let me check for those and revise the patch if necessary.
[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517417#comment-16517417 ] Steve Loughran commented on HADOOP-15541:
-
Stream draining means the HTTP/1.1 connection can be returned to the pool and so save setup costs, which is why we like to do it on close(). But here, if we can conclude that the connection is in trouble, should we return it?
No objection to doing the abort for IOEs and SDK exceptions; I was suggesting the arg because the reopen code already takes that param...requesting that forced abort after an exception on read() would be good.
Though: are you suggesting that for any IOE/SDK exception we don't try to reopen the call, just force the abort() before throwing up the exception? If so, yes, that also makes sense. We don't want a failing HTTP connection to be recycled.
Make sure any metrics on forced aborts are incremented, though.
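The pooling trade-off described above can be reduced to one decision: draining costs reading the rest of the body but lets the pooled HTTP/1.1 connection be reused; aborting is cheap now but forces a new connection setup later. A hypothetical sketch of such a heuristic (illustrative only; the threshold and method name are not real S3A constants):

```java
public class DrainVsAbort {

    /** Drain only when little data remains; otherwise setting up a fresh
     *  connection is cheaper than reading the rest of the body. */
    static boolean shouldDrain(long remainingBytes, long drainThreshold) {
        return remainingBytes <= drainThreshold;
    }

    public static void main(String[] args) {
        long threshold = 64 * 1024; // illustrative threshold, an assumption
        // Little left in the body: drain it and recycle the connection.
        System.out.println(shouldDrain(16 * 1024, threshold));
        // Hundreds of MB left: reading it all just to recycle is a waste; abort.
        System.out.println(shouldDrain(512L * 1024 * 1024, threshold));
    }
}
```

Either way, the discussion here is that after a read *failure* the connection is suspect, so no remaining-bytes heuristic applies: it should be aborted unconditionally.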
[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516306#comment-16516306 ] Sean Mackrory commented on HADOOP-15541:
{quote}Like you say, no real point in not aborting here.{quote}
Help me understand, though: when *do* we get a benefit from draining the stream instead of simply aborting?
{quote}Happy for a patch, I don't think we can test this easily so not expecting any tests in the patch...{quote}
Yeah. This was (at the time, anyway) happening pretty repeatedly with a particular workload - I'm hoping that keeps up so I can be fairly confident that the end result here is correct handling of timeouts.
[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513608#comment-16513608 ] Steve Loughran commented on HADOOP-15541:
-
I've worried about something related to this for a while, precisely because we are using close(), not abort(). Assuming the error on read() is due to a network problem, breaking the whole TCP connection is the only way to guarantee that your follow-up GET isn't on the same HTTP/1.1 stream. I wasn't too worried, on the basis that nobody had complained...clearly that's not true any more. And my expectation of how things would fail was worse.
Here's one possible strategy:
# {{S3AInputStream.reopen()}} adds a {{boolean forceAbort}} param and passes it in to {{closeStream}}
# {{S3AInputStream.onReadFailure()}} forces that abort. Like you say, no real point in not aborting here.
Happy for a patch. I don't think we can test this easily, so I'm not expecting any tests in the patch...
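The two-step strategy above can be sketched as follows. This is an assumed shape, not the committed patch; `FakeObjectStream` is a hypothetical stand-in for the SDK's object-content stream.

```java
public class ForceAbortSketch {

    /** Hypothetical stand-in for the SDK object-content stream. */
    static class FakeObjectStream {
        boolean aborted;
        boolean drained;
        void abort() { aborted = true; }  // drop the TCP connection outright
        void drain() { drained = true; }  // read to EOF so the pooled connection can be reused
    }

    /** Sketch of closeStream with the proposed forceAbort flag. */
    static void closeStream(FakeObjectStream s, boolean forceAbort) {
        if (forceAbort) {
            s.abort(); // never hand a possibly-broken connection back to the pool
            return;
        }
        try {
            s.drain(); // normal close: drain so the HTTP/1.1 connection is recycled
        } catch (RuntimeException e) {
            s.abort(); // draining itself failed, so abort after all
        }
    }

    /** Per the proposal: a failure on read() always forces the abort path. */
    static void onReadFailure(FakeObjectStream s) {
        closeStream(s, true);
    }

    public static void main(String[] args) {
        FakeObjectStream s = new FakeObjectStream();
        onReadFailure(s);
        System.out.println("aborted=" + s.aborted + ", drained=" + s.drained);
    }
}
```

The point of routing the flag through closeStream rather than adding a new method is that the ordinary close() path keeps its drain-then-recycle behaviour, while the failure path skips the drain entirely and so never hits the SDK length check.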
[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512974#comment-16512974 ] Sean Mackrory commented on HADOOP-15541:
Also filed an issue with the SDK: [https://github.com/aws/aws-sdk-java/issues/1630]. But like I said, I'm not sure what the point is, or whether there's anything wrong with just aborting on SdkClientExceptions, since we'll have to fail at some point anyway.
[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
[ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512941#comment-16512941 ] Sean Mackrory commented on HADOOP-15541:
There are a bunch of subtle bugs that have led to us recovering the way we do. Pinging [~ste...@apache.org], who has worked on a few of these: do you know what benefit we gain from draining the stream instead of simply aborting and starting a new stream?