[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions

2018-07-11 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540126#comment-16540126
 ] 

Steve Loughran commented on HADOOP-15541:
-

OK, cherry-picked to branch-3.1; let's close this one.

Thank you for finding a new failure mode :)

> AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
> -
>
> Key: HADOOP-15541
> URL: https://issues.apache.org/jira/browse/HADOOP-15541
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs/s3
>Affects Versions: 2.9.1, 2.8.4, 3.0.2, 3.1.1
>Reporter: Sean Mackrory
>Assignee: Sean Mackrory
>Priority: Major
> Fix For: 3.1.1
>
> Attachments: HADOOP-15541.001.patch
>
>
> I've gotten a few reports of read timeouts not being handled properly in some 
> Impala workloads. The following sequence of events occurs (credit to 
> Sailesh Mukil for figuring this out):
>  * S3AInputStream.read() gets a SocketTimeoutException when it calls 
> wrappedStream.read()
>  * This is handled by onReadFailure -> reopen -> closeStream. When we try to 
> drain the stream, SdkFilterInputStream.read() in the AWS SDK fails because of 
> checkLength. The underlying Apache Commons stream returns -1 both on a 
> timeout and at EOF.
>  * The SDK assumes the -1 signifies an EOF, and therefore that the bytes read 
> must equal the expected bytes; because they don't (it's a timeout, not an 
> EOF), it throws an SdkClientException. (A toy illustration of this confusion 
> follows below.)
> This is tricky to test for without a ton of mocking of AWS SDK internals, 
> because you have to get into the conflicting state where the SDK has only 
> read a subset of the expected bytes and then gets a -1.
> closeStream will abort the stream in the event of an IOException while 
> draining. We could simply also abort in the event of an SdkClientException. 
> I'm still verifying that this restores correct behavior in the workloads that 
> hit these timeouts frequently; all the s3a tests continue to pass with the 
> change. I'm going to open an issue on the AWS SDK GitHub as well, but I'm not 
> sure what the ideal outcome would be, unless there's a good way to 
> distinguish between a stream that has timed out and a stream that has read 
> all the data, without huge rewrites.
>  
>  
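To make the sequence above concrete, here is a self-contained toy
illustration of the mistaken-EOF confusion. It is a sketch only:
{{TimingOutStream}}, {{LengthCheckingStream}} and the IllegalStateException
are invented stand-ins for the SDK's stream, its checkLength logic and
SdkClientException; none of this is actual AWS SDK source.

{code:java}
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class LengthCheckDemo {

  /** Simulates an HTTP body stream that times out after delivering 4 of its
   *  bytes; the underlying stream surfaces the timeout as -1, exactly like
   *  an EOF. */
  static class TimingOutStream extends ByteArrayInputStream {
    private int delivered = 0;
    TimingOutStream(byte[] body) { super(body); }
    @Override
    public synchronized int read() {
      if (delivered >= 4) {
        return -1;          // timeout: indistinguishable from end-of-stream
      }
      delivered++;
      return super.read();
    }
  }

  /** Stand-in for the SDK's length check: on apparent EOF, verify the bytes
   *  read match the expected content length, and fail if they don't. */
  static class LengthCheckingStream extends FilterInputStream {
    private final long expected;
    private long count = 0;
    LengthCheckingStream(InputStream in, long expected) {
      super(in);
      this.expected = expected;
    }
    @Override
    public int read() throws IOException {
      int b = in.read();
      if (b == -1 && count != expected) {
        // The timeout's -1 is misread as a short EOF.
        throw new IllegalStateException("Data read (" + count
            + ") has a different length than expected (" + expected + ")");
      }
      if (b != -1) {
        count++;
      }
      return b;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] body = "0123456789".getBytes();          // "Content-Length: 10"
    InputStream s =
        new LengthCheckingStream(new TimingOutStream(body), body.length);
    while (s.read() != -1) {
      // draining single bytes, as closeStream does; throws after 4 bytes
    }
  }
}
{code}

Draining such a stream, as closeStream does on the reopen path, is exactly
what triggers the exception; the only safe option left is to abort the
connection.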



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions

2018-07-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1653#comment-1653
 ] 

Hudson commented on HADOOP-15541:
-

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14550 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/14550/])
HADOOP-15541. [s3a] Shouldn't try to drain stream before aborting (mackrorysd: 
rev d503f65b6689b19278ec2a0cf9da5a8762539de8)
* (edit) 
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java





[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions

2018-07-10 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538842#comment-16538842
 ] 

Sean Mackrory commented on HADOOP-15541:


Thanks Steve, committed. I'd like to commit this right now to address the known 
issue. I want to do a bit of searching around to see if I can find any cases of 
IOExceptions where it would make sense to reuse the stream before taking it 
further. I'll file a separate JIRA for that before resolving...




[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions

2018-07-10 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538498#comment-16538498
 ] 

Steve Loughran commented on HADOOP-15541:
-

Looks good. 
One issue: should we always force-close on a read failure, rather than treat 
SocketTimeoutException as special? I guess there are some potential failure 
modes (e.g. the source being deleted during the read) which could trigger IOEs 
during the GET (maybe? Do we test this with a large enough file to be sure 
there's no caching going on? If not, I could imagine adding it to the 
huge-files test). 

If we say "every IOE -> forced abort", then it's a simpler path on read. What 
you have here, though, is the core fix: on socket errors, don't try to recycle 
things.

What do you think? If you want this one as is, you've got my +1. I'm just 
wondering whether a separate catch for SocketTimeoutException is needed.




[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions

2018-07-06 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535510#comment-16535510
 ] 

genericqa commented on HADOOP-15541:


| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 20s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 27m 47s | trunk passed |
| +1 | compile | 0m 27s | trunk passed |
| +1 | checkstyle | 0m 13s | trunk passed |
| +1 | mvnsite | 0m 30s | trunk passed |
| +1 | shadedclient | 11m 19s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 35s | trunk passed |
| +1 | javadoc | 0m 20s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 31s | the patch passed |
| +1 | compile | 0m 23s | the patch passed |
| +1 | javac | 0m 23s | the patch passed |
| +1 | checkstyle | 0m 9s | the patch passed |
| +1 | mvnsite | 0m 27s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 59s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 4s | the patch passed |
| +1 | javadoc | 0m 19s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 4m 31s | hadoop-aws in the patch passed. |
| +1 | asflicense | 0m 22s | The patch does not generate ASF License warnings. |
| | | 61m 33s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | HADOOP-15541 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12930597/HADOOP-15541.001.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux e6408faeb955 3.13.0-137-generic #186-Ubuntu SMP Mon Dec 4 19:09:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / ba68320 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/14864/testReport/ |
| Max. process+thread count | 302 (vs. ulimit of 1) |
| modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/14864/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions

2018-07-06 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535386#comment-16535386
 ] 

Sean Mackrory commented on HADOOP-15541:


Although at first glance, it certainly seems that calling 
streamStatistics.streamClose takes care of all that, and we're already doing that.




[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions

2018-07-06 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535383#comment-16535383
 ] 

Sean Mackrory commented on HADOOP-15541:


Thanks for the explanation [~ste...@apache.org]. Attaching a patch that uses 
the existing force-abort code in the event of a timeout. All tests continue to 
pass, and the workload that was consistently timing out stopped doing so as 
soon as this patch was applied. I just saw your comment about incrementing 
metrics, though. Let me check for those and revise the patch if necessary.




[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions

2018-06-19 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517417#comment-16517417
 ] 

Steve Loughran commented on HADOOP-15541:
-

Draining the stream means the HTTP/1.1 connection can be returned to the pool, 
saving setup costs, which is why we like to do it on close().

But here, if we can conclude that the connection is in trouble, should we 
return it? 

No objection to doing the abort for IOEs and SDK exceptions; I was suggesting 
the arg because the reopen code already takes that param... requesting that 
forced abort after an exception on read() would be good.

Though: are you suggesting that for any IOE/SDK exception we don't try to 
reopen the call, just force the abort() before rethrowing the exception? If 
so, yes, that also makes sense. We don't want a failing HTTP connection to be 
recycled.

Make sure any metrics on forced aborts are incremented, though.




[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions

2018-06-18 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516306#comment-16516306
 ] 

Sean Mackrory commented on HADOOP-15541:


{quote}Like you say, no real point in not aborting here.{quote}

Help me understand, though: when *do* we get a benefit from draining the stream 
instead of simply aborting?

{quote}Happy for a patch, I don't think we can test this easily so not 
expecting any tests in the patch...{quote}

Yeah. This was (at the time, anyway) happening pretty repeatedly with a 
particular workload - I'm hoping that keeps up so I can be fairly confident 
that the end result here is correct handling of timeouts.




[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions

2018-06-15 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513608#comment-16513608
 ] 

Steve Loughran commented on HADOOP-15541:
-

I've worried about something related to this for a while, precisely because we 
are using close(), not abort(). Assuming the error on read() is due to a 
network problem, breaking the whole TCP connection is the only way to 
guarantee that your follow-up GET isn't on the same HTTP/1.1 stream.

I wasn't too worried, on the basis that nobody had complained... clearly that's 
not true any more. And my expectation of how things would fail was worse. 

Here's one possible strategy:

# {{S3AInputStream.reopen()}} adds a {{boolean forceAbort}} param and passes it 
in to {{closeStream}};
# {{S3AInputStream.onReadFailure()}} forces that abort.

Like you say, no real point in not aborting here.
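A minimal sketch of that plumbing (simplified signatures and fields, not the 
actual S3AInputStream source; the new GET in reopen() is elided). The drain in 
closeStream() falls back to abort() on either an IOException or the 
SdkClientException thrown by the mistaken length check:

{code:java}
import java.io.IOException;

import com.amazonaws.SdkClientException;
import com.amazonaws.services.s3.model.S3ObjectInputStream;

class ForceAbortSketch {
  private S3ObjectInputStream wrappedStream;  // stands in for the real field

  /** A read failed: don't recycle the connection, reopen with forceAbort. */
  private void onReadFailure(IOException ioe, int length) throws IOException {
    reopen("failure recovery", 0L /* pos, simplified */, length, true);
  }

  private void reopen(String reason, long targetPos, long length,
      boolean forceAbort) throws IOException {
    closeStream(reason, length, forceAbort);
    // ... issue the new GET and reassign wrappedStream (elided) ...
  }

  private void closeStream(String reason, long length, boolean forceAbort) {
    if (wrappedStream == null) {
      return;
    }
    boolean shouldAbort = forceAbort;
    if (!shouldAbort) {
      try {
        // Drain so the HTTP/1.1 connection can go back to the pool.
        while (wrappedStream.read() >= 0) {
          // discard
        }
        wrappedStream.close();
      } catch (IOException | SdkClientException e) {
        // Draining a timed-out stream fails -- the SDK can mistake the -1
        // for a short EOF and throw SdkClientException -- so recycling the
        // connection is unsafe.
        shouldAbort = true;
      }
    }
    if (shouldAbort) {
      wrappedStream.abort();   // break the TCP connection outright
    }
    wrappedStream = null;
  }
}
{code}

The commit that eventually went in ("[s3a] Shouldn't try to drain stream 
before aborting") takes the same direction: once the code has decided to 
abort, it skips the drain entirely.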

Happy for a patch, I don't think we can test this easily so not expecting any 
tests in the patch...




[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions

2018-06-14 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512974#comment-16512974
 ] 

Sean Mackrory commented on HADOOP-15541:


Also filed an issue with the SDK: 
https://github.com/aws/aws-sdk-java/issues/1630. But like I said, I'm not 
sure what the point is, or if there's anything wrong with just aborting on 
SdkClientExceptions, since we'll have to fail at some point anyway.




[jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions

2018-06-14 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512941#comment-16512941
 ] 

Sean Mackrory commented on HADOOP-15541:


There are a bunch of subtle bugs that have led to us recovering the way we 
do. Pinging [~ste...@apache.org], who has worked on a few of these: do you know 
what benefit we gain from draining the stream instead of simply aborting and 
starting a new stream?
