[jira] [Work logged] (HADOOP-17812) NPE in S3AInputStream read() after failure to reconnect to store

ASF GitHub Bot (Jira) Tue, 27 Jul 2021 04:21:07 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-17812?focusedWorklogId=628370&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-628370
 ]


ASF GitHub Bot logged work on HADOOP-17812:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 27/Jul/21 11:20
            Start Date: 27/Jul/21 11:20
    Worklog Time Spent: 10m 
      Work Description: steveloughran commented on pull request #3222:
URL: https://github.com/apache/hadoop/pull/3222#issuecomment-887428882


   > if the onReadFailure called on detecting wrappedStream==null fails, then 
onReadFailure will be called again in the catch (IOException e) { block. 
My intention is to let the retry do that.
   
   I'm thinking: 
   
   1. pull the `if (wrappedStream == null) onReadFailure()` check out of the 
try/catch, so that the double check doesn't happen
   2. change the exception handling to only close the wrapped stream, not try 
to reopen it:
   
   
   ```java
   if (wrappedStream == null) {
     // trigger a re-open.
     reopen("failure recovery", getPos(), 1, false);
   }
   try {
     b = wrappedStream.read();
   } catch (EOFException e) {
     return -1;
   } catch (SocketTimeoutException e) {
     onReadFailure(e, 1, true);
     throw e;
   } catch (IOException e) {
     onReadFailure(e, 1, false);
     throw e;
   }
   ```
   
   Critically, onReadFailure will stop trying to reopen the stream; instead it 
releases the stream (or, on a socket exception, force-aborts it). That removes 
`throws IOException` from its signature.
   
   ```java
   private void onReadFailure(IOException ioe, int length, boolean forceAbort) {
     if (LOG.isDebugEnabled()) {
       LOG.debug("Got exception while trying to read from stream {}, " +
           "client: {} object: {}, trying to recover: ",
           uri, client, object, ioe);
     } else {
       LOG.info("Got exception while trying to read from stream {}, " +
           "client: {} object: {}, trying to recover: " + ioe,
           uri, client, object);
     }
     streamStatistics.readException();
     closeStream("failure recovery", contentRangeFinish, forceAbort);   // HERE
   }
   ```
   
   so:
   
   1. there's no attempt to reopen the stream (so it cannot raise an IOE any 
more). The exception raised in the initial read() failure is the one passed up 
to the S3ARetryPolicy.
   2. the new reopen (note, there's one hidden in lazySeek) does the reconnect; 
if it fails, its exception is thrown to the retry policy on the second attempt.
   
   Result:
   * no risk of a reopen() exception overriding the initial failure
   * the moved reopen() call has gone from being a special case only 
encountered after a double IOE to one reached after any single failure, so it 
gets better coverage.
   * initial read failure will encounter the initial brief retry delay before 
trying to reconnect. I don't see that slowing down the operation at all as it 
was going to happen anyway.
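
To make the intended flow concrete, here is a rough, self-contained sketch. Everything in it is a hypothetical stand-in rather than the real S3AInputStream code: the class name `LazyReopenStream`, the fault-injection counters, and the toy `read(maxAttempts)` loop (which stands in for the retry policy and has no backoff) are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for the stream being discussed; not the real S3AInputStream. */
class LazyReopenStream {
  private final byte[] data;
  private InputStream wrappedStream;   // null means "closed; reopen before the next read"
  private int pos;
  private int readFailures;            // fault injection: upcoming reads that fail
  private int reopenFailures;          // fault injection: upcoming reopens that fail
  int reopenCount;                     // successful reopens, for inspection

  LazyReopenStream(byte[] data, int readFailures, int reopenFailures) {
    this.data = data;
    this.readFailures = readFailures;
    this.reopenFailures = reopenFailures;
    this.wrappedStream = new ByteArrayInputStream(data);  // initial open succeeds
  }

  /** Reconnect at the current position; may fail, leaving wrappedStream null. */
  private void reopen() throws IOException {
    if (reopenFailures > 0) {
      reopenFailures--;
      throw new IOException("simulated reopen failure");
    }
    wrappedStream = new ByteArrayInputStream(data, pos, data.length - pos);
    reopenCount++;
  }

  /** Failure handler: only drops the stream, so it cannot throw. */
  private void onReadFailure(IOException ioe) {
    wrappedStream = null;
  }

  /** One attempt; the reopen sits outside the try/catch. */
  int readOnce() throws IOException {
    if (wrappedStream == null) {
      reopen();  // a failure here goes straight to the caller's retry loop
    }
    int b;
    try {
      if (readFailures > 0) {
        readFailures--;
        throw new IOException("simulated read failure");
      }
      b = wrappedStream.read();
    } catch (IOException e) {
      onReadFailure(e);
      throw e;  // the original read failure is what the retry loop sees
    }
    if (b >= 0) {
      pos++;
    }
    return b;
  }

  /** Toy retry loop standing in for the retry policy: fixed attempts, no backoff. */
  int read(int maxAttempts) throws IOException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    IOException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return readOnce();
      } catch (IOException e) {
        last = e;
      }
    }
    throw last;
  }
}
```

With one injected read failure and one injected reopen failure, `new LazyReopenStream(bytes, 1, 1).read(3)` still returns the first byte. With only two attempts the caller sees the reopen's IOException rather than a NullPointerException, because a null `wrappedStream` is never dereferenced.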
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 628370)
    Time Spent: 2.5h  (was: 2h 20m)

> NPE in S3AInputStream read() after failure to reconnect to store
> ----------------------------------------------------------------
>
>                 Key: HADOOP-17812
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17812
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.2.2, 3.3.1
>            Reporter: Bobby Wang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: s3a-test.tar.gz
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> When [reading from S3A 
> storage|https://github.com/apache/hadoop/blob/rel/release-3.2.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L450],
>  an SSLException (which extends IOException) can occur, which triggers 
> [onReadFailure|https://github.com/apache/hadoop/blob/rel/release-3.2.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L458].
> onReadFailure calls "reopen", which first closes the original 
> *wrappedStream* and sets *wrappedStream = null*, and then tries to 
> [re-get 
> *wrappedStream*|https://github.com/apache/hadoop/blob/rel/release-3.2.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L184].
>  But if the preceding code [obtaining 
> S3Object|https://github.com/apache/hadoop/blob/rel/release-3.2.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L183]
>  throws an exception, "wrappedStream" stays null.
> The 
> [retry|https://github.com/apache/hadoop/blob/rel/release-3.2.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L446]
>  mechanism may then re-execute 
> [wrappedStream.read|https://github.com/apache/hadoop/blob/rel/release-3.2.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L450]
>  and cause an NPE.
>  
> For more details, please refer to 
> [https://github.com/NVIDIA/spark-rapids/issues/2915]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

