[jira] [Updated] (HDDS-4667) XCeiverClientGrpc should give up if unexpected exception is thrown from read path

Marton Elek (Jira) Mon, 11 Jan 2021 04:16:05 -0800


     [ https://issues.apache.org/jira/browse/HDDS-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Marton Elek updated HDDS-4667:
------------------------------
    Description: 
Found this while using a data generator.

 1. I accidentally uploaded keys without checksum data.

 2. When reading such a key, the client enters an endless loop instead of 
giving up after the first unexpected exception:

{code}
2021-01-11 13:01:50,031 INFO  storage.BlockInputStream 
(BlockInputStream.java:refreshPipeline(166)) - Unable to read information for 
block conID: 2 locID: 185 bcsId: 0 from pipeline 
PipelineID=206da15d-62f6-4e24-93d1-e2e805fc1376: Unexpected OzoneException: 
org.apache.hadoop.ozone.common.OzoneChecksumException: Original checksumData 
has no checksums
2021-01-11 13:01:50,047 ERROR scm.XceiverClientGrpc 
(XceiverClientGrpc.java:sendCommandWithRetry(408)) - Failed to execute command 
cmdType: ReadChunk
traceID: ""
containerID: 2
datanodeUuid: "2c124e08-e8a5-4493-a41e-84797984e6a6"
readChunk {
  blockID {
    containerID: 2
    localID: 185
    blockCommitSequenceId: 0
  }
  chunkData {
    chunkName: "chunk0"
    offset: 0
    len: 4194304
    checksumData {
      type: CRC32
      bytesPerChecksum: 1048576
    }
  }
}
 on the pipeline Pipeline[ Id: 7d5ed2da-7453-4113-b766-4100458dcc16, Nodes: 
2c124e08-e8a5-4493-a41e-84797984e6a6{ip: 127.0.0.1, host: localhost, 
networkLocation: /default-rack, certSerialId: null, persistedOpState: 
IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, Type:STAND_ALONE, Factor:THREE, 
State:OPEN, leaderId:, CreationTimestamp2021-01-11T12:01:50.032Z].
2021-01-11 13:01:50,047 INFO  storage.BlockInputStream 
(BlockInputStream.java:refreshPipeline(166)) - Unable to read information for 
block conID: 2 locID: 185 bcsId: 0 from pipeline 
PipelineID=7d5ed2da-7453-4113-b766-4100458dcc16: Unexpected OzoneException: 
org.apache.hadoop.ozone.common.OzoneChecksumException: Original checksumData 
has no checksums
2021-01-11 13:01:50,062 ERROR scm.XceiverClientGrpc 
(XceiverClientGrpc.java:sendCommandWithRetry(408)) - Failed to execute command 
cmdType: ReadChunk
traceID: ""
containerID: 2
datanodeUuid: "2c124e08-e8a5-4493-a41e-84797984e6a6"
readChunk {
  blockID {
    containerID: 2
    localID: 185
    blockCommitSequenceId: 0
  }
  chunkData {
    chunkName: "chunk0"
    offset: 0
    len: 4194304
    checksumData {
      type: CRC32
      bytesPerChecksum: 1048576
    }
  }
}
 on the pipeline Pipeline[ Id: 3a4b5032-6b2f-4297-8c4b-89d715175bb1, Nodes: 
2c124e08-e8a5-4493-a41e-84797984e6a6{ip: 127.0.0.1, host: localhost, 
networkLocation: /default-rack, certSerialId: null, persistedOpState: 
IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, Type:STAND_ALONE, Factor:THREE, 
State:OPEN, leaderId:, CreationTimestamp2021-01-11T12:01:50.048Z].
{code}

Please note that the attempts follow each other within milliseconds (about 
15 ms apart in the log above), with no sleep between them.

The problematic part seems to be in the BlockInputStream:

{code}
      try {
        numBytesRead = current.read(b, off, numBytesToRead);
      } catch (IOException e) {
        handleReadError(e);
        continue;
      }
{code}

In the case of system exceptions we should "break" out of the loop instead of 
using "continue".

(Normally this cannot happen in a production cluster, because such data can 
only be created by a faulty client. But it has a security implication: a 
malicious user can create similar keys to mount a DoS attack, since all the 
clients reading them will retry without sleep...)
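
A minimal sketch of the proposed behaviour, under stated assumptions: it is not the actual BlockInputStream code. ReadLoopSketch, ChunkSource, RetriableReadException, MAX_RETRIES and RETRY_SLEEP_MS are hypothetical names introduced only for illustration. The loop retries a bounded number of times, with a sleep, when the error is marked as retriable, and gives up immediately on any other IOException (such as the missing-checksum error above).

```java
import java.io.IOException;
import java.io.InterruptedIOException;

/** Hypothetical stand-in for the read loop; not the real BlockInputStream. */
final class ReadLoopSketch {
  // Assumed limits, chosen only for the example.
  static final int MAX_RETRIES = 3;
  static final long RETRY_SLEEP_MS = 10;

  /** Hypothetical marker for errors where a retry may help (e.g. a stale pipeline). */
  static final class RetriableReadException extends IOException {
    RetriableReadException(String message) {
      super(message);
    }
  }

  /** Hypothetical source, standing in for "current.read(b, off, numBytesToRead)". */
  interface ChunkSource {
    int read(byte[] b, int off, int len) throws IOException;
  }

  static int read(ChunkSource current, byte[] b, int off, int len)
      throws IOException {
    int totalRead = 0;
    int retries = 0;
    while (len > 0) {
      int numBytesRead;
      try {
        numBytesRead = current.read(b, off, len);
      } catch (RetriableReadException e) {
        // Retrying may help, but only a bounded number of times.
        if (++retries > MAX_RETRIES) {
          throw e;
        }
        try {
          // Sleep a little longer before each retry instead of spinning.
          Thread.sleep(RETRY_SLEEP_MS * retries);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new InterruptedIOException("interrupted while waiting to retry");
        }
        continue;
      }
      // Any other IOException is not caught here: it propagates to the
      // caller, so the loop ends after the first unexpected error.
      if (numBytesRead < 0) {
        break; // end of data
      }
      retries = 0;
      totalRead += numBytesRead;
      off += numBytesRead;
      len -= numBytesRead;
    }
    return totalRead;
  }
}
```

The sketch rethrows rather than using a bare "break", on the assumption that the caller should see the failure instead of a silently short read. How the real client should distinguish retriable from unexpected errors is left open here.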

  was:
Found it during the usage of a data generator.

 1. I accidentally uploaded keys without checksum data.

  2. In this specific key, the client is moved to an endless loop instead of 
giving up after the first unexpected exceptions:

{code}
2021-01-11 13:01:50,031 INFO  storage.BlockInputStream 
(BlockInputStream.java:refreshPipeline(166)) - Unable to read information for 
block conID: 2 locID: 185 bcsId: 0 from pipeline 
PipelineID=206da15d-62f6-4e24-93d1-e2e805fc1376: Unexpected OzoneException: 
org.apache.hadoop.ozone.common.OzoneChecksumException: Original checksumData 
has no checksums
2021-01-11 13:01:50,047 ERROR scm.XceiverClientGrpc 
(XceiverClientGrpc.java:sendCommandWithRetry(408)) - Failed to execute command 
cmdType: ReadChunk
traceID: ""
containerID: 2
datanodeUuid: "2c124e08-e8a5-4493-a41e-84797984e6a6"
readChunk {
  blockID {
    containerID: 2
    localID: 185
    blockCommitSequenceId: 0
  }
  chunkData {
    chunkName: "chunk0"
    offset: 0
    len: 4194304
    checksumData {
      type: CRC32
      bytesPerChecksum: 1048576
    }
  }
}
 on the pipeline Pipeline[ Id: 7d5ed2da-7453-4113-b766-4100458dcc16, Nodes: 
2c124e08-e8a5-4493-a41e-84797984e6a6{ip: 127.0.0.1, host: localhost, 
networkLocation: /default-rack, certSerialId: null, persistedOpState: 
IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, Type:STAND_ALONE, Factor:THREE, 
State:OPEN, leaderId:, CreationTimestamp2021-01-11T12:01:50.032Z].
2021-01-11 13:01:50,047 INFO  storage.BlockInputStream 
(BlockInputStream.java:refreshPipeline(166)) - Unable to read information for 
block conID: 2 locID: 185 bcsId: 0 from pipeline 
PipelineID=7d5ed2da-7453-4113-b766-4100458dcc16: Unexpected OzoneException: 
org.apache.hadoop.ozone.common.OzoneChecksumException: Original checksumData 
has no checksums
2021-01-11 13:01:50,062 ERROR scm.XceiverClientGrpc 
(XceiverClientGrpc.java:sendCommandWithRetry(408)) - Failed to execute command 
cmdType: ReadChunk
traceID: ""
containerID: 2
datanodeUuid: "2c124e08-e8a5-4493-a41e-84797984e6a6"
readChunk {
  blockID {
    containerID: 2
    localID: 185
    blockCommitSequenceId: 0
  }
  chunkData {
    chunkName: "chunk0"
    offset: 0
    len: 4194304
    checksumData {
      type: CRC32
      bytesPerChecksum: 1048576
    }
  }
}
 on the pipeline Pipeline[ Id: 3a4b5032-6b2f-4297-8c4b-89d715175bb1, Nodes: 
2c124e08-e8a5-4493-a41e-84797984e6a6{ip: 127.0.0.1, host: localhost, 
networkLocation: /default-rack, certSerialId: null, persistedOpState: 
IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, Type:STAND_ALONE, Factor:THREE, 
State:OPEN, leaderId:, CreationTimestamp2021-01-11T12:01:50.048Z].
{code}

Please note that the two attempt happens in the same milliseconds.

The problematic part seems to be in the BlockInputStream:

{code}
      try {
        numBytesRead = current.read(b, off, numBytesToRead);
      } catch (IOException e) {
        handleReadError(e);
        continue;
      }
{code}

In case of system exceptions we should "break" from the loop instead of 
"continue".

(Normally it's not possible in a production cluster as the data is created with 
a bad client. But it has security implication: a malicious user can create 
similar keys which makes a DoS attack: all the clients will retry without 
sleep...)


> XCeiverClientGrpc should give up if unexpected exception is thrown from read 
> path
> ---------------------------------------------------------------------------------
>
>                 Key: HDDS-4667
>                 URL: https://issues.apache.org/jira/browse/HDDS-4667
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Client
>            Reporter: Marton Elek
>            Priority: Major



