[
https://issues.apache.org/jira/browse/HDFS-15588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sr2020 updated HDFS-15588:
--------------------------
Attachment: HDFS-15588-002.patch
> Arbitrarily low values for `dfs.block.access.token.lifetime` aren't safe and
> can cause a healthy datanode to be excluded
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-15588
> URL: https://issues.apache.org/jira/browse/HDFS-15588
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs, hdfs-client, security
> Reporter: sr2020
> Priority: Major
> Attachments: HDFS-15588-001.patch, HDFS-15588-002.patch
>
>
> *Problem*:
> Setting `dfs.block.access.token.lifetime` to an arbitrarily low value (like 1)
> makes the lifetime of a block token very short. As a result, healthy
> datanodes can be wrongly excluded by the client because of an
> `InvalidBlockTokenException`.
> More specifically, in `nextBlockOutputStream`, the client obtains an
> `accessToken` from the namenode and uses it to talk to a datanode. The
> lifetime of the `accessToken` can be made very small (like 1 minute) via
> `dfs.block.access.token.lifetime`. Under some extreme conditions (a VM
> migration, a temporary network issue, or a stop-the-world GC), the
> `accessToken` may already have expired by the time the client tries to use it
> to talk to the datanode. In that case, `createBlockOutputStream` returns
> false (and masks the `InvalidBlockTokenException`), so the client concludes
> the datanode is unhealthy, marks it as "excluded", and will never read/write
> on it again.
> Related code in `nextBlockOutputStream`:
> {code:java}
> // Connect to first DataNode in the list.
> success = createBlockOutputStream(nodes, nextStorageTypes, nextStorageIDs,
>     0L, false);
> if (!success) {
>   LOG.warn("Abandoning " + block);
>   dfsClient.namenode.abandonBlock(block.getCurrentBlock(),
>       stat.getFileId(), src, dfsClient.clientName);
>   block.setCurrentBlock(null);
>   final DatanodeInfo badNode = nodes[errorState.getBadNodeIndex()];
>   LOG.warn("Excluding datanode " + badNode);
>   excludedNodes.put(badNode, badNode);
> }
> {code}
>
> *Proposed solution*:
> A simple retry on the same datanode after catching
> `InvalidBlockTokenException` would solve this problem (assuming such extreme
> conditions are rare). Since `dfs.block.access.token.lifetime` currently even
> accepts values like 0, another option is to prevent users from setting it to
> an unsafe value (e.g., enforce a minimum of 5 minutes for this parameter).
> We submit a patch that retries after catching `InvalidBlockTokenException` in
> `nextBlockOutputStream`. We can also provide a patch that enforces a larger
> minimum value for `dfs.block.access.token.lifetime` if that is a better way
> to handle this.
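> The retry described above can be sketched as a generic "retry once on a
> specific exception" pattern. This is only an illustration of the idea, not
> the actual patch: the class and method names (`RetryOnce`, `retryOnce`) are
> hypothetical, and the real fix would catch `InvalidBlockTokenException`
> inside `nextBlockOutputStream` and retry the same datanode:
> {code:java}
> import java.util.concurrent.Callable;
>
> public class RetryOnce {
>   // Run the action; if it throws the given (retryable) exception type,
>   // retry a single time on the same target before giving up.
>   static <T> T retryOnce(Callable<T> action,
>       Class<? extends Exception> retryable) throws Exception {
>     try {
>       return action.call();
>     } catch (Exception e) {
>       if (retryable.isInstance(e)) {
>         return action.call(); // one retry, e.g., with a fresh block token
>       }
>       throw e;
>     }
>   }
>
>   public static void main(String[] args) throws Exception {
>     int[] attempts = {0};
>     // Simulate a token that is expired on the first attempt only.
>     String result = retryOnce(() -> {
>       if (attempts[0]++ == 0) {
>         throw new IllegalStateException("block token expired");
>       }
>       return "connected";
>     }, IllegalStateException.class);
>     System.out.println(result + " after " + attempts[0] + " attempts");
>   }
> }
> {code}
> A single retry is enough here because an expired token is refreshed on the
> next request to the namenode; only a repeated failure should mark the
> datanode as excluded.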
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]