[
https://issues.apache.org/jira/browse/HDFS-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455262#comment-13455262
]
nkeywal commented on HDFS-3912:
-------------------------------
Some thinking, with an HBase bias:
- if the datanode is too busy to heartbeat within a minute, we will also get
timeouts when writing the blocks. If the datanode is dead, we hit the 20s
connect timeout. If it's not dead, or if we already had a connection, we
fail on the read timeout for the ack, which is around 1 minute by default.
- the recovery is on the critical path, so going to a suspicious node is not
something you want to do.
- things are already quite complicated, so I think I would end up with the same
value for reads & writes to keep things simple.
Then there is the case where many nodes are stale. I think we're in really
bad shape at this stage... I feel that just throwing an exception is the best
solution. HBase would wait a few seconds and retry. That's better for the
cluster than trying a node that is unlikely to execute the write. But it is a
change from today's behavior.
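
A minimal sketch of the "wait a few seconds and retry" behavior on the client
side. This is not HBase or HDFS code: NoHealthyDatanodeError, write_with_retry
and the default attempt count and wait time are all hypothetical names and
values chosen for illustration.

```python
import time


class NoHealthyDatanodeError(Exception):
    """Hypothetical error: raised when too many datanodes are stale."""


def write_with_retry(write_once, attempts=3, wait_seconds=2.0, sleep=time.sleep):
    """Call write_once(); on NoHealthyDatanodeError, wait and try again.

    The last failure is re-raised once `attempts` tries have been used.
    `sleep` is injectable so the wait can be observed without blocking.
    """
    for attempt in range(1, attempts + 1):
        try:
            return write_once()
        except NoHealthyDatanodeError:
            if attempt == attempts:
                raise
            sleep(wait_seconds)
```

The point of the design is that the client backs off instead of pushing the
write to a node that probably cannot serve it.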
To summarize, this could make sense imho:
- there are enough fully alive nodes: let's use them, whatever the number of
stale nodes.
- there are not enough fully alive nodes, but there are some stale nodes that
we could use: let's use the stale nodes; at least the behavior will be
backward compatible.
- there are not enough live nodes: behave as today.
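
The three cases above can be sketched as a small selection function. This is
only an illustration under stated assumptions, not the HDFS block placement
code: choose_targets and its parameters are hypothetical, nodes are plain
strings, and "as today" is assumed to mean returning fewer targets than
requested and letting the caller deal with it.

```python
def choose_targets(fresh, stale, replicas):
    """Pick up to `replicas` datanodes, preferring fresh nodes over stale ones.

    fresh: nodes that heartbeated recently (fully alive)
    stale: nodes that are alive but missed recent heartbeats
    """
    if len(fresh) >= replicas:
        # Case 1: enough fully alive nodes, so stale nodes are never used.
        return fresh[:replicas]
    # Case 2: not enough fresh nodes, so top up with stale ones.
    # Case 3: if that is still not enough, the result is simply shorter
    # than `replicas` (assumed to match today's behavior).
    needed = replicas - len(fresh)
    return fresh + stale[:needed]
```

For example, with one fresh node, three stale nodes and a replication factor
of 3, the function returns the fresh node followed by two stale ones.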
> Detecting and avoiding stale datanodes for writing
> --------------------------------------------------
>
> Key: HDFS-3912
> URL: https://issues.apache.org/jira/browse/HDFS-3912
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Jing Zhao
> Assignee: Jing Zhao
>
> 1. Make stale timeout adaptive to the number of nodes marked stale in the
> cluster.
> 2. Consider having a separate configuration for write skipping the stale
> nodes.