[
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391353#comment-14391353
]
Andrew Wang commented on HDFS-7725:
-----------------------------------
Thanks for working on this, Ming. Nice find; the patch looks basically good.
Just a few comments:
I agree with Zhe's original review comment above: I think we should move the
liveness check for both start and stop into the heartbeat manager. This way
the caller doesn't have to worry about it.
It would also be good to add "alive" or "dead" to the first log message in
stopDecommission too, to give admins more information about node state.
Do we also need assert checks in the test after recommissioning the dead node?
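To make the first two suggestions concrete, here is a minimal sketch, assuming
a much simplified model of the bookkeeping. The class and method names
(SketchNode, SketchHeartbeatManager, describe) and the log wording are made up
for illustration; this is not the patch and not the real HeartbeatManager
code. The point is only that start and stop both check liveness internally, so
callers never adjust the counter for a dead node.

```java
// Hypothetical sketch only; not the actual HDFS classes or the patch.
class SketchNode {
    boolean alive;
    boolean decommissioned;
}

class SketchHeartbeatManager {
    private int nodesInService;

    int getNodesInService() { return nodesInService; }

    // A node contributes to the counter only while it is not decommissioned.
    private void add(SketchNode n)      { if (!n.decommissioned) nodesInService++; }
    private void subtract(SketchNode n) { if (!n.decommissioned) nodesInService--; }

    void register(SketchNode n) { n.alive = true; add(n); }
    void markDead(SketchNode n) { n.alive = false; subtract(n); }

    // Made-up helper for the "alive"/"dead" log suggestion.
    static String describe(SketchNode n) { return n.alive ? "alive" : "dead"; }

    // Liveness is checked here, so the caller does not have to.
    void startDecommission(SketchNode n) {
        System.out.println("Starting decommission of " + describe(n) + " node");
        if (n.alive) subtract(n);   // dead nodes were already subtracted
        n.decommissioned = true;
        if (n.alive) add(n);
    }

    void stopDecommission(SketchNode n) {
        System.out.println("Stopping decommission of " + describe(n) + " node");
        if (n.alive) subtract(n);
        n.decommissioned = false;
        if (n.alive) add(n);        // dead nodes are re-added on re-registration
    }
}

class SketchHeartbeatManagerDemo {
    public static void main(String[] args) {
        SketchHeartbeatManager hm = new SketchHeartbeatManager();
        SketchNode n = new SketchNode();
        hm.register(n);           // node is live
        hm.markDead(n);           // node dies
        hm.startDecommission(n);  // decommission the dead node
        hm.stopDecommission(n);   // recommission the dead node
        System.out.println("nodesInService = " + hm.getNodesInService());
    }
}
```

In this model the counter stays at 0 for as long as the node is dead and only
returns to 1 when the node registers again, which is also the kind of check
the test could assert after recommissioning.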
> Incorrect "nodes in service" metrics caused all writes to fail
> --------------------------------------------------------------
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: HDFS-7725-2.patch, HDFS-7725.patch
>
>
> One of our clusters sometimes couldn't allocate blocks from any DNs.
> BlockPlacementPolicyDefault complains with the following messages for all DNs.
> {noformat}
> the node is too busy (load:x > y)
> {noformat}
> It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed
> incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.
> * Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is
> intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of
> events without HDFS-7374.
> ** Cluster has one live node. nodesInService == 1
> ** The node becomes dead. nodesInService == 0
> ** Decomm the node. nodesInService == -1
> * However, HDFS-7374 introduces another inconsistency when recomm is involved.
> ** Cluster has one live node. nodesInService == 1
> ** The node becomes dead. nodesInService == 0
> ** Decomm the node. nodesInService == 0
> ** Recomm the node. nodesInService == 1
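
The two sequences quoted above can be replayed with a toy counter model. This
is a sketch under assumptions: ToyNode, ToyStats and the checkLivenessOnStart
flag are hypothetical names, and the subtract/update/add pattern is a
simplification rather than the actual HDFS code. With the flag off, the model
follows the first sequence (no liveness check anywhere); with it on, it
follows the second (liveness checked on start but not on stop).

```java
// Toy model of the counter bookkeeping; names are hypothetical and this is
// not the actual HDFS implementation.
class ToyNode {
    boolean alive;
    boolean decommissioned;
}

class ToyStats {
    int nodesInService;
    // true models a liveness check on startDecommission only
    final boolean checkLivenessOnStart;

    ToyStats(boolean checkLivenessOnStart) {
        this.checkLivenessOnStart = checkLivenessOnStart;
    }

    // A node is counted only while it is not decommissioned.
    void add(ToyNode n)      { if (!n.decommissioned) nodesInService++; }
    void subtract(ToyNode n) { if (!n.decommissioned) nodesInService--; }

    void register(ToyNode n) { n.alive = true; add(n); }
    void markDead(ToyNode n) { n.alive = false; subtract(n); }

    void startDecommission(ToyNode n) {
        if (checkLivenessOnStart && !n.alive) {
            n.decommissioned = true;   // dead node: leave the counter alone
            return;
        }
        subtract(n);                   // wrongly subtracts a dead node again
        n.decommissioned = true;
        add(n);
    }

    // No liveness check here: a dead node gets counted as in service.
    void stopDecommission(ToyNode n) {
        subtract(n);
        n.decommissioned = false;
        add(n);
    }
}

class ToyReplay {
    public static void main(String[] args) {
        // First sequence: decomm a dead node without any liveness check.
        ToyStats first = new ToyStats(false);
        ToyNode a = new ToyNode();
        first.register(a);
        first.markDead(a);
        first.startDecommission(a);
        System.out.println(first.nodesInService);   // -1

        // Second sequence: liveness check on start, none on stop.
        ToyStats second = new ToyStats(true);
        ToyNode b = new ToyNode();
        second.register(b);
        second.markDead(b);
        second.startDecommission(b);
        second.stopDecommission(b);
        System.out.println(second.nodesInService);  // 1, although the node is dead
    }
}
```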