[ 
https://issues.apache.org/jira/browse/HDFS-12703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879917#comment-16879917
 ] 

He Xiaoqiao edited comment on HDFS-12703 at 7/7/19 6:10 PM:
------------------------------------------------------------

Uploaded patch [^HDFS-12703.005.patch] with a unit test to try to fix this issue.
After digging into the decommission logic, I think the root cause is that the {{DatanodeDescriptor}} interface is not thread-safe. Consider that while {{DatanodeAdminManager#monitor}} is running, another thread sets the {{adminState}} of the corresponding DataNode to {{Decommissioned}}; the issue will then reproduce.
 [^HDFS-12703.005.patch] simply catches the exception, removes the DataNode from {{outOfServiceNodeBlocks}}, and pushes it back to {{pendingNodes}}, so it will be processed in the next loop.
{quote}Does it need a restart or another refreshNodes to take it out of the 
invalid state?
{quote}
Since the check is postponed and the DataNode will reach the proper state in the next loop, we do not need to operate on the DataNode or run refreshNodes again.
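
The catch-and-requeue idea described above can be sketched as follows. This is a minimal, self-contained simulation, not the actual patch: the field names ({{outOfServiceNodeBlocks}}, {{pendingNodes}}) mirror {{DatanodeAdminManager}}, but the {{processNode}} helper and the string-based state model are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hedged sketch of the catch-and-requeue approach: a node whose state was
// changed by another thread throws during processing; instead of letting the
// exception escape (and kill the scheduled task), the monitor drops the node
// from tracking and requeues it for the next cycle.
public class AdminMonitorSketch {
  // Simplified stand-ins for the DatanodeAdminManager collections.
  final Map<String, String> outOfServiceNodeBlocks = new HashMap<>();
  final Deque<String> pendingNodes = new ArrayDeque<>();

  // Hypothetical per-node check that throws when the node's adminState was
  // flipped out from under the monitor (the race described above).
  void processNode(String node, String state) {
    if ("DECOMMISSIONED".equals(state)) {
      throw new IllegalStateException("unexpected state for " + node);
    }
  }

  // One monitor pass: exceptions are contained per node.
  void check() {
    Iterator<Map.Entry<String, String>> it =
        outOfServiceNodeBlocks.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, String> e = it.next();
      try {
        processNode(e.getKey(), e.getValue());
      } catch (Exception ex) {
        // Remove from tracking and push back for the next loop.
        it.remove();
        pendingNodes.add(e.getKey());
      }
    }
  }

  public static void main(String[] args) {
    AdminMonitorSketch m = new AdminMonitorSketch();
    m.outOfServiceNodeBlocks.put("dn1", "DECOMMISSION_INPROGRESS");
    m.outOfServiceNodeBlocks.put("dn2", "DECOMMISSIONED"); // raced node
    m.check();
    System.out.println("pending = " + m.pendingNodes);
  }
}
```

In the sketch, only the raced node is requeued; healthy nodes keep being processed, so one bad node no longer stalls all decommissioning.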

To [~xuel1], I just assigned the JIRA to myself; please feel free to assign it back if you would like to continue working on this issue before it is resolved.



> Exceptions are fatal to decommissioning monitor
> -----------------------------------------------
>
>                 Key: HDFS-12703
>                 URL: https://issues.apache.org/jira/browse/HDFS-12703
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.0
>            Reporter: Daryn Sharp
>            Assignee: He Xiaoqiao
>            Priority: Critical
>         Attachments: HDFS-12703.001.patch, HDFS-12703.002.patch, 
> HDFS-12703.003.patch, HDFS-12703.004.patch, HDFS-12703.005.patch
>
>
> The {{DecommissionManager.Monitor}} runs as an executor scheduled task.  If 
> an exception occurs, all decommissioning ceases until the NN is restarted.  
> Per javadoc for {{executor#scheduleAtFixedRate}}: *If any execution of the 
> task encounters an exception, subsequent executions are suppressed*.  The 
> monitor thread is alive but blocked waiting for an executor task that will 
> never come.  The code currently disposes of the future so the actual 
> exception that aborted the task is gone.
> Failover is insufficient since the task is also likely dead on the standby.  
> Replication queue init after the transition to active will fix the under 
> replication of blocks on currently decommissioning nodes but future nodes 
> never decommission.  The standby must be bounced prior to failover – and 
> hopefully the error condition does not reoccur.
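
The {{scheduleAtFixedRate}} behavior the description cites can be demonstrated directly. A small self-contained sketch (the class name, counter, and timing values are my own, not Hadoop code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Demonstrates that once a task scheduled with scheduleAtFixedRate throws,
// no further executions occur, exactly as the javadoc warns.
public class SuppressedTaskDemo {
  static int runWithFailure() throws InterruptedException {
    ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();
    AtomicInteger runs = new AtomicInteger();
    executor.scheduleAtFixedRate(() -> {
      // Throw on the second run; all subsequent executions are suppressed.
      if (runs.incrementAndGet() == 2) {
        throw new RuntimeException("simulated monitor failure");
      }
    }, 0, 50, TimeUnit.MILLISECONDS);
    Thread.sleep(500); // enough time for ~10 periods to elapse
    executor.shutdownNow();
    return runs.get();
  }

  public static void main(String[] args) throws InterruptedException {
    // Only 2 runs happen despite ~10 elapsed periods.
    System.out.println("runs = " + runWithFailure());
  }
}
```

Note also that the exception is only observable via the returned {{Future}}; if the future is discarded, as the description says the current code does, the cause of the abort is lost.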



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
