[ 
https://issues.apache.org/jira/browse/HDFS-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999257#comment-14999257
 ] 

Chris Nauroth commented on HDFS-9409:
-------------------------------------

{{DataNode#shutdown}} calls {{BlockPoolManager#getAllNamenodeThreads}} to get 
every {{BPOfferService}}.  Then, later in {{shutdown}}, these are passed to 
{{BlockPoolManager#shutDownAll}}, which eventually stops and joins each 
{{BPServiceActor}} thread.  There are a few problems:

# {{BlockPoolManager#getAllNamenodeThreads}} returns an unmodifiable wrapper 
over its underlying list, so callers can't mutate the list, but the wrapper is 
still a view of the same shared backing list.  Later during shutdown, each 
{{BPServiceActor}} is told that it can exit its main loop.  Part of that exit 
path is a call on the {{BPServiceActor}} thread to 
{{BlockPoolManager#remove}}.  This removes the corresponding 
{{BPOfferService}} from the backing list, so it can vanish from the list 
returned by {{BlockPoolManager#getAllNamenodeThreads}} before the call to 
{{BlockPoolManager#shutDownAll}}.
# Even if point 1 is fixed by changing 
{{BlockPoolManager#getAllNamenodeThreads}} to return a copy of the list, there 
is a similar problem in that {{BPOfferService#shutdownActor}} will remove the 
actor from the {{BPOfferService}}'s internal tracking list.
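
To illustrate point 1 in isolation, here is a minimal sketch of how an unmodifiable wrapper behaves compared with a snapshot copy.  It does not use any Hadoop classes: the class name, the {{run}} method, and the string elements are hypothetical stand-ins for the list held by {{BlockPoolManager}}.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class UnmodifiableViewDemo {
    // Returns {size seen through the unmodifiable view, size of the snapshot}
    // after one element has been removed from the backing list.
    static int[] run() {
        // Hypothetical stand-in for the manager's internal list.
        List<String> backing = new ArrayList<>();
        backing.add("bpos-1");
        backing.add("bpos-2");

        // Read-only view: callers cannot mutate it, but it reflects every
        // later change made to the backing list.
        List<String> view = Collections.unmodifiableList(backing);

        // Snapshot: an independent list holding the same element references.
        List<String> snapshot = new ArrayList<>(backing);

        // Simulates a removal performed elsewhere during shutdown.
        backing.remove("bpos-1");

        return new int[] { view.size(), snapshot.size() };
    }

    public static void main(String[] args) {
        int[] sizes = run();
        System.out.println("view size = " + sizes[0]
            + ", snapshot size = " + sizes[1]);
    }
}
```

The element disappears from the view but not from the snapshot.  As point 2 notes, a snapshot at this level alone would not be enough if the objects it references also drop their own internal references.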

Because of these two problems, {{DataNode#shutdown}} might no longer have a 
reference to the {{BPServiceActor}} threads when it tries to stop and join 
them.  Therefore, those threads might still be alive even after completion of 
{{DataNode#shutdown}}.  I noticed this while trying to write a test that 
asserts a particular thread has exited after DataNode shutdown.
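
The general shape of the race, and one way to avoid it, can be sketched without Hadoop: capture the thread references before signalling the workers to stop, then join the captured references rather than re-reading a registry that the workers edit on exit.  Everything in this sketch ({{ShutdownJoinSketch}}, {{registry}}, {{startWorker}}, {{shutdownAndJoin}}) is a hypothetical illustration, not the DataNode code or a proposed patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class ShutdownJoinSketch {
    // Hypothetical registry of worker threads; workers remove themselves on exit.
    static final List<Thread> registry = new CopyOnWriteArrayList<>();
    static volatile boolean shouldRun = true;

    static Thread startWorker(String name) {
        Thread t = new Thread(() -> {
            try {
                while (shouldRun) {
                    Thread.sleep(10);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                // Self-removal on exit: this is what makes a later read of
                // the registry unreliable for finding threads to join.
                registry.remove(Thread.currentThread());
            }
        }, name);
        t.setDaemon(true);
        registry.add(t);
        t.start();
        return t;
    }

    // Returns true only if every thread captured up front has exited.
    static boolean shutdownAndJoin() {
        // Take the snapshot BEFORE telling the workers to stop.
        List<Thread> snapshot = new ArrayList<>(registry);
        shouldRun = false;
        boolean allExited = true;
        try {
            for (Thread t : snapshot) {
                t.join(5000);
                allExited &= !t.isAlive();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return allExited;
    }

    public static void main(String[] args) {
        startWorker("worker-1");
        startWorker("worker-2");
        System.out.println("all exited = " + shutdownAndJoin());
    }
}
```

Because the join loop works from references taken before the stop signal, a worker that removes itself from the registry is still joined, and a test can then assert on {{Thread#isAlive}} reliably.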

> DataNode shutdown does not guarantee full shutdown of all threads due to race 
> condition.
> ----------------------------------------------------------------------------------------
>
>                 Key: HDFS-9409
>                 URL: https://issues.apache.org/jira/browse/HDFS-9409
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Chris Nauroth
>
> {{DataNode#shutdown}} is documented to return "only after shutdown is 
> complete".  Even after completion of this method, it's possible that threads 
> started by the DataNode are still running.  Race conditions in the shutdown 
> sequence may cause it to skip stopping and joining the {{BPServiceActor}} 
> threads.
> This is likely not a big problem in normal operations, because these are 
> daemon threads that won't block overall process exit.  It is more of a 
> problem for tests, because it makes it impossible to write reliable 
> assertions that these threads exited cleanly.  For large test suites, it can 
> also cause an accumulation of unneeded threads, which might harm test 
> performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
