[ https://issues.apache.org/jira/browse/HDFS-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999257#comment-14999257 ]
Chris Nauroth commented on HDFS-9409:
-------------------------------------
{{DataNode#shutdown}} calls {{BlockPoolManager#getAllNamenodeThreads}} to get
every {{BPOfferService}}. Then, later in {{shutdown}}, these are passed to
{{BlockPoolManager#shutDownAll}}, which eventually stops and joins each
{{BPServiceActor}} thread. There are a few problems:
# {{BlockPoolManager#getAllNamenodeThreads}} returns an unmodifiable wrapper
over its underlying list, so callers can't mutate the list, but it's still the
same shared backing list. Later during shutdown, each {{BPServiceActor}} is
told that it can exit its main loop. As part of exiting, the
{{BPServiceActor}} thread calls {{BlockPoolManager#remove}}, which removes the
entry from the backing list behind
{{BlockPoolManager#getAllNamenodeThreads}}. The entry can therefore vanish from
the list held by {{DataNode#shutdown}} before that list is passed to
{{BlockPoolManager#shutDownAll}}.
# Even if point 1 is fixed by changing
{{BlockPoolManager#getAllNamenodeThreads}} to return a copy of the list, there
is a similar problem: {{BPOfferService#shutdownActor}} removes the actor from
the {{BPOfferService}}'s internal tracking list.
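The difference between an unmodifiable view and a copy in point 1 can be shown with a small, single-threaded sketch. The class {{SnapshotVsView}} and its methods are hypothetical stand-ins, not the Hadoop classes, and synchronization is deliberately left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SnapshotVsView {
    // Hypothetical stand-in for a manager's internal list of services.
    private final List<String> services = new ArrayList<>();

    public SnapshotVsView() {
        services.add("bpos-1");
        services.add("bpos-2");
    }

    // Read-only view: callers cannot mutate it, but it still reflects
    // every later change made to the backing list.
    public List<String> getView() {
        return Collections.unmodifiableList(services);
    }

    // Snapshot: an independent copy of the references taken at call time.
    public List<String> getSnapshot() {
        return new ArrayList<>(services);
    }

    // Simulates a service deregistering itself while it shuts down.
    public void remove(String service) {
        services.remove(service);
    }

    public static void main(String[] args) {
        SnapshotVsView manager = new SnapshotVsView();
        List<String> view = manager.getView();
        List<String> snapshot = manager.getSnapshot();

        manager.remove("bpos-1");

        System.out.println("view size: " + view.size());         // prints 1
        System.out.println("snapshot size: " + snapshot.size()); // prints 2
    }
}
```

The view loses the entry as soon as the backing list changes, while the snapshot keeps both references. In concurrent code the copy would also need to be taken under whatever lock guards the backing list.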
Because of these two problems, {{DataNode#shutdown}} might no longer hold a
reference to the {{BPServiceActor}} threads by the time it tries to stop and
join them. Therefore, those threads might still be alive even after completion of
{{DataNode#shutdown}}. I noticed this while trying to write a test that
asserts a particular thread has exited after DataNode shutdown.
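One general way to avoid losing references during shutdown is to capture them before signalling any thread to stop, and then join on the captured references. The sketch below illustrates that idea only. {{JoinAllOnShutdown}}, {{startWorker}}, and {{shutdownAll}} are hypothetical names, and this is not the patch for this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class JoinAllOnShutdown {
    // Hypothetical registry; each worker removes itself when it exits.
    static final List<Thread> registry = new CopyOnWriteArrayList<>();

    static Thread startWorker(String name) {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE); // stand-in for the main loop
            } catch (InterruptedException e) {
                // Interrupted: treat as the signal to exit the loop.
            } finally {
                // Self-deregistration, analogous to the removal described above.
                registry.remove(Thread.currentThread());
            }
        }, name);
        t.setDaemon(true);
        registry.add(t); // register before starting so the entry always exists
        t.start();
        return t;
    }

    // Copies the references first, then stops and joins every captured thread.
    static List<Thread> shutdownAll() {
        List<Thread> toJoin = new ArrayList<>(registry);
        for (Thread t : toJoin) {
            t.interrupt();
        }
        try {
            for (Thread t : toJoin) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
        return toJoin;
    }

    public static void main(String[] args) {
        startWorker("worker-1");
        startWorker("worker-2");
        for (Thread t : shutdownAll()) {
            System.out.println(t.getName() + " alive: " + t.isAlive());
        }
    }
}
```

Because {{shutdownAll}} joins on its own copy, a worker that deregisters itself midway is still joined, so a test can assert {{!thread.isAlive()}} once the method returns.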
> DataNode shutdown does not guarantee full shutdown of all threads due to race
> condition.
> ----------------------------------------------------------------------------------------
>
> Key: HDFS-9409
> URL: https://issues.apache.org/jira/browse/HDFS-9409
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Chris Nauroth
>
> {{DataNode#shutdown}} is documented to return "only after shutdown is
> complete". Even after completion of this method, it's possible that threads
> started by the DataNode are still running. Race conditions in the shutdown
> sequence may cause it to skip stopping and joining the {{BPServiceActor}}
> threads.
> This is likely not a big problem in normal operations, because these are
> daemon threads that won't block overall process exit. It is more of a
> problem for tests, because it makes it impossible to write reliable
> assertions that these threads exited cleanly. For large test suites, it can
> also cause an accumulation of unneeded threads, which might harm test
> performance.