race condition in FSNamesystem.close() causes NullPointerException without
serious consequence - TestHDFSServerPorts unit test failure
--------------------------------------------------------------------------------------------------------------------------------------
Key: HDFS-1878
URL: https://issues.apache.org/jira/browse/HDFS-1878
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 0.20.204.0
Reporter: Matt Foley
Assignee: Matt Foley
Priority: Minor
Fix For: 0.20.205.0
TestHDFSServerPorts was observed to intermittently throw a
NullPointerException. The exception can only occur when FSNamesystem.close()
is called, which happens at Namenode termination, so this is not a serious bug
for .204. TestHDFSServerPorts is more likely than normal execution to trigger
the race because it runs two Namenodes in the same JVM, which produces more
thread interleaving and more opportunity for the race to surface.
The race is in FSNamesystem.close(). At line 566 we have:
    if (replthread != null) replthread.interrupt();
    if (replmon != null) replmon = null;
The interrupted replthread is not waited on, so replmon can be nulled before
replthread is dead. replthread still references replmon in
computeDatanodeWork(), which is where the NullPointerException occurs.
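The interleaving described above can be reproduced in isolation. The following is a minimal sketch, not the FSNamesystem code: the class CloseRaceSketch, the Monitor type, its computeWork() method, and the latch are all hypothetical stand-ins, and the latch exists only to force the unlucky ordering (the worker is still mid-iteration when close() returns) so that the failure shows up on every run instead of intermittently.

```java
import java.util.concurrent.CountDownLatch;

// Illustrative sketch only; every name here is hypothetical.
final class CloseRaceSketch {
    /** Stand-in for the object that the worker thread dereferences. */
    static final class Monitor {
        int computeWork() { return 1; }
    }

    private volatile Monitor replmon = new Monitor();
    private volatile Throwable failure;
    private final CountDownLatch closeReturned = new CountDownLatch(1);
    private final Thread replthread = new Thread(this::workLoop);

    private void workLoop() {
        // Force the bad interleaving: stay "mid-iteration" until close()
        // has finished, swallowing the interrupt in the meantime.
        boolean released = false;
        while (!released) {
            try {
                closeReturned.await();
                released = true;
            } catch (InterruptedException e) {
                // Interrupt seen, but this iteration has not finished yet.
            }
        }
        try {
            replmon.computeWork(); // replmon has already been nulled
        } catch (NullPointerException e) {
            failure = e;
        }
    }

    /** Same shape as the two lines quoted from close(): interrupt, then null. */
    void close() {
        if (replthread != null) replthread.interrupt();
        if (replmon != null) replmon = null;
        closeReturned.countDown(); // test scaffolding, not part of the pattern
    }

    /** Runs the scenario once and reports whether the worker hit an NPE. */
    static boolean workerHitsNpe() {
        CloseRaceSketch ns = new CloseRaceSketch();
        ns.replthread.start();
        ns.close();
        try {
            ns.replthread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return ns.failure instanceof NullPointerException;
    }

    public static void main(String[] args) {
        System.out.println("worker hit NPE: " + workerHitsNpe());
    }
}
```

In the real Namenode there is no latch, so whether the worker reads replmon before or after it is nulled depends on scheduling, which would be consistent with the failure being intermittent.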
The solution is either to wait on replthread or simply not to null replmon.
The latter is preferred, since none of the sibling Namenode processing threads
are waited on in close().
I'll attach a patch for .205.