[ 
https://issues.apache.org/jira/browse/HDFS-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-4006:
------------------------------

    Attachment: hdfs-4006.txt

I think this patch will fix the issue.

The issue was the following:
In testCheckpointTriggerOnTxnCount we were setting up a thread to run the SNN's
checkpoint work loop, but not joining on it at the end of the test. This caused
a race: the snn.close() call invoked SecondaryNameNode.storage.close(), which
cleared the list of storage directories, so the getFsImageName() call returned
null if it raced with the completion of a checkpoint. I was able to reproduce
this reliably by adding a sleep before the getFsImageName call, and then adding
a join on the thread at the end of the test.

The fix is to actually make the checkpointer thread a member of the 
SecondaryNameNode, so that it can be properly shut down.
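
For illustration, the ownership pattern described above might look roughly like
the sketch below. This is not the attached patch: the class OwnedWorker and all
of its method and field names are hypothetical, and the sleep stands in for the
real checkpoint work. The point is only that the owner keeps a reference to its
worker thread and joins it before releasing shared state.

```java
/** Hypothetical stand-in for a daemon that owns its background worker thread. */
class OwnedWorker {
    private volatile boolean shouldRun = true;
    // Kept as a member (assumption for this sketch) so shutdown() can join it.
    private Thread workerThread;

    /** Starts the background work loop exactly once. */
    synchronized void start() {
        if (workerThread != null) {
            throw new IllegalStateException("already started");
        }
        workerThread = new Thread(this::workLoop, "hypothetical-worker");
        workerThread.setDaemon(true);
        workerThread.start();
    }

    private void workLoop() {
        while (shouldRun) {
            try {
                Thread.sleep(10); // placeholder for one round of periodic work
            } catch (InterruptedException e) {
                // Woken by shutdown(); the loop condition decides whether to exit.
            }
        }
    }

    /** Stops the loop and waits for the worker before shared state is closed. */
    void shutdown() throws InterruptedException {
        shouldRun = false;
        Thread t;
        synchronized (this) {
            t = workerThread;
        }
        if (t != null) {
            t.interrupt();
            t.join(); // the worker can no longer race with the cleanup below
        }
        // Only at this point would it be safe to close storage or similar state.
    }

    synchronized boolean isWorkerAlive() {
        return workerThread != null && workerThread.isAlive();
    }
}
```

With this shape, a test that calls shutdown() cannot return while the worker is
still in the middle of its loop, which is the property the race above lacked.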

I also added code to the test that checks for any leftover checkpointer threads 
between tests as an extra safeguard against this kind of test bug.
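
A leftover-thread check of this kind could be sketched as follows. Again, this
is an assumption-laden illustration rather than the code in the patch: the
class ThreadLeakCheck, its method names, and the idea of matching on a thread
name fragment are all hypothetical. It relies only on the standard
Thread.getAllStackTraces() call to enumerate live threads.

```java
/** Hypothetical helper for asserting that no named worker threads survive a test. */
class ThreadLeakCheck {
    /** Counts live threads whose name contains the given fragment. */
    static int countLiveThreads(String nameFragment) {
        int count = 0;
        // getAllStackTraces() returns a map keyed by every live thread.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && t.getName().contains(nameFragment)) {
                count++;
            }
        }
        return count;
    }

    /** Fails if any matching thread is still running. */
    static void assertNoLeakedThreads(String nameFragment) {
        int n = countLiveThreads(nameFragment);
        if (n != 0) {
            throw new AssertionError(
                n + " leftover thread(s) matching '" + nameFragment + "'");
        }
    }
}
```

Calling assertNoLeakedThreads from a per-test teardown hook would flag a test
that forgot to stop its worker, instead of letting the stray thread interfere
with whichever test runs next.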
                
> TestCheckpoint#testSecondaryHasVeryOutOfDateImage occasionally fails due to 
> unexpected exit
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-4006
>                 URL: https://issues.apache.org/jira/browse/HDFS-4006
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 2.0.0-alpha
>            Reporter: Eli Collins
>            Assignee: Todd Lipcon
>              Labels: test-fail
>         Attachments: hdfs-4006.txt, test-log.txt
>
>
> TestCheckpoint#testSecondaryHasVeryOutOfDateImage occasionally fails due to
> an unexpected exit caused by an NPE while checkpointing. It looks like the
> background checkpoint fails because it conflicts with the explicit
> checkpoints done by the tests (note that the backtrace is not for the
> doCheckpoint calls in the tests).
> {noformat}
> 2012-09-16 01:55:05,901 FATAL hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1355)) - Test resulted in an unexpected exit
> org.apache.hadoop.util.ExitUtil$ExitException: Fatal exception with message null
> stack trace
> java.lang.NullPointerException
> at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:480)
> at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:331)
> at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:298)
> at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:452)
> at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:294)
> at java.lang.Thread.run(Thread.java:662)
> {noformat}

