[
https://issues.apache.org/jira/browse/HDFS-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960391#comment-14960391
]
Steve Loughran commented on HDFS-9249:
--------------------------------------
Wei, in preparing for the next CDH release, you don't something wonderful
1. Found a new way for the NN to fail, and got a stack trace off it.
2. Delved into that stack trace to come up with some reasonable hypothesis as
to the cause
Those are achievements to exploit
bq. It's supportability improvement therefore I deem there is no need for a
test case.
If it was just the logging I'd agree, though looking at the destination code
I'd ask for what I'm asking for first: use the java7 combined exception handler
to also catch and log the {{HadoopIllegalArgumentException}}.
But It's a fix for an NPE on shutdown, which is something we can verify.
Given the code path you highlit does appear to trigger an IOE if you enable
hadoop auth without defining a NN principal, the IOE should be repeatedly
triggered. This means we can have a test which sets up an NN appropriately and
then triggers the failure.
*This can then verify that the changes to the {{stop()}} call are the ones
needed for a robust shutdown.*
That's why I want a test: to verify that all the problems have been identified
and fixed.
You've done the hard work: identify a possible cause, come up with a probable
fix. All it needs is a little test to round it off -one which will even
validate your hypothesis about probable cause.
> NPE thrown if an IOException is thrown in NameNode.<init>
> ---------------------------------------------------------
>
> Key: HDFS-9249
> URL: https://issues.apache.org/jira/browse/HDFS-9249
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Priority: Minor
> Labels: supportability
> Attachments: HDFS-9249.001.patch
>
>
> This issue was found when running test case
> TestBackupNode.testCheckpointNode, but upon closer look, the problem is not
> due to the test case.
> Looks like an IOException was thrown in
> try {
> initializeGenericKeys(conf, nsId, namenodeId);
> initialize(conf);
> try {
> haContext.writeLock();
> state.prepareToEnterState(haContext);
> state.enterState(haContext);
> } finally {
> haContext.writeUnlock();
> }
> causing the namenode to stop, but the namesystem was not yet properly
> instantiated, causing NPE.
> I tried to reproduce locally, but to no avail.
> Because I could not reproduce the bug, and the log does not indicate what
> caused the IOException, I suggest make this a supportability JIRA to log the
> exception for future improvement.
> Stacktrace
> java.lang.NullPointerException: null
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.getFSImage(NameNode.java:906)
> at org.apache.hadoop.hdfs.server.namenode.BackupNode.stop(BackupNode.java:210)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:827)
> at
> org.apache.hadoop.hdfs.server.namenode.BackupNode.<init>(BackupNode.java:89)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1474)
> at
> org.apache.hadoop.hdfs.server.namenode.TestBackupNode.startBackupNode(TestBackupNode.java:102)
> at
> org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpoint(TestBackupNode.java:298)
> at
> org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpointNode(TestBackupNode.java:130)
> The last few lines of log:
> 2015-10-14 19:45:07,807 INFO namenode.NameNode
> (NameNode.java:createNameNode(1422)) - createNameNode [-checkpoint]
> 2015-10-14 19:45:07,807 INFO impl.MetricsSystemImpl
> (MetricsSystemImpl.java:init(158)) - CheckpointNode metrics system started
> (again)
> 2015-10-14 19:45:07,808 INFO namenode.NameNode
> (NameNode.java:setClientNamenodeAddress(402)) - fs.defaultFS is
> hdfs://localhost:37835
> 2015-10-14 19:45:07,808 INFO namenode.NameNode
> (NameNode.java:setClientNamenodeAddress(422)) - Clients are to use
> localhost:37835 to access this namenode/service.
> 2015-10-14 19:45:07,810 INFO hdfs.MiniDFSCluster
> (MiniDFSCluster.java:shutdown(1708)) - Shutting down the Mini HDFS Cluster
> 2015-10-14 19:45:07,810 INFO namenode.FSNamesystem
> (FSNamesystem.java:stopActiveServices(1298)) - Stopping services started for
> active state
> 2015-10-14 19:45:07,811 INFO namenode.FSEditLog
> (FSEditLog.java:endCurrentLogSegment(1228)) - Ending log segment 1
> 2015-10-14 19:45:07,811 INFO namenode.FSNamesystem
> (FSNamesystem.java:run(5306)) - NameNodeEditLogRoller was interrupted, exiting
> 2015-10-14 19:45:07,811 INFO namenode.FSEditLog
> (FSEditLog.java:printStatistics(703)) - Number of transactions: 3 Total time
> for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of
> syncs: 4 SyncTimes(ms): 2 1
> 2015-10-14 19:45:07,811 INFO namenode.FSNamesystem
> (FSNamesystem.java:run(5373)) - LazyPersistFileScrubber was interrupted,
> exiting
> 2015-10-14 19:45:07,822 INFO namenode.FileJournalManager
> (FileJournalManager.java:finalizeLogSegment(142)) - Finalizing edits file
> /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_inprogress_0000000000000000001
> ->
> /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_0000000000000000001-0000000000000000003
> 2015-10-14 19:45:07,835 INFO namenode.FileJournalManager
> (FileJournalManager.java:finalizeLogSegment(142)) - Finalizing edits file
> /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name2/current/edits_inprogress_0000000000000000001
> ->
> /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name2/current/edits_0000000000000000001-0000000000000000003
> 2015-10-14 19:45:07,836 INFO blockmanagement.CacheReplicationMonitor
> (CacheReplicationMonitor.java:run(169)) - Shutting down
> CacheReplicationMonitor
> 2015-10-14 19:45:07,836 INFO ipc.Server (Server.java:stop(2485)) - Stopping
> server on 37835
> 2015-10-14 19:45:07,837 INFO ipc.Server (Server.java:run(718)) - Stopping IPC
> Server listener on 37835
> 2015-10-14 19:45:07,837 INFO ipc.Server (Server.java:run(844)) - Stopping IPC
> Server Responder
> 2015-10-14 19:45:07,837 INFO blockmanagement.BlockManager
> (BlockManager.java:run(3781)) - Stopping ReplicationMonitor.
> 2015-10-14 19:45:07,838 WARN blockmanagement.DecommissionManager
> (DecommissionManager.java:run(78)) - Monitor interrupted:
> java.lang.InterruptedException: sleep interrupted
> 2015-10-14 19:45:07,844 INFO namenode.FSNamesystem
> (FSNamesystem.java:stopActiveServices(1298)) - Stopping services started for
> active state
> 2015-10-14 19:45:07,845 INFO namenode.FSNamesystem
> (FSNamesystem.java:stopStandbyServices(1386)) - Stopping services started for
> standby state
> 2015-10-14 19:45:07,848 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped
> HttpServer2$SelectChannelConnectorWithSafeStartup@localhost:0
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)