[
https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508264
]
Doug Cutting commented on HADOOP-1486:
--------------------------------------
> whether to have a monitoring daemon that restarts namenode automatically
It seems safe to restart the namenode in this case. I'd simply add a loop to
NameNode.main() that creates and starts a new NameNode when the existing
namenode exits unexpectedly. We should only restart if it's stopping due to an
error, and not due to an explicit call to stop(). So perhaps NameNode#join()
could return a boolean indicating whether it's exiting normally or should be
restarted, and the catch in the ReplicationMonitor should call a NameNode
method to trigger that kind of exit. Does this sound workable?
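
A minimal sketch of the loop described above, assuming a hypothetical Node class standing in for the namenode (this is not the actual NameNode code, and the names startMonitor, exitForRestart and runUntilNormalExit are invented for illustration). join() returns true only when the exit was triggered by the monitor's catch block, and the outer loop creates a fresh instance in that case:

```java
import java.util.concurrent.CountDownLatch;
import java.util.function.Supplier;

public class RestartLoopSketch {

    /** Hypothetical stand-in for the namenode; not the real Hadoop class. */
    static class Node {
        private final CountDownLatch exited = new CountDownLatch(1);
        private volatile boolean restartRequested = false;

        /** Runs work on a background thread, the way a monitor thread would. */
        void startMonitor(Runnable work) {
            Thread t = new Thread(() -> {
                try {
                    work.run();
                } catch (Throwable e) {
                    // Rather than letting the thread die silently, request a restart.
                    exitForRestart();
                }
            }, "monitor");
            t.setDaemon(true);
            t.start();
        }

        /** Explicit shutdown: join() will report a normal exit. */
        void stop() {
            exited.countDown();
        }

        /** Error path: join() will report that a restart is wanted. */
        void exitForRestart() {
            restartRequested = true;
            exited.countDown();
        }

        /** Blocks until the node exits; returns true if it should be restarted. */
        boolean join() {
            try {
                exited.await();
                return restartRequested;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // treat interruption as a request to exit without restarting
            }
        }
    }

    /** Creates and joins nodes until one exits normally; returns how many were started. */
    static int runUntilNormalExit(Supplier<Node> factory) {
        int started = 0;
        boolean restart;
        do {
            Node node = factory.get();
            started++;
            restart = node.join();
        } while (restart);
        return started;
    }

    public static void main(String[] args) {
        int[] created = {0};
        int started = runUntilNormalExit(() -> {
            Node n = new Node();
            if (created[0]++ < 2) {
                // Simulate the monitor dying on an unexpected exception.
                n.startMonitor(() -> { throw new IllegalArgumentException("simulated"); });
            } else {
                n.stop(); // simulate an explicit shutdown
            }
            return n;
        });
        System.out.println("instances started: " + started); // prints "instances started: 3"
    }
}
```

The sketch leaves out everything a real restart would need, such as releasing ports and reloading state, and only shows how the boolean from join() separates the two kinds of exit.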
> ReplicationMonitor thread goes away
> ------------------------------------
>
> Key: HADOOP-1486
> URL: https://issues.apache.org/jira/browse/HADOOP-1486
> Project: Hadoop
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.12.3
> Reporter: Koji Noguchi
> Assignee: dhruba borthakur
> Priority: Blocker
> Fix For: 0.14.0
>
> Attachments: catchThrowable2.patch
>
>
> Saw many over/under replicated blocks in fsck output.
> .out file showed
> Exception in thread "[EMAIL PROTECTED]" java.lang.IllegalArgumentException: Unexpected non-existing data node: /99.9.99.0/99.9.99.42:99999
>     at org.apache.hadoop.net.NetworkTopology.checkArgument(NetworkTopology.java:379)
>     at org.apache.hadoop.net.NetworkTopology.isOnSameRack(NetworkTopology.java:424)
>     at org.apache.hadoop.dfs.FSNamesystem$ReplicationTargetChooser.chooseTarget(FSNamesystem.java:2853)
>     at org.apache.hadoop.dfs.FSNamesystem$ReplicationTargetChooser.chooseTarget(FSNamesystem.java:2816)
>     at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2658)
>     at org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1774)
>     at org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1723)
>     at java.lang.Thread.run(Thread.java:619)
> (same as HADOOP-1232)
> And, jstack showed no ReplicationMonitor thread.