[
https://issues.apache.org/jira/browse/HDFS-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446916#comment-13446916
]
Steve Loughran commented on HDFS-3886:
--------------------------------------
I don't think you could easily do much with init.d, as that is invoked by the
OS during a shutdown, when it may be tearing down large parts of the
system: fast shutdowns are always appreciated, before the monitoring layers
escalate. The same goes for Linux clustering resource agents: the slower the
shutdown, the longer it takes to migrate a service to a new node in the HA
cluster.
Perhaps a way could be provided over RPC to tell the NN to block and checkpoint;
dfsAdmin could be the gateway to this. If you could do this without even
stopping the process, you would have something you can test more easily and a
better ops experience: you just issue a {{hadoop dfsadmin --checkpoint}} command,
your NN goes into safe mode briefly, the edit logs are sorted out and things
continue.
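For comparison, something close to this can already be approximated by hand
with the existing dfsadmin subcommands, though not as a single command. A
minimal sketch, assuming a running NN and HDFS superuser rights; the function
name {{checkpoint_nn}} and the {{HDFS_CMD}} variable are made up for
illustration and are not part of Hadoop:

```shell
# Sketch only: checkpoint a running NameNode without stopping the process.
# HDFS_CMD is a hypothetical override so the sequence can be dry-run,
# e.g. HDFS_CMD=echo; it defaults to the real "hdfs" launcher.
checkpoint_nn() {
    hdfs_cmd="${HDFS_CMD:-hdfs}"

    # saveNamespace is only permitted while the NN is in safe mode.
    "$hdfs_cmd" dfsadmin -safemode enter || return 1

    # Write the current namespace to a new image and reset the edit log.
    rc=0
    "$hdfs_cmd" dfsadmin -saveNamespace || rc=1

    # Always try to leave safe mode, even if the save failed.
    "$hdfs_cmd" dfsadmin -safemode leave || rc=1

    return "$rc"
}
```

Running {{HDFS_CMD=echo checkpoint_nn}} prints the three dfsadmin invocations
without touching a cluster, which is roughly the "test it without stopping
the process" property described above.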
> Shutdown requests can possibly check for checkpoint issues (corrupted edits)
> and save a good namespace copy before closing down?
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-3886
> URL: https://issues.apache.org/jira/browse/HDFS-3886
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 2.0.0-alpha
> Reporter: Harsh J
> Priority: Minor
>
> HDFS-3878 sort of gives me this idea. Aside from having a method to download
> it to a different location, we could also lock the namesystem (or deactivate
> the client RPC server) and save the namesystem before we complete the
> shutdown.
> The init.d/shutdown scripts would have to cooperate with this somehow, so
> that they do not kill -9 the process while the save is in progress. Also, the
> new image could be stored in a shutdown.chkpt directory, so that it does not
> interfere with the regular dirs but still allows easier recovery.
> Obviously this will still not work if all directories are broken. So maybe we
> could have some configs to tackle that as well?
> I haven't thought this through, so let me know what part is wrong to do :)