After going through the comments in the patch, what I could gather is that
applying this patch will simply enable us to start the namenode, in
case of a failure, from the fsimage in fs.checkpoint.dir instead of
looking at the standard location <dfs.name.dir>/current/fsimage.
If the namenode goes down and we have the checkpointed directory on the
secondary namenode, the namenode can be started up on that machine. But this
would require manually starting the namenode on the secondary namenode
server with the -importcheckpoint option. Please correct me if anything is
wrong with my understanding of the application of HADOOP-2585.
Your understanding is absolutely correct.
But you need to keep one thing in mind: importing from a checkpoint means what it
says, you are importing the latest checkpointed image.
That might not be the latest image your cluster has. For example, the default
checkpoint interval is 60 minutes (I think), which means that if you are running a
namenode which stopped, say, 50 minutes after the last checkpoint, trying to
import the old image would end up losing all 50 minutes of changes. You should
first try to start the namenode as is, with the existing image, and see if the
namenode boots up; sometimes it does. Only if you are unable to should you try the
checkpointed image.
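In case it helps, here is a rough sketch of that order in a small script:
first try the existing image, and only fall back to the checkpoint if that
fails. The install paths, the dfsadmin -report health probe, the wait time
and the flag spelling below are assumptions you would adapt to your own
setup (and the fallback step has to run on a node that actually holds the
checkpointed image):

import subprocess
import time

HADOOP = "/path/to/hadoop/bin/hadoop"              # hypothetical install path
HADOOP_DAEMON = "/path/to/hadoop/bin/hadoop-daemon.sh"

def namenode_is_up():
    # Probe the namenode by asking for a DFS report; a zero exit code
    # means the namenode answered.
    return subprocess.call([HADOOP, "dfsadmin", "-report"],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0

def try_start(extra_args=()):
    # Start the namenode daemon, optionally with extra startup options,
    # then give it some time to load the image before probing.
    subprocess.call([HADOOP_DAEMON, "start", "namenode"] + list(extra_args))
    time.sleep(60)
    return namenode_is_up()

# 1. Try the existing image in dfs.name.dir first.
if not try_start():
    # 2. Only then fall back to the (possibly older) checkpointed image.
    try_start(["-importCheckpoint"])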
However, for our situation we intend to have a mechanism that will detect
a namenode failure and automatically start up the namenode with the
-importcheckpoint option on the secondary namenode server. When I say
automatically, it necessarily means absolutely no manual intervention at
the point of failure and startup. The datanodes on the slaves should
also automatically become aware of the change in the namenode, and the
cluster as a whole should carry on without any loss of functionality.
Hence, my question is,
- Is such a mechanism part of any future Hadoop release?
As of now, there is no automatic failover.
- Is HADOOP-2585 a step towards incorporating an automatic namenode
recovery mechanism in Hadoop?
In some sense yes, and in some sense no.
- We are currently using Hadoop 0.17.1; is there any way we can achieve
something close to the mechanism we want?
One thing you could do is this. In your config, there is a variable named
dfs.name.dir, which accepts a comma-separated list of directories. The namenode
stores its image and edits in all the directories listed by this config
variable. The normal way of operating is to have one directory local to the
namenode, another on NFS, and possibly another on the same node but on a
different disk. Now, when the namenode goes down, you can try to start it again;
if any one of the directories has a good image, the namenode will pick it up and
start functioning.
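Just to illustrate, a small script along these lines could list whatever is
configured under dfs.name.dir and report which of those directories still hold
a current/fsimage after a crash. The config path is a placeholder for your own
setup, and the check only confirms the file exists, not that it is uncorrupted:

import os
import xml.etree.ElementTree as ET

CONF = "/path/to/hadoop/conf/hadoop-site.xml"   # hypothetical config path

def name_dirs(conf_path):
    # Return the comma separated list configured under dfs.name.dir.
    root = ET.parse(conf_path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == "dfs.name.dir":
            return [d.strip() for d in prop.findtext("value").split(",")]
    return []

for d in name_dirs(CONF):
    image = os.path.join(d, "current", "fsimage")
    status = "image present" if os.path.exists(image) else "image MISSING"
    print("%s: %s" % (d, status))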
But as we have seen, the namenode does not go down that easily; all exceptions are
caught and logged, keeping the namenode running.
Thanks,
Lohit