Joydeep Sen Sarma wrote:
agreed - i think anyone who is considering hadoop as a place from which data is served has to be disturbed by the lack of data protection. replication in hadoop provides protection against hardware failures, not software failures. backups (and, depending on how they are implemented, snapshots) protect against errant software. we have seen evidence of the namenode going haywire and causing block deletions/file corruptions at least once, and we have seen more reports of the same nature on this list. i don't think hadoop (and hbase) can reach their full potential without a safeguard against software corruptions.
Did you try the '-upgrade' option? It is meant exactly for protection against errant software. It does not let you keep multiple 'snapshots', though that was part of the initial design of the feature.
Raghu.
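For reference, the upgrade/rollback workflow Raghu is referring to looks roughly like this. This is a sketch only; script locations and exact flags vary between Hadoop versions, so check the docs for yours:

```shell
# Start HDFS with a pre-upgrade checkpoint of the namespace and
# block data; the old state is preserved until you finalize.
bin/start-dfs.sh -upgrade

# Check how the upgrade is progressing.
bin/hadoop dfsadmin -upgradeProgress status

# If the new software corrupted data, restart from the saved
# pre-upgrade state instead:
#   bin/start-dfs.sh -rollback

# Once satisfied, discard the saved state (irreversible - after
# this there is nothing to roll back to).
bin/hadoop dfsadmin -finalizeUpgrade
```

The key limitation, as noted above, is that only one such saved state exists at a time, so it is a single restore point rather than a series of snapshots.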
this question came up a couple of days back as well. one option is switching over to solaris+zfs as a way of taking data snapshots. the other option is having two hdfs instances (ideally running different versions) and replicating data between them. both have clear downsides. (i don't think the traditional notion of backing up to tape - or even virtual tape, which is really what our filers are becoming - is worth discussing. for large data sets the restore time would be so bad as to render these useless as a recovery path.)
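The two-instance approach is usually driven with distcp. A minimal sketch, with hypothetical namenode hostnames and ports; the hftp scheme reads over the source cluster's HTTP interface, which is what makes copies between clusters running different Hadoop versions possible:

```shell
# Replicate /user/data from the primary cluster to a backup
# cluster, copying only files that are new or changed (-update).
# hftp:// on the source side tolerates a version mismatch between
# the two clusters; the job must run on the destination cluster.
bin/hadoop distcp -update \
    hftp://primary-nn:50070/user/data \
    hdfs://backup-nn:8020/user/data
```

The main downsides are that this is a periodic copy rather than a true snapshot (a software bug can be faithfully replicated to the backup before anyone notices) and that it doubles the storage footprint.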