In your example, you only have one active Name Node, so how would you encounter a 'split brain' scenario? It might help if you defined what you mean by split brain.
-Mike

On Jun 18, 2012, at 8:30 PM, hdev ml wrote:

> All Hadoop contributors/experts,
>
> I am trying to simulate split brain in our installation. There are a few
> things we want to know:
>
> 1. Does data corruption happen?
> 2. If yes to #1, how do we recover from it?
> 3. What corrective steps should we take in this situation, e.g. killing one
> namenode?
>
> To simulate this, I took the following steps.
>
> 1. We already have a healthy test cluster of 4 machines. One machine runs
> the namenode and a datanode, the second runs the secondarynamenode and a
> datanode, the third runs the jobtracker and a datanode, and the fourth runs
> only a datanode.
> 2. I copied the Hadoop installation folder to a new location on the datanode.
> 3. I kept all configuration in hdfs-site.xml and core-site.xml the same,
> except that I changed fs.default.name to a different URI.
> 4. The namenode directory, dfs.name.dir, pointed to the same shared
> NFS-mounted directory that the main namenode uses.
>
> I started this standby namenode using the following command:
> bin/hadoop-daemon.sh --config conf --hosts slaves start namenode
>
> It errored out saying "the directory is already locked", which is the
> expected behaviour: the directory has been locked by the original namenode.
>
> So I changed dfs.name.dir to some other folder and issued the same
> command. It failed with the message "namenode has not been formatted",
> which is also expected.
>
> This makes me think: does a split-brain situation really occur in Hadoop?
>
> My understanding is that split brain happens because of timeouts on the
> main namenode. When a timeout occurs, the HA implementation (be it Linux
> HA, Veritas, etc.) concludes that the main namenode has died and starts the
> standby namenode. The standby namenode starts up, and then the main
> namenode comes back from the timeout and carries on as if nothing
> happened, giving rise to two namenodes in the cluster: split brain.
>
> Considering the error messages and the above understanding, I cannot point
> two different namenodes to the same directory, because the main namenode,
> although not responding, still holds the lock on the directory.
>
> So can I safely conclude that split brain does not occur in Hadoop?
>
> Or am I missing some other situation in which split brain happens and the
> namenode directory is not locked, allowing the standby namenode to start up
> as well?
>
> Has anybody encountered this?
>
> Any help is really appreciated.
>
> Harshad
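The "directory is already locked" error is what one would expect if each namenode takes an exclusive lock file in its storage directory at startup and refuses to run without it. The sketch below models that check. It is an illustration under assumptions, not Hadoop's code: the lock-file name `in_use.lock`, the helper names `StorageDirLock` and `start_namenode`, and the use of `fcntl.flock` (Unix only) are all stand-ins chosen for the example.

```python
import fcntl
import os
import tempfile


class StorageDirLock:
    """Hypothetical helper: an exclusive, non-blocking lock on a storage directory."""

    LOCK_NAME = "in_use.lock"  # assumed lock-file name, for illustration only

    def __init__(self, directory: str) -> None:
        self.path = os.path.join(directory, self.LOCK_NAME)
        self._fd = None

    def try_lock(self) -> bool:
        """Return True if the lock was acquired, False if another holder has it."""
        fd = os.open(self.path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)
            return False
        self._fd = fd
        return True

    def unlock(self) -> None:
        """Release the lock; the OS also releases it when the process exits."""
        if self._fd is not None:
            fcntl.flock(self._fd, fcntl.LOCK_UN)
            os.close(self._fd)
            self._fd = None


def start_namenode(name_dir: str) -> StorageDirLock:
    """Hypothetical startup check that refuses to run without the directory lock."""
    lock = StorageDirLock(name_dir)
    if not lock.try_lock():
        raise RuntimeError(f"{name_dir}: the directory is already locked")
    return lock


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared_dir:
        primary = start_namenode(shared_dir)  # first "namenode" takes the lock
        try:
            start_namenode(shared_dir)  # second one on the same directory
        except RuntimeError as err:
            print("standby refused:", err)
        primary.unlock()
```

Such locks are advisory (they only stop processes that also ask for the lock), and on an NFS mount they are only as reliable as the NFS lock service, which is why the demo uses a local temporary directory.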
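The timeout scenario described in the question can be reproduced as a toy, deterministic simulation. Everything here is hypothetical (`NameNode`, `SharedDirLock`, `simulate`, and the integer clock) and does not model any real HA product. It shows a naive heartbeat monitor promoting the standby while the primary is merely paused, which leaves two active nodes, and how requiring the standby to take the same exclusive lock blocks that promotion for as long as the paused primary still holds it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NameNode:
    """Toy model of a namenode process (hypothetical, not Hadoop's API)."""
    name: str
    active: bool = False
    last_heartbeat: int = 0


class SharedDirLock:
    """In-memory stand-in for the lock on a shared dfs.name.dir."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None

    def try_acquire(self, who: str) -> bool:
        if self.owner is None or self.owner == who:
            self.owner = who
            return True
        return False


def simulate(timeout: int, pause: range, end: int,
             lock: Optional[SharedDirLock]) -> List[NameNode]:
    """Run a naive heartbeat monitor for t = 0 .. end-1 and return [primary, standby].

    The primary sends no heartbeats during `pause` (e.g. a long stall) but never
    stops believing it is active. If `lock` is given, the standby must acquire
    it before the monitor may promote it.
    """
    primary = NameNode("nn1", active=True)
    standby = NameNode("nn2")
    if lock is not None:
        lock.try_acquire(primary.name)  # primary locked the directory at startup

    for t in range(end):
        if t not in pause:
            primary.last_heartbeat = t  # primary is responsive
        stale = t - primary.last_heartbeat > timeout
        if stale and not standby.active:
            if lock is None or lock.try_acquire(standby.name):
                standby.active = True  # monitor promotes the standby
    return [primary, standby]


def active_names(nodes: List[NameNode]) -> List[str]:
    return [n.name for n in nodes if n.active]


if __name__ == "__main__":
    print(active_names(simulate(5, range(5, 15), 20, lock=None)))            # ['nn1', 'nn2']
    print(active_names(simulate(5, range(5, 15), 20, lock=SharedDirLock())))  # ['nn1']
```

In this model the lock helps only because the paused primary is still alive and so still holds it; it does nothing about a standby pointed at a different directory, which in the question was refused for another reason (the directory was not formatted).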