---------- Forwarded message ----------
From: phil young <[email protected]>
Date: Mon, Oct 25, 2010 at 8:30 PM
Subject: Re: Namenode corruption: need help quickly please
To: [email protected]
In the interests of helping others, here are some details on what happened to us and how we recovered.

Incompatible Build Versions (between the NameNode and the DataNodes)

We and others have seen the following error. Apparently it occurs when some change results in a difference between the "build" versions. This is not DFS corruption, but it may appear to be, because the master and task tracker processes start fine while the DataNodes report the following error:

    2010-10-25 18:35:38,470 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Incompatible build versions: namenode BV = ; datanode BV = xxxxx

In our case this was caused by running "ant package" on the master. To recover, we restored /hadoop on the master using the following steps (rough command sketches for the version check and for steps 1 and 2.3 appear below, after the quoted reply):

1. Stop the cluster (somewhat violently)
   1. Normal shutdown
      1. stop-all.sh
   2. Find and kill lingering processes
      1. mon_jps   # an alias in ~/.bash_profile that runs jps on all slaves
      2. kill -9 each running Java process
   3. Remove pid files
      1. ls -ltr /tmp/*pid
      2. rm -f /tmp/*pid   # on each slave
2. Restore /hadoop on the master from a slave
   1. cd /usr/local/hadoop
   2. mv hadoop-0.20.2 hadoop-0.20.2.MOVED
   3. Restore hadoop-0.20.2 from a tarball generated on a slave
3. Restore the original "conf" folder for the master (since it's not the same as the slaves')
   1. cd hadoop-0.20.2
   2. mv ./conf ./conf.MOVED
   3. cp -r ../hadoop-0.20.2.MOVED/conf ./
4. Start the cluster
   1. start-all.sh
   2. test_hadoop   # an alias in ~/.bash_profile that runs a test map-reduce job

On Mon, Oct 25, 2010 at 8:00 PM, Brian Bockelman <[email protected]> wrote:

>
> On Oct 25, 2010, at 6:35 PM, phil young wrote:
>
> > I had also assumed that some other jar or configuration file had been
> > changed, but reviewing the timestamps on the files did not reveal the
> > problem. On the assumption that something had in fact changed that I
> > was not seeing, I renamed my $HADOOP_HOME directory and replaced it
> > with one from a slave. I then restored $HADOOP_HOME/conf from the
> > original/renamed directory, and voila - we're back in business.
>
> Glad to hear this.
>
> > Brian, thanks very much for your help. It took literally more time for
> > me to write the original email (5 minutes) than to get a reply which
> > indicated a way to solve the problem, and another 5 minutes to solve
> > it. That says a lot about the user group. I don't think I would have
> > reached a human being in 5 minutes via tech support for most products.
> > I'll make sure to monitor this list more closely so I can pay it
> > forward ;)
>
> No problem. There are lots of good people on this list, and I certainly
> have done the "oh crap, I put my neck on the line for this new Hadoop
> thing and now it's broke" email.
>
> Brian
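
For anyone hitting the same error, a quick way to confirm a build-version mismatch before rebuilding anything is to compare what each node reports. The "hadoop version" subcommand prints the release and build details that feed the BV strings in the error above. A minimal sketch, assuming the /usr/local/hadoop/hadoop-0.20.2 layout from the steps above, passwordless ssh, and a stock conf/slaves file (one hostname per line):

    # On the master: print the local release and build details.
    /usr/local/hadoop/hadoop-0.20.2/bin/hadoop version

    # Ask every slave for the same and eyeball the output;
    # every node should print identical version/build lines.
    for h in $(cat /usr/local/hadoop/hadoop-0.20.2/conf/slaves); do
        echo "== $h =="
        ssh "$h" /usr/local/hadoop/hadoop-0.20.2/bin/hadoop version
    done

If the build lines differ between the master and any slave, you are looking at this incompatible-build-versions problem rather than real DFS corruption.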
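The "somewhat violently" part of step 1 boils down to running jps over ssh and killing what it finds. A sketch of roughly what the mon_jps alias in the steps above would do, plus a cleanup pass; the grep pattern and paths are assumptions, so adjust them to your daemons and layout:

    # List Java processes on every slave (roughly the mon_jps alias).
    for h in $(cat /usr/local/hadoop/hadoop-0.20.2/conf/slaves); do
        echo "== $h =="
        ssh "$h" jps
    done

    # Kill any lingering Hadoop daemons and remove stale pid files on each slave.
    for h in $(cat /usr/local/hadoop/hadoop-0.20.2/conf/slaves); do
        ssh "$h" "jps | grep -E 'DataNode|TaskTracker' | awk '{print \$1}' | xargs -r kill -9; rm -f /tmp/*pid"
    done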
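Step 2.3 ("restore hadoop-0.20.2 from a tarball generated on a slave") can be done in one pipeline. A sketch, assuming a healthy slave named slave1 (a placeholder hostname) and the same /usr/local/hadoop layout:

    # On the master, after step 2.2 has moved the broken tree aside:
    cd /usr/local/hadoop

    # Stream a tarball of the known-good install from the slave and unpack it here.
    ssh slave1 "tar -czf - -C /usr/local/hadoop hadoop-0.20.2" | tar -xzf -

Step 3 then still applies: the unpacked tree carries the slave's conf directory, and the master's original conf has to be copied back over it.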
