Digging into the logs some more, the namenode and secondarynamenode logs are full of exceptions like this going back to Dec 25th (the oldest logs I have):
2007-12-25 00:03:38,497 INFO org.apache.hadoop.fs.FSNamesystem: Roll Edit Log from 204.16.107.165
2007-12-25 00:03:38,557 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 54310, call rollEditLog() from 204.16.107.165:38982: error: java.io.IOException: Attempt to roll edit log but edits.new exists
java.io.IOException: Attempt to roll edit log but edits.new exists
        at org.apache.hadoop.dfs.FSEditLog.rollEditLog(FSEditLog.java:577)
        at org.apache.hadoop.dfs.FSNamesystem.rollEditLog(FSNamesystem.java:3519)
        at org.apache.hadoop.dfs.NameNode.rollEditLog(NameNode.java:553)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:340)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:566)

The datanode logs on my master system look OK until New Year's Eve, when for some reason it starts moving blocks around like crazy. I noticed the next day that it seems to have rebalanced the whole file system.
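The rollEditLog failures above suggest a stale edits.new file left behind by a failed secondary-namenode checkpoint. A minimal shell sketch to confirm that state before touching anything — the default name-directory path is an assumption (dfs.name.dir defaults under /tmp/hadoop-${USER}); the exact layout also varies by Hadoop version, so the helper just searches the whole directory:

```shell
#!/bin/sh
# Hypothetical helper: report whether a stale edits.new exists anywhere
# under a namenode storage directory. The directory argument is an
# assumption -- point it at your actual dfs.name.dir.
check_stale_edits() {
  name_dir="$1"
  if find "$name_dir" -name 'edits.new' 2>/dev/null | grep -q .; then
    echo "stale edits.new present"
  else
    echo "no edits.new found"
  fi
}

# Assumed default location for an unconfigured dfs.name.dir:
check_stale_edits "/tmp/hadoop-${USER}/dfs/name"
```

If the file is there, restarting the namenode (which replays the edit logs and saves a fresh image on startup) has generally been the way to clear it in this era of Hadoop; deleting metadata files by hand risks losing edits, so I would not do that first.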
During this process there are a number of errors like:

2007-12-31 05:04:09,413 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-8158005346611535914 to 204.16.107.200:50010 got java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
        at java.lang.Thread.run(Thread.java:595)
2007-12-31 05:04:09,415 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-8158005346611535914 to 204.16.107.200:50010 got java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
        at java.lang.Thread.run(Thread.java:595)

-----Original Message-----
From: Jeff Eastman [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 03, 2008 9:26 AM
To: hadoop-user@lucene.apache.org
Subject: Damage Control

I have a small cloud running with about 100 GB of data in the DFS. All appeared normal until yesterday, when Eclipse could not access the DFS. Investigating:

1. I logged onto the master machine and attempted to upload a local file. Got 6 errors like:

08/01/02 21:34:43 WARN fs.DFSClient: Error while writing.
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1656)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1744)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
        at org.apache.hadoop.fs.FileUtil.copyContent(FileUtil.java:263)
        at org.apache.hadoop.fs.FileUtil.copyContent(FileUtil.java:248)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:133)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:776)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:757)
        at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:115)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1220)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:187)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1333)
put: Broken pipe

2. I bounced the cloud.
3. Now I had 2x the number of nodes in the node manager (hosts were all duplicated, with 0 blocks allocated in each duplicate).
4. I brought down the cloud.
5. jps still showed master processes, but none on the slaves.
6. Tried to bring down the cloud again; no change.
7. Rebooted the master server (stupid move).
8. Brought up the cloud. No name node:

[EMAIL PROTECTED] hadoop]$ jps
2436 DataNode
2539 SecondaryNameNode
2781 Jps
2739 TaskTracker
2605 JobTracker

9. The node manager page is absent; I cannot connect to Hadoop.
10. Checking the name node log, the directory /tmp/hadoop-jeastman/dfs/name is missing.

The simplest thing would be to just reinitialize the DFS, since the data is stored elsewhere. But I would like to understand what went wrong, and also fix it if that is possible. Any suggestions?

Jeff
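One observation on step 10: the path /tmp/hadoop-jeastman/dfs/name is the default location dfs.name.dir takes when it is left unset (it falls under hadoop.tmp.dir, which defaults to /tmp/hadoop-${user.name}), and many systems clean /tmp on reboot — so rebooting the master in step 7 would plausibly have erased the namenode metadata. A sketch of a hadoop-site.xml fragment that keeps the metadata on persistent storage; the /srv/hadoop paths are placeholders, not a recommendation of any particular mount point:

```xml
<!-- Sketch (assumed paths): keep namenode metadata and all working
     state out of /tmp so a reboot cannot erase them. -->
<property>
  <name>dfs.name.dir</name>
  <value>/srv/hadoop/dfs/name</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/srv/hadoop/tmp</value>
</property>
```

With the old name directory gone and no copy elsewhere, reformatting the DFS and reloading the data is probably unavoidable; the fragment above would at least prevent a repeat.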