I have a small cloud running with about 100 GB of data in the DFS. All appeared normal until yesterday, when Eclipse could not access the DFS. Investigating:
1. I logged onto the master machine and attempted to upload a local file. Got 6 errors like:

08/01/02 21:34:43 WARN fs.DFSClient: Error while writing.
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1656)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1744)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
        at org.apache.hadoop.fs.FileUtil.copyContent(FileUtil.java:263)
        at org.apache.hadoop.fs.FileUtil.copyContent(FileUtil.java:248)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:133)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:776)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:757)
        at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:115)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1220)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:187)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1333)
put: Broken pipe

2. I bounced the cloud.
3. The node manager then showed 2x the number of nodes (every host was duplicated, with 0 blocks allocated in each duplicate).
4. I brought down the cloud.
5. jps still showed the master processes running, but none on the slaves.
6. Tried to bring down the cloud again; no change.
7. Rebooted the master server (stupid move).
8. Brought up the cloud. No namenode:

[EMAIL PROTECTED] hadoop]$ jps
2436 DataNode
2539 SecondaryNameNode
2781 Jps
2739 TaskTracker
2605 JobTracker

9. The node manager page is absent; I cannot connect to Hadoop.
10. Checking the namenode log, the directory /tmp/hadoop-jeastman/dfs/name is missing (see the P.S. below).

The simplest thing would be to just reinitialize the DFS, since the data is stored elsewhere. But I would like to understand what went wrong, if possible, and also fix it if that is possible.

Any suggestions?

Jeff
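P.S. One thing I noticed while digging, in case it helps with the diagnosis: /tmp/hadoop-jeastman/dfs/name is exactly where dfs.name.dir lands by default, since hadoop.tmp.dir defaults to /tmp/hadoop-${user.name}, and /tmp is typically cleared on reboot. That would explain why the namenode image vanished after step 7. If the long-term fix amounts to moving the metadata somewhere persistent before reformatting, I am guessing the hadoop-site.xml entries would look roughly like the sketch below (the /home/jeastman paths are placeholders, not my actual layout):

    <property>
      <name>dfs.name.dir</name>
      <value>/home/jeastman/hadoop/dfs/name</value>
      <!-- placeholder path: namenode metadata, kept off /tmp so a reboot cannot wipe it -->
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/home/jeastman/hadoop/dfs/data</value>
      <!-- placeholder path: datanode block storage, likewise on persistent disk -->
    </property>

If I am reading the docs right, dfs.name.dir can also take a comma-separated list of directories so the image is written to more than one place, which would have saved me here.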