I have a small cloud running with about 100 GB of data in the DFS. All
appeared normal until yesterday, when Eclipse could no longer access the
DFS. Investigating:
1. I logged onto the master machine and attempted to upload a local
file. I got six errors like the following:
08/01/02 21:34:43 WARN fs.DFSClient: Error while writing.
java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1656)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1744)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
    at org.apache.hadoop.fs.FileUtil.copyContent(FileUtil.java:263)
    at org.apache.hadoop.fs.FileUtil.copyContent(FileUtil.java:248)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:133)
    at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:776)
    at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:757)
    at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:115)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:1220)
    at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:187)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:1333)
put: Broken pipe
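(The upload itself was the standard FsShell put, run from the Hadoop home
directory on the master; the file names here are just placeholders:)

bin/hadoop dfs -put /home/jeastman/sample.txt /user/jeastman/sample.txt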
2. I bounced the cloud
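(By "bounced" I mean the standard restart scripts, run on the master from
the Hadoop home directory:)

bin/stop-all.sh
bin/start-all.sh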
3. The node manager page then showed twice the number of nodes (every
host was duplicated, with 0 blocks allocated in each duplicate)
4. I brought down the cloud
5. jps still showed the master processes running, but none on the slaves
6. I tried to bring down the cloud again; no change
7. I rebooted the master server (a stupid move, in hindsight)
8. I brought the cloud back up. No namenode:
[EMAIL PROTECTED] hadoop]$ jps
2436 DataNode
2539 SecondaryNameNode
2781 Jps
2739 TaskTracker
2605 JobTracker
9. The node manager page is absent, and I cannot connect to Hadoop at all
10. The namenode log shows that the directory
/tmp/hadoop-jeastman/dfs/name is missing
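That path looks like the default layout (dfs.name.dir lives under
hadoop.tmp.dir, which defaults to /tmp/hadoop-${user.name}), so presumably
the reboot cleaned out /tmp and took the namenode image with it. If that
is what happened, I gather the fix going forward is to pin the name
directory somewhere persistent in conf/hadoop-site.xml, something like
(the value is just an example path):

<property>
  <name>dfs.name.dir</name>
  <value>/home/jeastman/dfs/name</value>
</property>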
The simplest thing would be to just reinitialize the DFS, since the data
is also stored elsewhere. But I would like to understand what went wrong,
and to fix it if that is possible. Any suggestions?
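(If reinitializing turns out to be the answer, I assume it is the usual
format-and-restart sequence, something like the lines below; since that
wipes whatever DFS metadata remains, I am holding it as a last resort:)

bin/stop-all.sh
bin/hadoop namenode -format
bin/start-all.sh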
Jeff