I have a small cloud running with about 100 GB of data in the dfs. All
appeared normal until yesterday, when Eclipse could not access the dfs.
Investigating:

 

1. I logged onto the master machine and attempted to upload a local
file. Got 6 errors like:

 

08/01/02 21:34:43 WARN fs.DFSClient: Error while writing.
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1656)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1744)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
        at org.apache.hadoop.fs.FileUtil.copyContent(FileUtil.java:263)
        at org.apache.hadoop.fs.FileUtil.copyContent(FileUtil.java:248)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:133)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:776)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:757)
        at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:115)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1220)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:187)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1333)

put: Broken pipe

 

2. I bounced the cloud

3. Now the Node manager showed twice the number of nodes (every host was
duplicated, with 0 blocks allocated to each duplicate)

4. I brought down the cloud

5. jps still showed the master processes, but none on the slaves

6. Tried to bring down the cloud again; no change

7. Rebooted the master server (stupid move)

8. Brought the cloud back up. No name node

 

[EMAIL PROTECTED] hadoop]$ jps
2436 DataNode
2539 SecondaryNameNode
2781 Jps
2739 TaskTracker
2605 JobTracker

 

9. The Node manager page is absent; I cannot connect to Hadoop

10. Checking the name node log, the directory
/tmp/hadoop-jeastman/dfs/name is missing
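
That directory is the default name node metadata location: dfs.name.dir
defaults to ${hadoop.tmp.dir}/dfs/name, and hadoop.tmp.dir defaults to
/tmp/hadoop-${user.name}, so the namenode image and edits were living under
/tmp. I am guessing a /tmp cleanup around the reboot removed them. If that is
the cause, I assume pinning the metadata somewhere persistent in
conf/hadoop-site.xml would prevent a repeat; a minimal sketch (the
/data/hadoop paths below are just placeholders I made up):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
    <!-- keep working storage out of /tmp -->
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/dfs/name</value>
    <!-- namenode image/edits; default is ${hadoop.tmp.dir}/dfs/name -->
  </property>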

 

The simplest thing would be to just reinitialize the dfs, since the data
is stored elsewhere. But I would like to understand what went wrong and,
if possible, fix it. Any suggestions?
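
If it does come to reinitializing, I assume the sequence is roughly the
sketch below (the data directory path is just the default under
hadoop.tmp.dir and may differ; please correct me if the datanode cleanup
step is unnecessary):

  bin/stop-all.sh                 # make sure nothing is left half-running
  bin/hadoop namenode -format     # recreates empty name node metadata
  # on each slave, remove the old block storage first, otherwise the
  # datanodes may refuse to join because of a namespace ID mismatch:
  #   rm -rf /tmp/hadoop-jeastman/dfs/data
  bin/start-all.sh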

 

Jeff
