This failure seems to be repeatable with this job and this cluster.
I reran the job and hit the same problem: two machines were unable to transfer some blocks.

I have a mapper, a combiner, and a reducer. My combiner gives roughly a 4:1 reduction in data volume.
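For anyone following along, the 4:1 number comes from the combiner doing local aggregation of map output before the shuffle, so the reducers fetch far fewer records. A minimal sketch of that idea in plain Java (not the actual job; the keys and counts here are invented for illustration):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {

    // Collapse repeated (key, 1) records emitted by a map task into one
    // (key, sum) record per key, mimicking a summing combiner.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String key : mapOutputKeys) {
            combined.merge(key, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Pretend a map task emitted these 8 (key, 1) records.
        List<String> emitted = new ArrayList<>(
                List.of("a", "b", "a", "a", "b", "c", "a", "b"));
        Map<String, Integer> combined = combine(emitted);
        // 8 records collapse to 3, so the reduce side copies less data.
        System.out.println(emitted.size() + " -> " + combined.size());
        System.out.println(combined);
    }
}
```

In real Hadoop the combiner is just a Reducer class registered on the job configuration, and the framework may run it zero or more times per map task, so it must be associative and commutative like the summing shown here.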

This is the same job that shows the slow reducer transfer rates I asked about earlier.

reduce > copy (643 of 789 at 0.12 MB/s) >
reduce > copy (656 of 789 at 0.12 MB/s) >
reduce > copy (644 of 789 at 0.12 MB/s) >
reduce > copy (644 of 789 at 0.12 MB/s) >
reduce > copy (656 of 789 at 0.12 MB/s) >
reduce > copy (656 of 789 at 0.12 MB/s) >
reduce > copy (643 of 789 at 0.12 MB/s) >
reduce > copy (623 of 789 at 0.12 MB/s) >
reduce > copy (621 of 789 at 0.12 MB/s) >

Raghu Angadi wrote:

I would think that after an hour or so things are OK, but that might not have helped the job.

Raghu.

Jason Venner wrote:
We have a small cluster of 9 machines on a shared Gig switch (alongside a lot of other machines).

The other day, while running a job, the reduce phase stalled when the map phase was 99.99x% done. 7 of the 9 machines were idle, and 2 of the machines were using 100% of 1 CPU (1 job per machine).

So it appears there was a synchronization failure: one machine thought the transfer hadn't started, while the other machine thought it had.

We did have a momentary network outage on the switch during this job. We tried stopping the Hadoop processes on the machines with the sending failures; after 10 minutes they went 'dead', but the job never resumed.

Looking into the log files of the spinning machines, they were endlessly trying to start a block move to any of a set of other machines in the cluster. The shape of their repeated log messages is below.

2007-12-03 15:42:44,755 INFO org.apache.hadoop.dfs.DataNode: Starting thread to transfer block blk_3105072074036734167 to [Lorg.apache.hadoop.dfs.DatanodeInfo;@6fc40f
2007-12-03 15:42:44,757 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to XX.YY.ZZ.AAA:50010 got java.net.SocketException: Broken pipe
      at java.net.SocketOutputStream.socketWrite0(Native Method)
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
      at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
      at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
      at java.io.DataOutputStream.write(DataOutputStream.java:90)
      at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1175)
      at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1208)
      at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1460)
      at java.lang.Thread.run(Thread.java:619)


On the machines that the transfers were targeted to, the following appeared in the log file.

2007-12-03 15:42:18,508 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_3105072074036734167 has already been started (though not completed), and thus cannot be created.
      at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:568)
      at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1257)
      at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
      at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
      at java.lang.Thread.run(Thread.java:619)

