This failure seems to be repeatable with this job and this cluster.
I reran the job and hit the same problem: two machines were unable to transfer some blocks.

I have a mapper, a combiner, and a reducer. My combiner gives roughly a 4:1 reduction in data volume.
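For anyone following along, the 4:1 number comes from the combiner doing local aggregation of map output before the shuffle, so the reducers fetch far fewer records. A minimal sketch of that idea in plain Java (not the actual job; the keys and counts here are invented for illustration):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {

    // Collapse repeated (key, 1) records emitted by a map task into one
    // (key, sum) record per key, mimicking a summing combiner.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String key : mapOutputKeys) {
            combined.merge(key, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Pretend a map task emitted these 8 (key, 1) records.
        List<String> emitted = new ArrayList<>(
                List.of("a", "b", "a", "a", "b", "c", "a", "b"));
        Map<String, Integer> combined = combine(emitted);
        // 8 records collapse to 3, so the reduce side copies less data.
        System.out.println(emitted.size() + " -> " + combined.size());
        System.out.println(combined);
    }
}
```

In real Hadoop the combiner is just a Reducer class registered on the job configuration, and the framework may run it zero or more times per map task, so it must be associative and commutative like the summing shown here.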

This is the same job that shows the slow reducer transfer rates I asked about earlier.

reduce > copy (643 of 789 at 0.12 MB/s) >
reduce > copy (656 of 789 at 0.12 MB/s) >
reduce > copy (644 of 789 at 0.12 MB/s) >
reduce > copy (644 of 789 at 0.12 MB/s) >
reduce > copy (656 of 789 at 0.12 MB/s) >
reduce > copy (656 of 789 at 0.12 MB/s) >
reduce > copy (643 of 789 at 0.12 MB/s) >
reduce > copy (623 of 789 at 0.12 MB/s) >
reduce > copy (621 of 789 at 0.12 MB/s) >

Raghu Angadi wrote:

I would think that after an hour or so things are OK, but that might not have helped the job.

Raghu.

Jason Venner wrote:
We have a small cluster of 9 machines on a shared Gig switch (alongside a lot of other machines).

The other day, while running a job, the reduce phase stalled when the map phase was 99.99x% done. 7 of the 9 machines were idle, and 2 of the machines were using 100% of 1 CPU (1 job per machine).

So it appears there was a synchronization failure: one machine thought the transfer hadn't started, while the other machine thought it had.

We did have a momentary network outage on the switch during this job. We tried stopping the Hadoop processes on the machines with the sending failures; after 10 minutes they went 'dead', but the job never resumed.

Looking into the log files of the spinning machines, they were endlessly trying to start a block move to any of a set of other machines in the cluster. The shape of their repeated log messages is below.

2007-12-03 15:42:44,755 INFO org.apache.hadoop.dfs.DataNode: Starting thread to transfer block blk_3105072074036734167 to [Lorg.apache.hadoop.dfs.DatanodeInfo;@6fc40f
2007-12-03 15:42:44,757 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_3105072074036734167 to XX.YY.ZZ.AAA:50010 got java.net.SocketException: Broken pipe
      at java.net.SocketOutputStream.socketWrite0(Native Method)
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
      at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
      at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
      at java.io.DataOutputStream.write(DataOutputStream.java:90)
      at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1175)
      at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1208)
      at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1460)
      at java.lang.Thread.run(Thread.java:619)


On the machines that the transfers were targeted to, the following appeared in the log file.

2007-12-03 15:42:18,508 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_3105072074036734167 has already been started (though not completed), and thus cannot be created.
      at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:568)
      at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1257)
      at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
      at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
      at java.lang.Thread.run(Thread.java:619)

