I don't really have these logs anymore, as I've bounced my cluster. But I am willing to ferret out anything in particular on my next failed run.
On Mar 13, 2008, at 4:32 PM, Raghu Angadi wrote:
Yeah, it's kind of hard to deal with these failures once they start occurring.
Are all these logs from the same datanode? Could you separate the logs from different datanodes?
If the first exception stack is from replicating a block (as opposed to the initial write), then http://issues.apache.org/jira/browse/HADOOP-3007 would help there: a failure on the next datanode should not affect this datanode. You would still need to check why the remote datanode failed.
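Conceptually, that fix makes a downstream failure in the write pipeline non-fatal for the local replica. A rough Java sketch of the intended behavior (class and method names here are mine for illustration, not Hadoop's actual code):

    import java.io.IOException;
    import java.io.OutputStream;

    public class PipelineSketch {
        private final OutputStream localDisk;
        private OutputStream mirror; // next datanode in the pipeline, may be null

        public PipelineSketch(OutputStream localDisk, OutputStream mirror) {
            this.localDisk = localDisk;
            this.mirror = mirror;
        }

        public void receivePacket(byte[] packet) throws IOException {
            localDisk.write(packet); // the local replica must still succeed
            if (mirror != null) {
                try {
                    mirror.write(packet); // forwarding downstream is best-effort
                } catch (IOException e) {
                    // Pre-fix, an error here effectively killed the local write
                    // too; post-fix, drop the mirror and keep the local copy.
                    System.err.println("mirror failed: " + e.getMessage());
                    mirror = null;
                }
            }
        }
    }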
Another problem is that once a DataNode fails to write a block, the same block cannot be written to that node for the next hour. These are the "cannot be created" errors you see below. We should really fix this. I will file a jira.
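To make that failure mode concrete, the datanode bookkeeping behaves roughly like this (a minimal sketch, not the actual FSDataset code; the class name and the one-hour constant are only illustrative of the behavior described above):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class BlockRegistrySketch {
        private static final long EXPIRY_MS = 60 * 60 * 1000L; // assumed one-hour hold
        // block id -> time the first (failed) write attempt started
        private final Map<Long, Long> ongoingCreates = new HashMap<Long, Long>();

        public synchronized void startCreate(long blockId) throws IOException {
            Long started = ongoingCreates.get(blockId);
            if (started != null && System.currentTimeMillis() - started < EXPIRY_MS) {
                // The error seen in the logs below: a re-send of the same
                // block is refused until the stale entry expires.
                throw new IOException("Block blk_" + blockId
                    + " has already been started (though not completed), "
                    + "and thus cannot be created.");
            }
            ongoingCreates.put(blockId, System.currentTimeMillis());
        }

        public synchronized void finishCreate(long blockId) {
            // The bug: on a failed write this cleanup never runs, so the node
            // refuses the block for the rest of the expiry window.
            ongoingCreates.remove(blockId);
        }
    }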
Raghu.
Chris K Wensel wrote:
Here is a reset, followed by three attempts to write the block:
2008-03-13 13:40:06,892 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.26.3:35762 dest: /10.251.26.3:50010
2008-03-13 13:40:06,957 INFO org.apache.hadoop.dfs.DataNode: Exception in receiveBlock for block blk_7813471133156061911 java.net.SocketException: Connection reset
2008-03-13 13:40:06,957 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.net.SocketException: Connection reset
2008-03-13 13:40:06,958 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    at java.io.DataOutputStream.flush(DataOutputStream.java:106)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.receivePacket(DataNode.java:2194)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2244)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
    at java.lang.Thread.run(Thread.java:619)
2008-03-13 13:40:11,751 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.27.148:48384 dest: /10.251.27.148:50010
2008-03-13 13:40:11,752 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
2008-03-13 13:40:11,752 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:638)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1983)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1074)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
    at java.lang.Thread.run(Thread.java:619)
2008-03-13 13:48:37,925 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.70.210:37345 dest: /10.251.70.210:50010
2008-03-13 13:48:37,925 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
2008-03-13 13:48:37,925 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:638)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1983)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1074)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
    at java.lang.Thread.run(Thread.java:619)
2008-03-13 14:08:36,089 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.26.223:49176 dest: /10.251.26.223:50010
2008-03-13 14:08:36,089 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
2008-03-13 14:08:36,089 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:638)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1983)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1074)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
    at java.lang.Thread.run(Thread.java:619)
On Mar 13, 2008, at 11:25 AM, Chris K Wensel wrote:
Should add that 10.251.65.207 (the receiving end of NameSystem.pendingTransfer below) has this datanode log entry:
2008-03-13 14:08:36,089 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
2008-03-13 14:08:36,089 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:638)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1983)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1074)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
    at java.lang.Thread.run(Thread.java:619)
On Mar 13, 2008, at 11:22 AM, Chris K Wensel wrote:
hey all
I have about 40 jobs in a batch I'm running, but consistently one particular MR job hangs at the tail of the copy or at the beginning of the sort (it 'looks' like it's still copying, but it isn't).
This job is a little bigger than the previous successful ones. The mapper dumped about 21,450,689,962 bytes, and the combine output record count jibes with the map output record count (not using a special combiner).
The namenode log shows this:
2008-03-13 13:48:29,789 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_7813471133156061911
2008-03-13 13:48:33,310 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.251.70.210:50010 to replicate blk_7813471133156061911 to datanode(s) 10.251.65.207:50010 10.251.126.6:50010
2008-03-13 13:58:29,809 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_7813471133156061911
2008-03-13 13:58:35,589 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.251.127.198:50010 to replicate blk_7813471133156061911 to datanode(s) 10.251.127.228:50010 10.251.69.162:50010
2008-03-13 14:08:29,729 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_7813471133156061911
2008-03-13 14:08:34,869 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.251.73.223:50010 to replicate blk_7813471133156061911 to datanode(s) 10.251.26.223:50010 10.251.65.207:50010
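For context on the retry pattern in those lines: the namenode tracks each replication request as pending and, if no datanode confirms the new replica in time, times it out and re-asks a different source. A hedged sketch of that loop; the names are mine, and the ten-minute timeout is inferred only from the spacing of the log lines above:

    import java.util.HashMap;
    import java.util.Map;

    public class PendingReplicationSketch {
        private static final long TIMEOUT_MS = 10 * 60 * 1000L; // inferred from logs
        private final Map<Long, Long> pending = new HashMap<Long, Long>(); // blockId -> requestedAt

        public synchronized void requestReplication(long blockId, String sourceNode) {
            System.out.println("ask " + sourceNode + " to replicate blk_" + blockId);
            pending.put(blockId, System.currentTimeMillis());
        }

        public synchronized void confirm(long blockId) {
            pending.remove(blockId); // a datanode reported the new replica
        }

        // Called periodically; every expired entry is retried from scratch,
        // which is why the same block keeps asking different datanodes.
        public synchronized void checkTimeouts() {
            long now = System.currentTimeMillis();
            for (Map.Entry<Long, Long> e : new HashMap<Long, Long>(pending).entrySet()) {
                if (now - e.getValue() > TIMEOUT_MS) {
                    System.out.println("PendingReplicationMonitor timed out block blk_" + e.getKey());
                    pending.remove(e.getKey());
                    // the block is then re-queued with a different source (not shown)
                }
            }
        }
    }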
And I should add there have been periodic connection resets among the nodes (20 slaves), but my hang happens consistently on this job at this point. I also run a fresh cluster every time I exec this batch, so there isn't any cruft in the DFS.
Also, this job has completed fine in the past, but I don't remember seeing so much network static in the past either. Historically I have run with block compression enabled; for the last two hangs compression was disabled. Unsure whether it ever hung with compression or not (I will try a fresh cluster with compression enabled to confirm).
Any ideas on how to unjam or debug this?
ckw
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/