On Nov 25, 2009, at 11:27 AM, David J. O'Dell wrote:

I've intermittently seen the following errors on both of my clusters; it happens when writing files. I was hoping this would go away with the new version, but I see the same behavior on both versions. The namenode logs don't show any problems; it's always on the client and datanodes.

[leaving errors below for reference]

I've seen similar errors on my 0.19.2 cluster when the cluster is decently busy. I've traced this more or less to the host in question doing verification on its blocks, an operation which seems to take the datanode out for upwards of 500 seconds in some cases.

In 0.19.2, if you look at o.a.h.hdfs.server.datanode.FSDataset.FSVolumeSet, you will see that all of its methods are synchronized. All dataset operations on the node seem to drop through methods in this class, which causes a backup whenever one thread holds the monitor for a long time.
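To make the failure mode concrete, here is a minimal sketch (not actual Hadoop code; the class and method names are illustrative) of the pattern: every method synchronizes on the same monitor, so a long verification pass stalls even trivial operations behind it.

```java
// ContentionDemo: all methods share one intrinsic lock, as in 0.19.2's
// FSVolumeSet, so a slow scan serializes everything else on the node.
public class ContentionDemo {
    static class VolumeSet {
        // Simulates a long-running block verification holding the monitor.
        synchronized void verifyBlocks(long millis) throws InterruptedException {
            Thread.sleep(millis);
        }

        // A normally-fast lookup, synchronized on the same monitor.
        synchronized long blockFileLength() {
            return 67108864L; // placeholder value
        }
    }

    public static void main(String[] args) throws Exception {
        VolumeSet volumes = new VolumeSet();
        Thread scanner = new Thread(() -> {
            try { volumes.verifyBlocks(500); } catch (InterruptedException ignored) {}
        });
        scanner.start();
        Thread.sleep(50); // let the scanner grab the monitor first
        long t0 = System.nanoTime();
        volumes.blockFileLength(); // blocks until verifyBlocks releases the monitor
        long waitedMs = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("fast op waited ~" + waitedMs + " ms behind the scan");
        scanner.join();
    }
}
```

Scale the 500 ms sleep up to a multi-hundred-second verification and you get exactly the client-side socket timeouts quoted above.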

You can grab a few jstacks and use a dump analyzer (like https://tda.dev.java.net/) to poke through them to see if you have the same behavior.

I have not spent enough time digging into this to understand whether the whole dataset really needs to be locked during the operation or if the locks could be moved closer to the FSDir operations.
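If it turns out the whole dataset does not need to be locked, one shape the fix could take (purely a sketch under that assumption, not a patch against Hadoop; the Volume class below is illustrative) is one lock per volume, so a scan on one disk no longer stalls work on another:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of pushing locks down toward individual volumes: a long block
// scan on volume A does not block a fast lookup on volume B.
public class PerVolumeLocking {
    static class Volume {
        private final ReentrantLock lock = new ReentrantLock();

        // Simulates a long verification pass over this volume's blocks.
        void verifyBlocks(long millis) throws InterruptedException {
            lock.lock();
            try {
                Thread.sleep(millis);
            } finally {
                lock.unlock();
            }
        }

        // A fast lookup; only contends with work on the same volume.
        long blockFileLength() {
            lock.lock();
            try {
                return 67108864L; // placeholder value
            } finally {
                lock.unlock();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Volume a = new Volume();
        Volume b = new Volume();
        Thread scanner = new Thread(() -> {
            try { a.verifyBlocks(500); } catch (InterruptedException ignored) {}
        });
        scanner.start();
        Thread.sleep(50); // ensure the scan is holding volume a's lock
        long t0 = System.nanoTime();
        b.blockFileLength(); // proceeds immediately: different volume, different lock
        long waitedMs = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("lookup on the other volume waited ~" + waitedMs + " ms");
        scanner.join();
    }
}
```

Whether any cross-volume invariants in FSDataset actually permit this is exactly the open question.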

dave bayer

original log clips included below:

Client log:
09/11/25 10:54:15 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.1.75.11:37852 remote=/10.1.75.125:50010]
09/11/25 10:54:15 INFO hdfs.DFSClient: Abandoning block blk_-105422935413230449_22608
09/11/25 10:54:15 INFO hdfs.DFSClient: Waiting to find target node: 10.1.75.125:50010

Datanode log:
2009-11-25 10:54:51,170 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.75.125:50010, storageID=DS-1401408597-10.1.75.125-50010-1258737830230, infoPort=50075, ipcPort=50020):DataXceiver java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.1.75.104:50010]
      at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:282)
      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
      at java.lang.Thread.run(Thread.java:619)
