Hi,

I am a complete HBase and HDFS newbie, so I apologize in advance for
the inevitable bloopers.

We are doing feasibility testing on NoSQL data store options, with rather
high ingest rate requirements.  So far, HBase is looking good, with only
one issue identified: running at an ingest rate of ~30K rows per second
on a machine with four 2.2 GHz CPUs and 8 GB of RAM, I am slowly leaking
sockets.

This is a single-node setup - no replication.  The CPU load is only about
50%-60%, with the majority of that in userland; system and iowait are
each averaging less than 3%.  There is no swapping going on.

The problem is that on the datanode there are a large number of sockets
in FIN_WAIT1, with corresponding peers on the master in ESTABLISHED.
These pairs hang around for quite some time, and at my ingest rate this
means that the total number of sockets held by the datanode and the master
slowly goes up.
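
For anyone who wants to see the same counts, something along these lines
will show them without watching netstat by hand.  This is just a quick
Linux-specific sketch (it parses /proc/net/tcp directly), and the 50010
port is the default datanode data port from my setup - adjust as needed:

  #!/usr/bin/env python
  # Count TCP states on connections touching the datanode data port.
  # Linux-specific: parses /proc/net/tcp (IPv4 only).
  from collections import Counter

  # State codes from the kernel's include/net/tcp_states.h
  STATES = {'01': 'ESTABLISHED', '04': 'FIN_WAIT1', '05': 'FIN_WAIT2',
            '06': 'TIME_WAIT', '08': 'CLOSE_WAIT', '0A': 'LISTEN'}
  DATANODE_PORT = 50010   # default dfs.datanode.address port

  counts = Counter()
  with open('/proc/net/tcp') as f:
      next(f)                                   # skip the header line
      for line in f:
          fields = line.split()
          local_port = int(fields[1].split(':')[1], 16)
          remote_port = int(fields[2].split(':')[1], 16)
          if DATANODE_PORT in (local_port, remote_port):
              counts[STATES.get(fields[3], fields[3])] += 1

  for state, n in counts.most_common():
      print('%-12s %d' % (state, n))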

If my understanding of TCP is correct, this indicates that the master
peer has stopped reading incoming data from the datanode - i.e., it is
advertising a window of zero - and that the datanode has called close(2)
on its end of the connection.
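
That mechanism is easy to reproduce in isolation.  The toy script below
(nothing HBase-specific, just my attempt to demonstrate the TCP behavior)
has one end accept a connection and never read, while the other fills both
socket buffers and then closes.  The closing end sticks in FIN_WAIT1
because its FIN cannot move past the zero window, while the non-reading
end stays ESTABLISHED - exactly the pairing I am seeing:

  import errno, socket, threading, time

  def never_reads(server):
      # Accept one connection and hold it open without ever reading,
      # so the receive buffer fills and the advertised window drops to 0.
      conn, _ = server.accept()
      time.sleep(120)
      conn.close()

  server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  server.bind(('127.0.0.1', 0))
  server.listen(1)
  port = server.getsockname()[1]
  threading.Thread(target=never_reads, args=(server,)).start()

  client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  client.connect(('127.0.0.1', port))
  client.setblocking(False)

  chunk = b'x' * 65536
  try:
      while True:              # fill our send buffer and the
          client.send(chunk)   # peer's receive buffer
  except socket.error:         # EWOULDBLOCK: both buffers are full
      pass

  client.close()               # the FIN is queued behind the unsent
                               # data; it cannot drain past the zero
                               # window, so this end sits in FIN_WAIT1
                               # while the peer shows ESTABLISHED

  print('check: netstat -tan | grep %d' % port)
  time.sleep(60)               # keep the process up so the stuck
                               # pair stays visible in netstat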

There was a thread some time ago:

http://www.mail-archive.com/hbase-u...@hadoop.apache.org/msg03329.html

There was no real conclusion.  I have played with the config params
suggested on that thread, but no luck yet.  Also, in that case the problem
seemed to be between datanodes during replication operations - not the
case for me.  Changing timeouts to avoid the slow increase might not
really solve the problem if the master peer has in fact ceased to read its
socket: the data outstanding in the TCP stack buffer would be lost.
Whether that would imply data loss at the application level is beyond me.
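
For reference, the 480000 millis in the datanode log below looks like the
default of dfs.datanode.socket.write.timeout, which, if I read the old
thread right, is one of the params suggested there.  This is the sort of
change I have been trying in hdfs-site.xml (0 disables the write timeout
entirely; just one of the values I experimented with, not a recommended
fix):

  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <!-- default is 480000 ms; 0 disables the timeout -->
    <value>0</value>
  </property>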

I am posting this here because, although the only logs with
errors/exceptions are the datanode logs, netstat and wireshark seem to
indicate that the problem is on the master side.

The master, namenode, regionserver and zookeeper logs show no warnings
or errors.  The datanode log shows this, over and over:

2010-07-16 00:33:09,269 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1028643313-10.1.1.200-50010-1279026099917, infoPort=50075, ipcPort=50020):Got exception while serving blk_3684861726145519813_22386 to /127.0.0.1:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:54774]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
        at java.lang.Thread.run(Thread.java:619)

2010-07-16 00:33:09,269 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1028643313-10.1.1.200-50010-1279026099917, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:54774]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
        at java.lang.Thread.run(Thread.java:619)

If there is any other info that might help, or any steps you would like
me to take, just let me know.

Thanks

Thomas Downing
