Hello,

I'm testing a cluster with Hadoop 0.18.1 / HBase 0.18.0.

Over the last few days, a problem has arisen with my HDFS:

My topology is 4 nodes. 3 nodes run a DataNode and a RegionServer, and one runs
the HBase master, the NameNode and the Secondary NameNode. The cluster works for
some hours, then one of the DataNodes freezes, using 100% of the CPU.
Half an hour later, another node freezes, using 0% CPU.

Concerning the looping process, I tried to connect JConsole to it,
but unfortunately the process doesn't respond. I then tried to dump the stack
trace, but I can only see 365 of the 385 total threads before the jstack
utility freezes as well (also using 100% of the CPU). I'll attach here the
(partial) stack dump of the 100%-CPU DataNode, and the (complete) stack dump
of the 0%-CPU one (320 threads).

Here are the last signs of life from the frozen 100%-CPU DataNode:



2008-12-23 03:00:16,791 INFO org.apache.hadoop.dfs.DataNode: Deleting block
blk_-7774912183793020552_688332 file
/home/hadoop/var/hadoop-datastore/hadoop-hadoop/dfs/data/current/subdir30/subdir36/blk_-7774912183793020552
2008-12-23 03:00:16,811 INFO org.apache.hadoop.dfs.DataNode: Receiving block
blk_-4972564414025526862_688374 src: /192.168.1.13:47796 dest:
/192.168.1.13:50010
2008-12-23 03:00:16,852 INFO org.apache.hadoop.dfs.DataNode: Receiving block
blk_-2067646347119274088_688375 src: /192.168.1.10:35808 dest:
/192.168.1.10:50010
2008-12-23 03:00:16,852 INFO org.apache.hadoop.dfs.DataNode: writeBlock
blk_-2067646347119274088_688375 received exception java.io.IOException:
Block blk_-2067646347119274088_688375 is valid, and cannot be written to.
2008-12-23 03:00:16,853 ERROR org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(192.168.1.15:50010,
storageID=DS-991601312-127.0.1.1-50010-1227525626257, infoPort=50075,
ipcPort=50020):DataXceiver: java.io.IOException: Block
blk_-2067646347119274088_688375 is valid, and cannot be written to.
        at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:892)
        at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:2320)
        at
org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187)
        at
org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
        at java.lang.Thread.run(Thread.java:619)

2008-12-23 03:00:16,854 INFO org.apache.hadoop.dfs.DataNode: Receiving block
blk_-5747903783599753912_688381 src: /192.168.1.13:47801 dest:
/192.168.1.13:50010


The last line concerns the replication of a block coming from the machine
that freezes half an hour later (might they be linked?). Here is the log from
the other machine at the same time:



2008-12-23 03:00:18,803 INFO org.apache.hadoop.dfs.DataNode: Receiving block
blk_-5747903783599753912_688381 src: /192.168.1.13:50537 dest:
/192.168.1.13:50010
2008-12-23 03:00:18,852 INFO org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(192.168.1.13:50010,
storageID=DS-1681396969-127.0.1.1-50010-1227536709605, infoPort=50075,
ipcPort=50020) Served block blk_6825954372982263356_688368 to /192.168.1.13
2008-12-23 03:00:18,857 INFO org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(192.168.1.13:50010,
storageID=DS-1681396969-127.0.1.1-50010-1227536709605, infoPort=50075,
ipcPor
[...]
2008-12-23 03:02:18,857 INFO org.apache.hadoop.dfs.DataNode: PacketResponder
blk_-5747903783599753912_688381 2 Exception java.net.SocketTimeoutException:
Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at
org.apache.hadoop.dfs.DataNode$PacketResponder.run(DataNode.java:2148)
        at java.lang.Thread.run(Thread.java:619)

2008-12-23 03:02:18,857 INFO org.apache.hadoop.dfs.DataNode: PacketResponder
2 for block blk_-5747903783599753912_688381 terminating


And finally, the last line from this machine:



2008-12-23 03:25:04,073 INFO org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(192.168.1.13:50010,
storageID=DS-1681396969-127.0.1.1-50010-1227536709605, infoPort=50075,
ipcPort=50020) Served block blk_557286790531942083_687584 to /192.168.1.13


Not very relevant, I think.


Has anyone run into the same problem, or does anyone have an idea for getting a
complete stack dump, so I can see in which method the DataNode is looping? Note
that my guess is that the JVM is in an inconsistent state, because it doesn't
respond to either the -QUIT or the -INT signal, and the jstack utility needs
the -F option to get a stack trace. If it were a normal method looping, the
methods above should work.
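(One workaround I'm considering, in case it helps anyone in the same situation: instead of attaching jstack from outside, the dump could be taken from inside the process via JMX's ThreadMXBean, e.g. exposed through a small servlet or a periodic logger. The `ThreadDumper` class below is only a hypothetical sketch of that idea, not anything from the Hadoop code base; of course, if the VM itself is wedged, even an in-process dumper thread may never get scheduled.)

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical sketch: dump all thread stacks from inside the JVM,
// as a fallback when an external "jstack <pid>" attach hangs.
// ThreadMXBean.dumpAllThreads is available since Java 6.
public class ThreadDumper {

    public static String dumpAllStacks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip locked-monitor/synchronizer details, cheaper call
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" state=").append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Could just as well be written to a log file on a timer.
        System.out.print(dumpAllStacks());
    }
}
```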

JVM version:

java version "1.6.0_07"

Java(TM) SE Runtime Environment (build 1.6.0_07-b06)

Java HotSpot(TM) Client VM (build 10.0-b23, mixed mode, sharing)



Thanks to all.

-- Jean-Adrien

Attached:

Stack dump of 192.168.1.15 (100% cpu, not responding): 
http://www.nabble.com/file/p21141905/jstack.1.out jstack.1.out 

Stack dump of 192.168.1.13 (0% cpu, dead but responding): 
http://www.nabble.com/file/p21141905/jstack.0.out jstack.0.out 


