Hi,

I'm running an HDFS cluster with 37 datanodes. 12 of the nodes have 20TB of capacity each, and the other 25 have 24TB each. Unfortunately, several nodes hold much more data than the others, and their data is still growing rapidly. The output of 'dstat' shows:

dstat -ta 2
-----time----- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
  date/time   |usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
24-06 00:42:43|  1   1  95   2   0   0|  25M   62M|   0     0 |   0   0.1 |3532  5644
24-06 00:42:45|  7   1  91   0   0   0|  16k  176k|8346B 1447k|   0     0 |1201   365
24-06 00:42:47|  7   1  91   0   0   0|  12k  172k|9577B 1493k|   0     0 |1223   334
24-06 00:42:49| 11   3  83   1   0   1|  26M   11M|  78M   66M|   0     0 | 12k   18k
24-06 00:42:51|  4   3  90   1   0   2|  17M  181M| 117M   53M|   0     0 | 15k   26k
24-06 00:42:53|  4   3  87   4   0   2|  15M  375M| 117M   55M|   0     0 | 16k   26k
24-06 00:42:55|  3   2  94   1   0   1|  15M   37M|  80M   17M|   0     0 | 10k   15k
24-06 00:42:57|  0   0  98   1   0   0|  18M   23M|7259k 5988k|   0     0 |1932  1066
24-06 00:42:59|  0   0  98   1   0   0|  16M  132M| 708k  106k|   0     0 |1484   491
24-06 00:43:01|  4   2  91   2   0   1|  23M   64M|  76M   41M|   0     0 |8441   13k
24-06 00:43:03|  4   3  88   3   0   1|  17M  207M|  91M   48M|   0     0 | 11k   16k

From the dstat output, we can see that write throughput is much higher than read throughput. I've started a balancer process, with dfs.balance.bandwidthPerSec set to bytes. From the balancer log, I can see that the balancer is working, but it cannot keep up with the incoming writes.

Right now the only way I can stop the runaway growth is to stop the datanode, set dfs.datanode.du.reserved to 300GB, and start the datanode again. The growth stops only once the node has filled up to the 300GB reservation line.

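
For reference, here is a minimal sketch of how the reservation value works out, since dfs.datanode.du.reserved is given in bytes and applies per volume. It assumes "300GB" means 300 GiB; the helper names reserved_bytes and property_xml are made up for this illustration, and the printed fragment is only what I would expect to put in hdfs-site.xml, not a copy of our actual config.

```python
# Sketch: convert a reservation in GiB to the byte value expected by
# dfs.datanode.du.reserved and print it as an hdfs-site.xml style property.
# Assumption: "300GB" means 300 GiB (1 GiB = 1024**3 bytes).

GIB = 1024 ** 3


def reserved_bytes(gib: int) -> int:
    """Return the reservation in bytes for a size given in GiB."""
    return gib * GIB


def property_xml(name: str, value: int) -> str:
    """Render one Hadoop configuration property (hypothetical helper)."""
    return (
        "<property>\n"
        f"  <name>{name}</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    print(property_xml("dfs.datanode.du.reserved", reserved_bytes(300)))
```

Because the setting is per volume, a node with several data directories reserves this amount on each of them, so the total space held back on the node is correspondingly larger.
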
The output of 'hadoop dfsadmin -report' for the runaway nodes shows:

Name: 10.150.161.88:50010
Decommission Status : Normal
Configured Capacity: 20027709382656 (18.22 TB)
DFS Used: 14515387866480 (13.2 TB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 5512321516176(5.01 TB)
DFS Used%: 72.48%
DFS Remaining%: 27.52%
Last contact: Wed Jun 29 21:03:01 CST 2011

Name: 10.150.161.76:50010
Decommission Status : Normal
Configured Capacity: 20027709382656 (18.22 TB)
DFS Used: 16554450730194 (15.06 TB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 3473258652462(3.16 TB)
DFS Used%: 82.66%
DFS Remaining%: 17.34%
Last contact: Wed Jun 29 21:03:02 CST 2011

while the other, normal datanodes look like this:

Name: 10.150.161.65:50010
Decommission Status : Normal
Configured Capacity: 23627709382656 (21.49 TB)
DFS Used: 5953984552236 (5.42 TB)
Non DFS Used: 1200643810004 (1.09 TB)
DFS Remaining: 16473081020416(14.98 TB)
DFS Used%: 25.2%
DFS Remaining%: 69.72%
Last contact: Wed Jun 29 21:03:01 CST 2011

Name: 10.150.161.80:50010
Decommission Status : Normal
Configured Capacity: 23627709382656 (21.49 TB)
DFS Used: 5982565373592 (5.44 TB)
Non DFS Used: 1202701691240 (1.09 TB)
DFS Remaining: 16442442317824(14.95 TB)
DFS Used%: 25.32%
DFS Remaining%: 69.59%
Last contact: Wed Jun 29 21:03:02 CST 2011

Any hints on this issue? We are using 0.20.2-cdh3u0.

Thanks and regards,
Mao Xu-Feng
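
P.S. In case it helps with comparing nodes, below is a rough sketch of how the 'DFS Used%' figures could be pulled out of the report text and ranked. The function name used_percent_by_node is hypothetical, the sample text is trimmed to two of the nodes above, and the comparison is against a simple mean of the listed percentages, which is only an approximation of true cluster utilization (total used over total capacity).

```python
# Sketch: extract "DFS Used%" per datanode from 'hadoop dfsadmin -report'
# text and rank nodes by how far they sit from the mean of those values.
import re


def used_percent_by_node(report: str) -> dict:
    """Map each datanode address to its 'DFS Used%' value."""
    # Non-greedy match from each "Name:" to the next "DFS Used%:" field;
    # DOTALL lets the match span line breaks in the report.
    pattern = re.compile(r"Name:\s*(\S+).*?DFS Used%:\s*([\d.]+)%", re.DOTALL)
    return {name: float(pct) for name, pct in pattern.findall(report)}


# Trimmed sample taken from the report above (other fields omitted).
SAMPLE = """
Name: 10.150.161.88:50010
DFS Used: 14515387866480 (13.2 TB)
DFS Used%: 72.48%
Name: 10.150.161.65:50010
DFS Used: 5953984552236 (5.42 TB)
DFS Used%: 25.2%
"""

if __name__ == "__main__":
    usage = used_percent_by_node(SAMPLE)
    mean = sum(usage.values()) / len(usage)
    # Print the fullest nodes first, with their offset from the mean.
    for node, pct in sorted(usage.items(), key=lambda kv: -kv[1]):
        print(f"{node}  {pct:6.2f}%  ({pct - mean:+.2f} vs mean)")
```
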
