Hi all,

We are running Hadoop 0.20.2 with HBase 0.20.6 on about 30 nodes. Our cluster is under heavy write load, which causes HBase to perform many compactions, which in turn creates many files with many new blocks in HDFS. The problem is that several datanodes periodically run out of disk space; they consume space very rapidly, and we have seen 5 TiB exhausted in a single day.

We investigated the problem and created a patch in FSNamesystem.invalidateWorkForOneNode(). The patch changes how the node to invalidate is selected: in the original version the first node is always chosen, and we changed this to a random node. After applying the patch the problem seems to disappear, and all of our DNs now show constant, balanced disk usage. The question is: is this a general issue, or are we missing something? What is the reason for always taking the first node to invalidate, when this can potentially starve the other nodes?
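
For illustration, here is a minimal sketch of the selection change we mean. This is not the actual FSNamesystem code or our patch; the class and method names are hypothetical stand-ins for the node-selection step inside invalidateWorkForOneNode().

    import java.util.List;
    import java.util.Random;

    public class InvalidationTargetChooser {

        private final Random random = new Random();

        // Original behaviour: always pick the first datanode that has
        // pending block invalidations, which can starve the others.
        String chooseFirst(List<String> nodesWithInvalidateWork) {
            return nodesWithInvalidateWork.isEmpty()
                    ? null
                    : nodesWithInvalidateWork.get(0);
        }

        // Patched behaviour: pick a random datanode so invalidation work
        // (and hence disk reclamation) is spread across the cluster.
        String chooseRandom(List<String> nodesWithInvalidateWork) {
            return nodesWithInvalidateWork.isEmpty()
                    ? null
                    : nodesWithInvalidateWork.get(
                          random.nextInt(nodesWithInvalidateWork.size()));
        }
    }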

Thanks,
 Jan