Hi, Our cluster is sometimes busy, and some of the slave nodes (every node runs a DT, a TT, a regionserver and zookeeper.HQuorumPeer) end up under high load.
Today, when I looked at the NN web report (dfshealth.jsp), I found a dead DT. But when I logged in to that node, everything on it seemed normal. At that time the JT could reach the TT on this node, but the NN could not reach the DT (it had marked it as dead), HBase could not reach the regionserver, and Ganglia showed the server as down.

After a while, Ganglia and the JT showed this server as normal again, but the NN and the HBase master still did not. I was able to log in to the server the whole time.

My guess is that when someone submitted a big job, the load on this node became so high that the NN stopped receiving the heartbeats sent by its DT. What I don't understand is why the NN still does not receive the DT's heartbeats now that the node is back to normal and under low load.

Can someone help me?

Thanks,
Jameson Li.