Phantom wrote:
I am sure re-replication is not done on every missed heartbeat, since that would be very expensive and inefficient. At the same time, you cannot really tell whether a node is partitioned away, crashed, or just slow. Is it threshold-based, i.e., I missed N heartbeats, so re-replicate?
Yes, detection of datanode failure is threshold-based. It is currently ten minutes plus ten missed heartbeats.
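To make the threshold concrete, here is a minimal sketch of that kind of check. It is not the code from the Hadoop source: the class and method names are hypothetical, and the 3-second heartbeat interval is an assumption, since the thread does not state the interval. Only the "ten minutes plus ten missed heartbeats" rule comes from the answer above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of threshold-based failure detection.
// Names are illustrative and are not taken from the Hadoop source.
final class HeartbeatMonitor {
    // ASSUMPTION: heartbeat period of 3 seconds (not stated in the thread).
    static final long HEARTBEAT_INTERVAL_MS = 3_000L;
    // From the thread: ten minutes...
    static final long GRACE_PERIOD_MS = 10L * 60L * 1_000L;
    // ...plus ten missed heartbeats.
    static final int MISSED_HEARTBEATS = 10;
    static final long EXPIRE_INTERVAL_MS =
            GRACE_PERIOD_MS + MISSED_HEARTBEATS * HEARTBEAT_INTERVAL_MS;

    // Timestamp (in milliseconds) of the last heartbeat seen from each node.
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

    void recordHeartbeat(String nodeId, long nowMs) {
        lastHeartbeatMs.put(nodeId, nowMs);
    }

    // A node counts as dead only once its silence exceeds the full threshold,
    // so a single missed heartbeat never triggers re-replication.
    boolean isDead(String nodeId, long nowMs) {
        Long last = lastHeartbeatMs.get(nodeId);
        if (last == null) {
            return true; // design choice in this sketch: unknown nodes count as dead
        }
        return nowMs - last > EXPIRE_INTERVAL_MS;
    }
}
```

With these assumed values the threshold works out to 630 seconds. A monitor thread would call isDead() periodically for each known node and schedule re-replication of that node's blocks only when it returns true.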
Which package in the source code could I look at to glean this information?
This is in dfs/FSNamesystem.java.
Doug
