I think there is a problem, but maybe not the one we surmised. Are there any lease timeout reports in your master log?
When a lease times out, all the regions being served by that region server get reassigned almost immediately, which could be the cause of a region being assigned to another server while the original server still thought it had exclusive access to the region.

There are a couple of problems going on here if you see lease timeouts. One is that the region server is not sending its heartbeat soon enough (see HBASE-616, "We slept XXXXXX ms, ten times longer than scheduled: 3000" happens frequently). This can happen on either the master or the region server: either the master does not process the heartbeat until after the lease times out, or the region server does not send the heartbeat until after the lease times out. Both can be caused by thread starvation, due to too many runnable threads on a machine, not enough CPUs to handle the thread load, or just a bad thread scheduler. Hopefully, lease timeouts will work better after HBase is integrated with Zookeeper.

=====

In the normal case, the master will not reassign a region due to load balancing until the region server reports that it has closed the region:

Nothing happens with the mostLoadedRegions until it gets to RegionManager.unassignSomeRegions, which is called by RegionManager.assignRegions. In unassignSomeRegions, any regions selected are put into the closingRegions Set and a MSG_REGION_CLOSE is sent to the region server. Candidates for assignment are only taken from the unassignedRegions Map, so a region that is still closing cannot be handed to another server.

Not until the master receives a MSG_REPORT_CLOSE does any further action take place on that region. First the region is removed from the closingRegions Set. If the region was being split, the HRegionInfo received by the master will indicate that the region is offline and split. In this case, it does not get reassigned. Otherwise it is added to the unassignedRegions Map and is now a candidate for reassignment.
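
To make the normal-case sequence above easier to follow, here is a rough sketch of the bookkeeping in Java. It is not the actual RegionManager code: the class name, the method names, the use of plain region-name strings in place of HRegionInfo, and the Long value stored in the map are all assumptions made for illustration. Only the closingRegions and unassignedRegions names and the order of the steps come from the description above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the master-side bookkeeping; NOT HBase's RegionManager.
class RegionAssignmentSketch {
    // Regions the master has asked a region server to close (MSG_REGION_CLOSE sent).
    private final Set<String> closingRegions = new HashSet<>();
    // Regions eligible for assignment. The Long value is a placeholder (assumption).
    private final Map<String, Long> unassignedRegions = new HashMap<>();

    // Load balancing picked this region: mark it closing. It is deliberately
    // NOT added to unassignedRegions yet, so it cannot be reassigned.
    void unassignForLoadBalancing(String regionName) {
        closingRegions.add(regionName);
    }

    // The region server reported the close (MSG_REPORT_CLOSE received).
    void onReportClose(String regionName, boolean offline, boolean split) {
        closingRegions.remove(regionName);
        if (offline && split) {
            return; // parent of a split: never reassigned
        }
        unassignedRegions.put(regionName, 0L);
    }

    boolean isClosing(String regionName) {
        return closingRegions.contains(regionName);
    }

    boolean isCandidateForAssignment(String regionName) {
        return unassignedRegions.containsKey(regionName);
    }

    public static void main(String[] args) {
        RegionAssignmentSketch master = new RegionAssignmentSketch();
        master.unassignForLoadBalancing("regionA");
        System.out.println("candidate before close report: "
                + master.isCandidateForAssignment("regionA"));
        master.onReportClose("regionA", false, false);
        System.out.println("candidate after close report: "
                + master.isCandidateForAssignment("regionA"));
    }
}
```

The point of the sketch is that a region only becomes a candidate once the close has been reported, which is why a double assignment should not come from load balancing alone and a lease timeout is the more likely suspect.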
