I am looking at the code in HBaseFsck#checkRegionConsistency(). It checks region consistency and repairs the corruption if requested. However, some exceptions are to be expected from this function. For example, as part of region repair it calls HBaseFsckRepair#waitUntilAssigned(); if a region stays in transition for more than 120 seconds, the timeout throws an IOException.
The problem I see is that a single exception in checkRegionConsistency() kills the entire hbck operation, because the exception propagates up. I think a better approach is to skip the troubled region and let hbck continue with the other regions. At the end, users would be left with only a few regions that need additional hbck runs or a manual fix. (One possible exception is the meta table: if a region in the meta table is not repaired successfully, we should not continue.) What do you think? Thanks, Stephen
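
To make the proposal concrete, here is a minimal sketch of the skip-and-continue idea. It is not the actual HBaseFsck code: the class name, the RegionChecker interface, the checkAll() method, and the use of plain strings for region names are all hypothetical stand-ins chosen for illustration. The only behavior taken from the description above is that a per-region check may throw IOException, that a failing region should be skipped, and that a failure on a meta region should still abort.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SkipFailedRegionsSketch {

    /** Hypothetical stand-in for the per-region check-and-repair step. */
    interface RegionChecker {
        void check(String regionName) throws IOException;
    }

    /**
     * Checks every region. A failure on an ordinary region is recorded and
     * the loop moves on; a failure on the meta region is rethrown so the
     * whole run stops. Returns the names of the regions that were skipped.
     */
    static List<String> checkAll(List<String> regions, String metaRegion,
                                 RegionChecker checker) throws IOException {
        List<String> skipped = new ArrayList<>();
        for (String region : regions) {
            try {
                checker.check(region);
            } catch (IOException e) {
                if (region.equals(metaRegion)) {
                    // Assumption from the proposal: do not continue past a broken meta region.
                    throw e;
                }
                // Remember the region so it can be reported at the end of the run.
                skipped.add(region);
            }
        }
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a repair that times out for one region only.
        RegionChecker checker = region -> {
            if (region.equals("region-2")) {
                throw new IOException("region stuck in transition: " + region);
            }
        };
        List<String> skipped = checkAll(
                Arrays.asList("region-1", "region-2", "region-3"), "meta", checker);
        System.out.println("Regions needing another hbck run or a manual fix: " + skipped);
    }
}
```

With this shape, the list of skipped regions could be printed in the final hbck report so users know which regions to rerun or fix by hand.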
