I am sure the following logic is a bug, but I'd like to know the rationale
behind it so that I can fix it correctly.
In HBaseFsck#checkRegionConsistency(), we skip regions that were
modified recently. This is undesirable (at least in the situation I am
testing).
I can easily repro the problem by modifying an existing unit test,
TestHBaseFsck#testOverlapAndOrphan():
- All unit tests pass with 0 as the time lag for recently changed regions.
The default is 60 seconds, so I changed the test to use the default value.
- I then ran the UT. It generates an orphaned HDFS region by removing
regioninfo in the dir.
- The HBCK repair code creates a new region to repair the problem.
- However, the new region is skipped in HBaseFsck#checkRegionConsistency()
and hence is not assigned or added to META.
- At the end, the UT fails because the repair did not fix the error.
{code}
private void checkRegionConsistency(final String key, final HbckInfo hbi)
...
boolean recentlyModified = inHdfs &&
    hbi.getModTime() + timelag > System.currentTimeMillis();
...
} else if (recentlyModified) {
  LOG.warn("Region " + descriptiveName + " was recently modified -- skipping");
  return;
}
...
}
{code}
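To make the arithmetic of that check concrete, here is a minimal standalone sketch. It only mirrors the boolean expression quoted above; the class name, method name, and parameters are my own illustration and not HBase API. It assumes the repair rewrote the region directory one second before the consistency check runs.

```java
// Simplified stand-in for the check in HBaseFsck#checkRegionConsistency().
// The class, method, and parameter names are illustrative, not HBase API.
final class RecentlyModifiedSketch {

    /** Mirrors: inHdfs && hbi.getModTime() + timelag > System.currentTimeMillis() */
    static boolean isRecentlyModified(boolean inHdfs, long modTimeMs,
                                      long timelagMs, long nowMs) {
        return inHdfs && modTimeMs + timelagMs > nowMs;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Assumption: the repair touched the region directory one second ago.
        long repairedAt = now - 1_000L;

        // timelag = 0 (what the unit tests use): the region is checked.
        System.out.println(isRecentlyModified(true, repairedAt, 0L, now));      // false
        // timelag = 60 s (the default): the freshly repaired region is skipped.
        System.out.println(isRecentlyModified(true, repairedAt, 60_000L, now)); // true
    }
}
```

With the default time lag, any region that the repair touched less than 60 seconds earlier satisfies the condition, which matches the skipped region seen in the modified test.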
If I change the timelag from 0 to 60 seconds (the default value) and run
the UTs in TestHBaseFsck, many of them fail. I think this is a valid customer
scenario: people usually do not change default values unless they know what
they are doing.
(Surprisingly, I could not find any complaints in a Google search. Maybe
HBase is so reliable that we never hit this particular corruption in
production :-)
Note: the workaround is to run hbck/repair twice; the second run fixes
the issue. Maybe our customers simply run hbck multiple times
before reporting issues.)
I have not gone back through the history to find out why this logic was
implemented in the first place. Does anyone on this list know the reasoning
behind it? Should I simply remove it, or do I need to add some information to
hbi to indicate that a target region should not be skipped?
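For the second option, this is roughly what I have in mind, as a sketch only. Everything here is hypothetical: RegionState stands in for HbckInfo, and the repairedInThisRun flag does not exist today; the repair code would have to set it when it recreates a region.

```java
// Hypothetical sketch: neither this class nor the flag exists in HBase.
final class SkipDecisionSketch {

    /** Minimal stand-in for the per-region state hbck tracks (HbckInfo). */
    static final class RegionState {
        final boolean inHdfs;
        final long modTimeMs;
        boolean repairedInThisRun; // hypothetical flag, set by the repair code

        RegionState(boolean inHdfs, long modTimeMs) {
            this.inHdfs = inHdfs;
            this.modTimeMs = modTimeMs;
        }
    }

    /** Returns true when the consistency check should skip the region. */
    static boolean shouldSkip(RegionState r, long timelagMs, long nowMs) {
        boolean recentlyModified = r.inHdfs && r.modTimeMs + timelagMs > nowMs;
        // A region that hbck itself just rewrote is expected to look "recent",
        // so the flag lets the check proceed instead of returning early.
        return recentlyModified && !r.repairedInThisRun;
    }
}
```

This keeps the existing skip for regions changed by something other than hbck, while letting a region created by the repair go on to be assigned and added to META in the same run.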
Thanks
Stephen