HBase 2 changed how the RIT and RIT-over-threshold metrics are
calculated. In HBase 1 they were derived from region assignment state.
Since the introduction of AMv2, they are instead tied to whether a
TRSP (TransitRegionStateProcedure) is executing. The problem is that
the two do not always correspond. Bugs, operator administrative
activity, or operator error can leave a region offline when it should
be assigned, while the RIT and RIT-over-threshold metrics both read 0.
We encountered this state in production, and it got us thinking more
deeply about RIT tracking.

After HBASE-28158 (Decouple RIT list management from TRSP invocation),
a region is considered in transition whenever its current state is not
the desired terminal state for the table's 'enabled' status. If a
table is enabled and one of its regions is not in OPEN state, that
region is, by this new definition, in transition (and perhaps stuck).
Conversely, if a table is disabled and one of its regions is not in
CLOSED state, that region is in transition (and perhaps stuck).
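
To make the new definition concrete, here is a minimal sketch of the
rule as described above. It is not the HBASE-28158 code: the class
RitSketch, the State enum, and both method names are hypothetical, and
the enum is a simplified subset of region states chosen only for
illustration.

```java
// Illustrative sketch only. All names here are hypothetical and are not
// taken from the HBase code base or from the HBASE-28158 patch.
final class RitSketch {
  // Simplified subset of region states, for illustration.
  enum State { OFFLINE, OPENING, OPEN, CLOSING, CLOSED }

  // Desired terminal state: OPEN if the table is enabled, CLOSED if disabled.
  static State desiredState(boolean tableEnabled) {
    return tableEnabled ? State.OPEN : State.CLOSED;
  }

  // A region is in transition whenever it is not at the desired terminal
  // state, regardless of whether any procedure is running for it.
  static boolean isInTransition(boolean tableEnabled, State current) {
    return current != desiredState(tableEnabled);
  }

  public static void main(String[] args) {
    System.out.println(isInTransition(true, State.OPEN));     // false
    System.out.println(isInTransition(true, State.OFFLINE));  // true
    System.out.println(isInTransition(false, State.CLOSED));  // false
    System.out.println(isInTransition(false, State.OPEN));    // true
  }
}
```

The second case is the one we hit: an offline region of an enabled
table counts as in transition even when no TRSP is executing for it.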

We are going to adopt this change in our 2.5-based production, but I
want to run it by the community before backporting it in open source
all the way to 2.5, which would include 2.6 as well. The RIT and
RIT-over-threshold metrics will be calculated differently (IMHO, now
correctly), so this may affect your production metrics and monitoring.
I can stop at branch-2 for now or take it all the way back.

Are there any concerns?

-- 
Best regards,
Andrew
