[ https://issues.apache.org/jira/browse/HBASE-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764687#comment-16764687 ]
Bahram Chehrazy commented on HBASE-21788:
-----------------------------------------
The root cause of this and HBASE-21844 seems to be the same. The ProcWAL gets
corrupted, the master restarts, finds the dead servers and adds them to the
deadServer list, but fails to resume the incomplete procedures because of the
corrupted WAL, and then waits forever for those regions to become OPEN. If one
of those regions happens to be meta, the master gets stuck during
initialization (HBASE-21844).
I'm still not sure about the root cause of the corruption, but isn't it
possible to resume orphan procedures based on the current state during master
restart? At least for OpenRegionProcedure, that should be easy, shouldn't it?
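
To make the idea concrete, here is a rough sketch of what "resume based on the
current state" could look like. It is not HBase code: OrphanRegionReconciler,
State and findOrphans are hypothetical names made up for illustration, and the
real master tracks far more state than a map of region names. The only point is
that a region that is still OPENING after restart, with no recovered procedure
driving it, can be detected and handed to a fresh open attempt instead of being
waited on forever.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch; none of these names are real HBase APIs. */
public class OrphanRegionReconciler {

    /** Simplified region states, for illustration only. */
    public enum State { OPEN, OPENING, CLOSED }

    /**
     * Returns the regions that are OPENING but have no recovered procedure
     * driving them. These are the candidates for a newly scheduled open.
     */
    public static List<String> findOrphans(Map<String, State> regionStates,
                                           Set<String> regionsWithRecoveredProcedure) {
        List<String> orphans = new ArrayList<>();
        for (Map.Entry<String, State> entry : regionStates.entrySet()) {
            boolean opening = entry.getValue() == State.OPENING;
            boolean hasProcedure = regionsWithRecoveredProcedure.contains(entry.getKey());
            if (opening && !hasProcedure) {
                orphans.add(entry.getKey());
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        // TreeMap keeps iteration order deterministic for the example.
        Map<String, State> states = new TreeMap<>();
        states.put("meta", State.OPENING); // procedure lost with the corrupted WAL
        states.put("r1", State.OPENING);   // procedure was recovered
        states.put("r2", State.OPEN);      // nothing to do

        Set<String> recovered = new HashSet<>();
        recovered.add("r1");

        // Only "meta" is OPENING without a procedure behind it.
        System.out.println(findOrphans(states, recovered)); // prints [meta]
    }
}
```

In this sketch the decision depends only on the state the master can observe
after restart, not on the contents of the WAL, which is why it would still work
when the WAL cannot be replayed.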
> OpenRegionProcedure (after recovery?) is unreliable and needs to be improved
> ----------------------------------------------------------------------------
>
> Key: HBASE-21788
> URL: https://issues.apache.org/jira/browse/HBASE-21788
> Project: HBase
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Sergey Shelukhin
> Assignee: stack
> Priority: Critical
>
> Not much for this one yet.
> I repeatedly see cases where a region is stuck in OPENING, and after
> master restart the RIT is recovered and stays WAITING; its OpenRegionProcedure
> (also recovered) is stuck in Runnable and never does anything for hours. I
> cannot find logs on the target server indicating that it ever tried to do
> anything after the master restart.
> This procedure needs at the very least logging of what it's trying to do, and
> maybe a timeout so it unconditionally fails after a configurable period (1
> hour?).
> I may also investigate why it doesn't do anything and file a separate bug. I
> wonder if it's somehow related to the region status check, but this is just a
> hunch.