[
https://issues.apache.org/jira/browse/HBASE-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820416#comment-16820416
]
Sergey Shelukhin commented on HBASE-22081:
------------------------------------------
Not sure what actually positively requires proc executor and WAL. I don't think
there's any shutdown-specific procedure logic that needs to be executed. As for
the procedures already in progress during shutdown, their state needs to be
saved, but what stage we interrupt them on - a bit earlier or a bit later -
seems to be a race anyway, so might as well do it early.
I'll provide a simple patch, we'll see what tests break.
> master shutdown: close RpcServer first thing, close procWAL as soon as
> viable, and delete znode the last thing
> --------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-22081
> URL: https://issues.apache.org/jira/browse/HBASE-22081
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Major
>
> I had a master get stuck due to HBASE-22079 and noticed it was logging RS
> abort messages during shutdown.
> [~bahramch] found some issues where messages are processed by old master
> during shutdown due to a race condition in RS cache (or it could also happen
> due to a network race).
> Previously I found some bug where SCP was created during master shutdown that
> had incorrect state (because some structures already got cleaned).
> I think before master fencing is implemented we can at least make these
> issues much less likely by thinking about shutdown order.
> 1) First kill RPC server so we don't receive any more messages. There's no
> need to receive messages when we are shutting down. Server heartbeats could
> be impacted I guess, but I don't think they will be because we currently only
> kill RS on ZK timeout.
> 2) Then do whatever cleanup we think is needed that requires proc wal.
> 3) Then close proc WAL so no errant threads can create more procs.
> 4) Then do whatever other cleanup.
> 5) Finally delete znode.
> Right now znode is deleted somewhat early I think, and RpcServer is closed
> very late.
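
The five-step order proposed in the description above can be sketched roughly as follows. This is only an illustration, not the actual patch: the class `MasterShutdownSketch` and all of its methods are hypothetical names and are not HBase APIs. It assumes that stopping the RPC server simply makes incoming messages get dropped, and that closing the proc WAL makes any later attempt to schedule a procedure fail fast.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: none of these classes or methods are HBase APIs.
final class MasterShutdownSketch {
    private final AtomicBoolean rpcOpen = new AtomicBoolean(true);
    private final AtomicBoolean procWalOpen = new AtomicBoolean(true);
    private final AtomicBoolean znodePresent = new AtomicBoolean(true);
    private final List<String> log = new ArrayList<>();

    /** Stand-in for an incoming RPC, e.g. a region server report. */
    boolean handleRpc(String message) {
        // Once the RPC server is stopped, messages are no longer processed.
        return rpcOpen.get();
    }

    /** Stand-in for scheduling a procedure, e.g. an SCP. */
    void submitProcedure(String name) {
        if (!procWalOpen.get()) {
            // An errant thread cannot create new procedures after the WAL is closed.
            throw new IllegalStateException("proc WAL closed, rejecting " + name);
        }
        log.add("persisted " + name);
    }

    boolean isZnodePresent() {
        return znodePresent.get();
    }

    /** Runs the proposed order and returns the recorded steps. */
    List<String> shutdown() {
        rpcOpen.set(false);                      // 1) stop receiving messages
        log.add("1 stop RPC server");
        log.add("2 cleanup that needs proc WAL"); // 2) placeholder for real cleanup
        procWalOpen.set(false);                  // 3) no more procedures after this
        log.add("3 close proc WAL");
        log.add("4 other cleanup");              // 4) placeholder for real cleanup
        znodePresent.set(false);                 // 5) znode goes last
        log.add("5 delete znode");
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        MasterShutdownSketch master = new MasterShutdownSketch();
        for (String step : master.shutdown()) {
            System.out.println(step);
        }
    }
}
```

The point of the ordering is that the znode, which tells everyone else this master is gone, is removed only after nothing in the process can still accept messages or create procedures.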