[jira] [Commented] (HBASE-22081) master shutdown: close RpcServer and procWAL first thing

Sergey Shelukhin (JIRA) Mon, 29 Apr 2019 18:12:17 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829874#comment-16829874
 ]


Sergey Shelukhin commented on HBASE-22081:
------------------------------------------

This patch is getting more and more interesting.
Looks like some procedures do not handle InterruptedIOException correctly: 
they retry forever, which in the case of the minicluster prevents it from 
shutting down. Not sure how the order of termination affects this; probably 
the proc WAL terminating early just catches the proc in the test in a 
different state than it did before.
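
To illustrate the kind of handling I mean, here is a minimal sketch, not 
the actual procedure code. RetrySketch, IoCall and retryUnlessInterrupted 
are hypothetical names made up for this example. The idea is that an 
ordinary IOException may be retried, but InterruptedIOException is treated 
as a signal to stop: the interrupt flag is restored and the exception is 
propagated instead of being retried.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

// Hypothetical helper, for illustration only; not part of HBase.
final class RetrySketch {

    /** An operation that may fail with an IOException. */
    interface IoCall<T> {
        T call() throws IOException;
    }

    /**
     * Retries op on IOException up to maxAttempts times, but never retries
     * an InterruptedIOException: that one is rethrown immediately.
     */
    static <T> T retryUnlessInterrupted(IoCall<T> op, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (InterruptedIOException e) {
                // Interruption usually means shutdown: restore the flag so
                // callers further up can see it, and stop retrying.
                Thread.currentThread().interrupt();
                throw e;
            } catch (IOException e) {
                // Assumed to be transient in this sketch; try again.
                last = e;
            }
        }
        throw last;
    }
}
```

The catch order matters here: InterruptedIOException is a subclass of 
IOException, so it has to be caught first or it falls into the retry branch.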

> master shutdown: close RpcServer and procWAL first thing
> --------------------------------------------------------
>
>                 Key: HBASE-22081
>                 URL: https://issues.apache.org/jira/browse/HBASE-22081
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>         Attachments: HBASE-22081.01.patch, HBASE-22081.02.patch, 
> HBASE-22081.patch
>
>
> I had a master get stuck due to HBASE-22079 and noticed it was logging RS 
> abort messages during shutdown.
> [~bahramch] found some issues where messages are processed by the old 
> master during shutdown due to a race condition in the RS cache (or it could 
> also happen due to a network race).
> Previously I found a bug where an SCP with incorrect state was created 
> during master shutdown (because some structures had already been cleaned up).
> I think before master fencing is implemented we can at least make these 
> issues much less likely by thinking about shutdown order.
> 1) First kill the RPC server so we don't receive any more messages. There's 
> no need to receive messages when we are shutting down. Server heartbeats 
> could be impacted I guess, but I don't think they will be, because we 
> currently only kill an RS on ZK timeout.
> 2) Then do whatever cleanup we think is needed that requires proc wal.
> 3) Then close proc WAL so no errant threads can create more procs.
> 4) Then do whatever other cleanup.
> 5) Finally delete znode.
> Right now the znode is deleted somewhat early, I think, and RpcServer is 
> closed very late.
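
The five steps in the description can be sketched as a fixed sequence of 
phases. This is only an illustration under assumptions: MasterShutdownSketch, 
Phase and the phase names are hypothetical and do not correspond to actual 
HMaster code. An EnumMap iterates in enum declaration order, so the actions 
run in the proposed order regardless of the order in which they were 
registered.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an ordered shutdown; not the HMaster implementation.
final class MasterShutdownSketch {

    /** Phases listed in the order proposed in the issue description. */
    enum Phase {
        STOP_RPC_SERVER,          // 1) stop receiving messages
        CLEANUP_NEEDING_PROC_WAL, // 2) cleanup that still needs the proc WAL
        CLOSE_PROC_WAL,           // 3) no new procs can be created after this
        OTHER_CLEANUP,            // 4) everything else
        DELETE_ZNODE              // 5) last step
    }

    // EnumMap keeps its entries sorted by enum ordinal.
    private final Map<Phase, Runnable> actions = new EnumMap<>(Phase.class);

    /** Registers the action for a phase; registration order is irrelevant. */
    MasterShutdownSketch on(Phase phase, Runnable action) {
        actions.put(phase, action);
        return this;
    }

    /** Runs the registered actions in phase order and returns what ran. */
    List<Phase> shutdown() {
        List<Phase> executed = new ArrayList<>();
        for (Map.Entry<Phase, Runnable> entry : actions.entrySet()) {
            entry.getValue().run();
            executed.add(entry.getKey());
        }
        return executed;
    }
}
```

Encoding the order in the enum rather than in the call sites means a new 
cleanup step has to be assigned to a phase explicitly, which makes it harder 
to accidentally run something after the proc WAL is closed.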



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
