[jira] [Commented] (HBASE-20700) Move meta region when server crash can cause the procedure to be stuck

Duo Zhang (JIRA) Fri, 08 Jun 2018 20:09:01 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-20700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16506795#comment-16506795
 ]


Duo Zhang commented on HBASE-20700:
-----------------------------------

{quote}
I worry the below check of ONLINE. Is it too specific?

    if (serverNode.isInState(ServerState.SPLITTING, ServerState.OFFLINE)) {
    if (!serverNode.isInState(ServerState.ONLINE)) {

We can see I suppose. Would be good if we could get away with it.
{quote}
I think this is the common case? If the server is not in state ONLINE, then 
there is an SCP for it, which means it has already crashed...

{quote}
I'm wary of calls to this method below setting server state inside 
setServerState because it will create the server node if it doesn't exist (It 
may not exist because it has been processed by SCP). If we call the below after 
SCP is done w/ it, the server comes back to life. You sure we will not do this?
{quote}
These methods will only be called in SCP, and at the end of SCP we will call 
removeServer to remove the ServerStateNode. Let me add some comments.
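
To make the concern concrete, here is a minimal sketch, not the actual HBase code. The class ServerNodeRegistrySketch, its String keys, and its trimmed-down State enum are all hypothetical; only the method names setServerState and removeServer come from the discussion above. It assumes setServerState creates the node when it is missing, so a call made after removeServer would bring the server back.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the server state bookkeeping; not the HBase class.
class ServerNodeRegistrySketch {
  enum State { ONLINE, SPLITTING, OFFLINE }

  private final ConcurrentMap<String, State> nodes = new ConcurrentHashMap<>();

  // Assumption: setting a state creates the node if it does not exist yet.
  void setServerState(String serverName, State state) {
    nodes.put(serverName, state);
  }

  // Assumption: called once, at the very end of crash processing.
  void removeServer(String serverName) {
    nodes.remove(serverName);
  }

  boolean hasNode(String serverName) {
    return nodes.containsKey(serverName);
  }
}
```

As long as every setServerState call happens inside the crash procedure and removeServer is the last step, no node is left behind. A setServerState call after removeServer would recreate the node, which is the hazard raised in the question.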

{quote}
What is the lifecycle for a server node now? ONLINE => SPLITTING => OFFLINE is 
what it used to be. It can still do this? But it can also go ONLINE => 
META_SPLITTING => META_SPLITTING_DONE => SPLITTING => OFFLINE? We might want to 
note this somewhere. Not obvious.
{quote}
If not carrying meta then ONLINE=>SPLITTING=>OFFLINE, otherwise 
ONLINE=>META_SPLITTING=>META_SPLITTING_DONE=>SPLITTING=>OFFLINE.
I've added comments in UnassignProcedure to say why we need these states. We can 
only fail an unassign after we make sure that the log splitting is finished, 
otherwise we may schedule an AssignProcedure which will cause data loss. And 
for unassigning meta, the SCP will wait until the RMP is finished before 
splitting other logs, so if we do not introduce special states for meta 
splitting, we will be stuck there forever...
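
The two lifecycles can be written down as a small transition table. This is only a sketch: the state names are the ones used in this comment, while the enum SketchServerState and its helpers next, canMoveTo and isValidPath are hypothetical and not part of the patch.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical enum; state names follow the comment, transitions are the two
// lifecycles described there.
enum SketchServerState {
  ONLINE, META_SPLITTING, META_SPLITTING_DONE, SPLITTING, OFFLINE;

  // States reachable in one step from this state.
  Set<SketchServerState> next() {
    switch (this) {
      case ONLINE:
        // A server carrying meta goes through META_SPLITTING first.
        return EnumSet.of(META_SPLITTING, SPLITTING);
      case META_SPLITTING:
        return EnumSet.of(META_SPLITTING_DONE);
      case META_SPLITTING_DONE:
        return EnumSet.of(SPLITTING);
      case SPLITTING:
        return EnumSet.of(OFFLINE);
      default:
        // OFFLINE is terminal in this sketch.
        return EnumSet.noneOf(SketchServerState.class);
    }
  }

  boolean canMoveTo(SketchServerState target) {
    return next().contains(target);
  }

  // True if every consecutive pair in the path is an allowed transition.
  static boolean isValidPath(SketchServerState... path) {
    for (int i = 0; i + 1 < path.length; i++) {
      if (!path[i].canMoveTo(path[i + 1])) {
        return false;
      }
    }
    return true;
  }
}
```

Both ONLINE, SPLITTING, OFFLINE and ONLINE, META_SPLITTING, META_SPLITTING_DONE, SPLITTING, OFFLINE pass isValidPath, while a jump straight from ONLINE to OFFLINE does not.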

{quote}
Oh... this is interesting.... adding the synchronized....

public synchronized void remoteCallFailed(final MasterProcedureEnv env,

... Up to this point we've been synchronizing on the objects whose state we 
change. What are you thinking by adding the synchronized? I can't see anything 
wrong w/ it.....
{quote}
It can be called from two places: one is the RemoteProcedureScheduler, when 
the remote call fails, and the other is handleRIT in SCP or RMP. I think there 
is no strong guarantee that the two will not happen at the same time, so it is 
better to make the method synchronized...
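
A minimal sketch of why the method-level lock helps, assuming two independent callers as described above. The class RemoteCallFailureSketch, its fields, and the helper raceOnce are hypothetical and only illustrate the idea; the real remoteCallFailed takes a MasterProcedureEnv and does much more.

```java
// Hypothetical procedure-like object; not the HBase UnassignProcedure.
class RemoteCallFailureSketch {
  private boolean failureHandled = false;
  private int handledCount = 0;

  // synchronized: the check and the update below run as one atomic step,
  // so two concurrent callers cannot both see failureHandled == false.
  synchronized boolean remoteCallFailed(String caller) {
    if (failureHandled) {
      return false; // the other path already handled the failure
    }
    failureHandled = true;
    handledCount++;
    return true;
  }

  synchronized int handledCount() {
    return handledCount;
  }

  // Runs both call paths concurrently and reports how many times the
  // failure was actually handled.
  static int raceOnce() {
    RemoteCallFailureSketch proc = new RemoteCallFailureSketch();
    Thread remoteCallPath = new Thread(() -> proc.remoteCallFailed("remote call"));
    Thread crashHandlingPath = new Thread(() -> proc.remoteCallFailed("handleRIT"));
    remoteCallPath.start();
    crashHandlingPath.start();
    try {
      remoteCallPath.join();
      crashHandlingPath.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return proc.handledCount();
  }
}
```

Whichever thread wins, the failure is handled exactly once. Without synchronized, both threads could pass the failureHandled check before either sets it.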

{quote}
If MoveRegionProcedure gets scheduled before RecoverMetaProcedure, what happens 
now?
{quote}
Now the RMP will not hold the same lock as the MRP, so it can interrupt the 
execution of the UnassignProcedure scheduled by the MRP. Also, if the 
UnassignProcedure is scheduled after we call handleRIT, then when 
remoteCallFailed calls isLogSplittingDone, it will find that the meta log 
splitting has already finished and give up. So there will be no deadlock 
anymore.
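
A rough sketch of the kind of check remoteCallFailed could make, under stated assumptions. The class LogSplittingCheckSketch and the exact sets of "done" states are hypothetical: it assumes meta log splitting counts as finished from META_SPLITTING_DONE onwards, that other regions must wait for OFFLINE, and that a missing node (null here) means the SCP has already completed.

```java
import java.util.EnumSet;

// Hypothetical helper; only the method name isLogSplittingDone is taken from
// the discussion, the signature and logic are assumptions.
class LogSplittingCheckSketch {
  enum State { ONLINE, META_SPLITTING, META_SPLITTING_DONE, SPLITTING, OFFLINE }

  static boolean isLogSplittingDone(State serverState, boolean meta) {
    if (serverState == null) {
      // Assumption: no node means crash processing finished and removed it.
      return true;
    }
    EnumSet<State> doneStates = meta
        // Meta only needs the meta log to be split.
        ? EnumSet.of(State.META_SPLITTING_DONE, State.SPLITTING, State.OFFLINE)
        // Other regions need all logs of the server to be split.
        : EnumSet.of(State.OFFLINE);
    return doneStates.contains(serverState);
  }
}
```

With a check like this, an unassign of meta can fail safely as soon as the server reaches META_SPLITTING_DONE, instead of waiting for a state that the SCP only reaches after the RMP has finished.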

{quote}
s/MetaProcedureInterface/MetaProcedure/
{quote}

This just follows the existing pattern: we have TableProcedureInterface, 
RegionProcedureInterface, ServerProcedureInterface, etc.




> Move meta region when server crash can cause the procedure to be stuck
> ----------------------------------------------------------------------
>
>                 Key: HBASE-20700
>                 URL: https://issues.apache.org/jira/browse/HBASE-20700
>             Project: HBase
>          Issue Type: Sub-task
>          Components: master, proc-v2, Region Assignment
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Critical
>             Fix For: 3.0.0, 2.1.0, 2.0.1
>
>         Attachments: HBASE-20700-UT.patch, HBASE-20700-v1.patch, 
> HBASE-20700.patch
>
>
> As said in HBASE-20682.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
