[ 
https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-18551:
--------------------------
      Resolution: Fixed
    Hadoop Flags: Reviewed
    Release Note: 
Unassign will not proceed if it is unable to talk to the remote server. Now it 
will expire the server it is unable to communicate with and then wait until it 
is signaled by ServerCrashProcedure that the server's logs have been split. 
Only then will it judge the unassign successful.

We do this because a subsequent assign lacking the crashed server context might 
open a region w/o first splitting logs.

  was:
Unassign will not proceed if it is unable to talk to remote server. Now it will 
expire the server it is unable to communicate with and then wait until it is 
signaled by ServerCrashProcedure that the server's logs have been split. Only 
then proceed with the unassign.

We do this because a subsequent assign lacking the crashed server context might 
open a region w/o first splitting logs.

          Status: Resolved  (was: Patch Available)

Pushed to master and branch-2 after a one-line fix for a failing test (the
patch included a vestige of other ongoing work).

> [AMv2] UnassignProcedure and crashed regionservers
> --------------------------------------------------
>
>                 Key: HBASE-18551
>                 URL: https://issues.apache.org/jira/browse/HBASE-18551
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack
>            Assignee: stack
>             Fix For: 2.0.0
>
>         Attachments: HBASE-18551.master.001.patch, 
> HBASE-18551.master.002.patch, HBASE-18551.master.003.patch
>
>
> This has been [~uagashe] and my obsession over the last few days: what should 
> an UnassignProcedure do when it dispatches a CLOSE but the CLOSE fails 
> because of a ConnectException or SocketTimeout?
> + We used to let UnassignProcedure continue presuming the Region would be 
> closed since the server is dead. BUT, if the unassign was part of a 
> MoveProcedure, the unassign would proceed and the Move would then run WITHOUT 
> first splitting logs. Bad.
> + So, we made it so UnassignProcedure failed; let the upper layers take care 
> of the failure. See HBASE-18491 that enabled this behavior. BUT, we have since 
> figured that even if the UP completes as a failure, since it gives up the 
> Region lock on completion, another procedure -- say an AssignProcedure -- 
> could cut in before the ServerCrashProcedure had finished and again there 
> could be data loss.
> + Now we are thinking the UP should hold on to the Region lock until we are 
> signalled by a ServerCrashProcedure; only then let go of the region. The UP 
> has context that is hard to pass to another procedure. Waiting on an SCP has 
> the UP living on for what could be a good amount of time. It might be ok if 
> we can suspend the procedure.
> There is a good sample scenario that came up doing the no-regions-on-master 
> issue, HBASE-18511. When meta is not on master, TestSplitTransactionOnCluster 
> is failing. It fails because, though the test completes, the tests commonly 
> kill a RegionServer. The teardown for the test runs before we've noticed the 
> aborted RS. So the disable of the table in the teardown, preparatory to our 
> deleting the test table as part of cleanup, goes to unassign regions, but the 
> unassign fails against the aborted server.
> Good stuff.
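
The third option in the description above (hold the Region lock until the
ServerCrashProcedure signals that logs are split) can be sketched roughly as
below. This is an illustrative toy, not the HBase implementation: the names
UnassignSketch, CrashTracker, CloseRpc and Outcome are all hypothetical, the
region lock is modelled as a plain ReentrantLock, and the wait blocks a thread
where the real procedure would be suspended.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; none of these types exist in HBase under these names.
public class UnassignSketch {
    enum Outcome { CLOSED_BY_SERVER, CLOSED_AFTER_LOG_SPLIT }

    /** Stand-in for dispatching a CLOSE to a region server. */
    interface CloseRpc {
        void close(String server, String region) throws IOException;
    }

    /** Stand-in for crash handling: one "logs split" future per expired server. */
    static class CrashTracker {
        private final Map<String, CompletableFuture<Void>> logsSplit = new ConcurrentHashMap<>();

        /** Expire the server; the returned future completes once its logs are split. */
        CompletableFuture<Void> expire(String server) {
            return logsSplit.computeIfAbsent(server, s -> new CompletableFuture<>());
        }

        /** Called by the simulated crash handling when log splitting has finished. */
        void signalLogsSplit(String server) {
            expire(server).complete(null);
        }
    }

    /** Holds the region lock for the whole unassign, including any wait on log splitting. */
    static Outcome unassign(String region, String server, CloseRpc rpc,
                            CrashTracker tracker, ReentrantLock regionLock) throws IOException {
        regionLock.lock();
        try {
            try {
                rpc.close(server, region);
                return Outcome.CLOSED_BY_SERVER;
            } catch (ConnectException | SocketTimeoutException e) {
                // Cannot reach the server: expire it and wait, still holding the
                // lock, so that no assign can open the region before logs are split.
                tracker.expire(server).join();
                return Outcome.CLOSED_AFTER_LOG_SPLIT;
            }
        } finally {
            regionLock.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        CrashTracker tracker = new CrashTracker();
        ReentrantLock regionLock = new ReentrantLock();
        CloseRpc deadServer = (server, region) -> {
            throw new ConnectException("connection refused: " + server);
        };
        // Simulated crash handling: signal "logs split" a little later.
        Thread scp = new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            tracker.signalLogsSplit("rs1");
        });
        scp.start();
        Outcome outcome = unassign("region-a", "rs1", deadServer, tracker, regionLock);
        scp.join();
        System.out.println(outcome);
    }
}
```

The point of the sketch is only the ordering: the lock is released after the
"logs split" signal, never before, which is what closes the window where a
competing assign could open the region without the crashed-server context.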



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
