[jira] [Commented] (HBASE-20137) TestRSGroups is flakey

stack (JIRA) Wed, 07 Mar 2018 12:43:15 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-20137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390185#comment-16390185
 ]


stack commented on HBASE-20137:
-------------------------------

bq. Moving the procedure's state from CLOSING to CLOSED could be dangerous if 
the procedure is part of a compound of procedures; i.e. the next procedure will 
think it is fine to fire, which may not be appropriate if a ServerCrashProcedure 
is ongoing. Will be back...

Above has been confirmed. Explanation is a little involved. Let me come up with 
a much more conservative patch for the case of one procedure waiting on 
another's signal when it will never arrive.

Here is the confirmation. Last night, TestSplitTransactionOnCluster failed:

https://builds.apache.org/job/HBase%20Nightly/job/branch-2/452/testReport/junit/org.apache.hadoop.hbase.regionserver/TestSplitTransactionOnCluster/org_apache_hadoop_hbase_regionserver_TestSplitTransactionOnCluster/

(Download the full log, available in a zip file published as part of the build 
artifacts, if you want to follow blow-by-blow.)

At first blush, it's a good one. In the test, we create a table, split regions, 
kill the hosting server in the midst of the split, wait on daughter regions to 
show up in a particular state, then we exit, disable the table, delete it and 
move to the next test.

Well, the abort of the hosting regionserver schedules a ServerCrashProcedure 
(SCP). In last night's test run, the Disable Table Procedure had run before the 
SCP could complete. By the time the SCP went to assign regions from the dead 
server, they'd been disabled. By the time the SCP assign actually ran, the 
regions had been deleted. The Assign could never succeed, so we got log lines 
like the one below because the region could not open on the regionserver...

{code}
2018-03-07 07:11:26,513 WARN  [ProcExecTimeout] 
assignment.AssignmentManager(1199): TODO Handle stuck in transition: 
rit=OPENING, location=2065cdf9afe1,36365,1520406613523, 
table=testShutdownFixupWhenDaughterHasSplit, 
region=6f9cb46f9fa29c263a02d3d5c3bf41ac
{code}

Fix seems straightforward: don't assign disabled/deleted regions. We should 
add this anyway.
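
A minimal sketch of that guard, in plain Java with no HBase dependencies. 
AssignGuard, TableState and shouldAssign are hypothetical names made up for 
illustration; this is not the HBase API, and the real check would have to 
consult the Master's table state rather than an in-memory map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model; NOT the HBase API.
final class AssignGuard {
    enum TableState { ENABLED, DISABLED }

    // Table name -> state. A missing entry models a deleted table.
    private final Map<String, TableState> tables = new HashMap<>();

    void setState(String table, TableState state) {
        tables.put(table, state);
    }

    void delete(String table) {
        tables.remove(table);
    }

    /** True only if the table still exists and is enabled. */
    boolean shouldAssign(String table) {
        // get() returns null for a deleted table, which fails the comparison.
        return tables.get(table) == TableState.ENABLED;
    }
}
```

With a check of this shape in front of the assign, an SCP that arrives after 
the disable or delete would skip the region instead of leaving it stuck in 
OPENING.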

But the failure comes about because of the patch made here. Going back to a 
run before this patch, I see that though the disable table is scheduled BEFORE 
the server crash procedure (by 100ms), when it comes time for the disable table 
unassign to run, the crashed server is no longer available and so the RPC 
fails. We go to expire the crashed server but it has already expired, so this 
fails; the unassign procedure is therefore stuck in the suspend state and 
doesn't complete until the SCP is done (I noticed another problem in here, in 
that we delete enabled regions, but that is for another issue). This patch made 
it so that when the expire failed, we moved the unassign from CLOSING to CLOSED 
so the subsequent disable table/drop table procedures could proceed. We don't 
want this.
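
To make the difference between the two behaviours concrete, here is a toy 
model in plain Java. It is a sketch under assumptions, not the HBase code and 
not the conservative patch mentioned below: UnassignModel, its methods and the 
forceClosedOnRejectedExpire flag are all hypothetical names. It only models 
what is described above: before the patch a rejected expire leaves the unassign 
suspended in CLOSING until the SCP finishes; with the patch it jumps straight 
to CLOSED.

```java
// Hypothetical toy model; NOT the HBase UnassignProcedure.
final class UnassignModel {
    enum RegionState { CLOSING, CLOSED }

    private RegionState state = RegionState.CLOSING;
    private boolean suspended = false;

    /**
     * Called when the close RPC to the hosting server fails.
     * expireAccepted: whether the attempt to expire the server was accepted.
     * forceClosedOnRejectedExpire: true models the patch discussed here.
     */
    void remoteCallFailed(boolean expireAccepted, boolean forceClosedOnRejectedExpire) {
        if (!expireAccepted && forceClosedOnRejectedExpire) {
            // Patched behaviour: mark CLOSED at once, so follow-on
            // procedures may fire while the SCP is still running.
            state = RegionState.CLOSED;
            return;
        }
        // Pre-patch behaviour: stay in CLOSING and wait to be woken.
        suspended = true;
    }

    /** Called when crash processing for the hosting server has finished. */
    void serverCrashProcessed() {
        suspended = false;
        state = RegionState.CLOSED;
    }

    /** Whether a subsequent disable/drop table step may proceed. */
    boolean nextProcedureMayRun() {
        return state == RegionState.CLOSED && !suspended;
    }
}
```

In this model, the pre-patch path only lets the next procedure run after 
serverCrashProcessed(), while the patched path lets it run immediately, which 
is the ordering that allowed the table to be disabled and deleted ahead of the 
SCP assign.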

Let me make a more conservative patch. Let me do it in a new issue so I can do 
clean messaging rather than this mess that comes of research. Will open a patch 
after I do more study.

> TestRSGroups is flakey
> ----------------------
>
>                 Key: HBASE-20137
>                 URL: https://issues.apache.org/jira/browse/HBASE-20137
>             Project: HBase
>          Issue Type: Bug
>          Components: flakey
>    Affects Versions: 2.0.0-beta-2
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.0.0
>
>         Attachments: HBASE-20137.branch-2.001.patch, 
> HBASE-20137.branch-2.002.patch, HBASE-20137.branch-2.003.patch, 
> HBASE-20137.branch-2.003.patch
>
>
> It was the single test that failed the hbase-2 nightlies in #440 at the 
> hadoop2 stage.
> The failure manifests as a timeout. It actually has an interesting cause 
> calling into question some of the clauses in 
> UnassignProcedure#remoteCallFailed.
> We are running a disabletable concurrent with a shutdown. pid=309 is the 
> disable. pid=311 is the interesting one. The below is a little hard to read 
> -- the exception 'message' is the current procedure as a String... hard 
> to parse, fixing -- but we are trying to unassign as part of the 
> disabletable. Our RPC fails because the server we are trying to RPC to is 
> currently being processed as crashed (pid=308 is a servercrashprocedure for 
> this server). As part of the processing of the failed RPC we will expire the 
> server -- if we can't RPC to it, it must be gone. The current procedure is 
> then suspended until it gets woken up by the servercrashprocedure triggered 
> by the expire.... only in this case we are shutting down so the expire is 
> ignored... The current procedure is left in its suspend state. This prevents 
> the Master going down. So we time out.
> 2018-03-05 11:29:22,507 INFO  [PEWorker-13] 
> assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] 
> assignment.RegionTransitionProcedure(187): Remote call failed pid=311, 
> ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] 
> assignment.UnassignProcedure(276): Expiring server pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524, 
> exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
>  pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
> UnassignProcedure table=Group_ns:testKillRS, 
> region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] master.ServerManager(580): 
> Expiration of 1cfd208ff882,40584,1520249102524 but server shutdown already in 
> progress
> I need to cater for the case where the expire server is rejected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
