[
https://issues.apache.org/jira/browse/HBASE-20137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389668#comment-16389668
]
stack commented on HBASE-20137:
-------------------------------
Status: Looking at a failure from last night [1]. It is giving me pause on the
solution provided here. Moving the procedure's state from CLOSING to CLOSED
could be dangerous if the procedure is part of a compound of procedures; i.e.
the next procedure will think it is fine to fire, which may not be appropriate
if a ServerCrashProcedure is ongoing. Will be back...
[1]
https://builds.apache.org/job/HBase%20Nightly/job/branch-2/452/testReport/junit/org.apache.hadoop.hbase.regionserver/TestSplitTransactionOnCluster/org_apache_hadoop_hbase_regionserver_TestSplitTransactionOnCluster/
> TestRSGroups is flakey
> ----------------------
>
> Key: HBASE-20137
> URL: https://issues.apache.org/jira/browse/HBASE-20137
> Project: HBase
> Issue Type: Bug
> Components: flakey
> Affects Versions: 2.0.0-beta-2
> Reporter: stack
> Assignee: stack
> Priority: Major
> Fix For: 2.0.0
>
> Attachments: HBASE-20137.branch-2.001.patch,
> HBASE-20137.branch-2.002.patch, HBASE-20137.branch-2.003.patch,
> HBASE-20137.branch-2.003.patch
>
>
> It was the single test that failed the hbase-2 nightlies in #440 at the
> hadoop2 stage.
> The failure manifests as a timeout. It actually has an interesting cause
> calling into question some of the clauses in
> UnassignProcedure#remoteCallFailed.
> We are running a disabletable concurrent with a shutdown. pid=309 is the
> disable. pid=311 is the interesting one. The below is a little hard to read
> -- the exception 'message' is the current procedure as a String... hard
> to parse, fixing -- but we are trying to unassign as part of the
> disabletable. Our RPC fails because the server we are trying to RPC to is
> currently being processed as crashed (pid=308 is a servercrashprocedure for
> this server). As part of the processing of the failed RPC we will expire the
> server -- if we can't RPC to it, it must be gone. The current procedure is
> then suspended until it gets woken up by the servercrashprocedure triggered
> by the expire.... only in this case we are shutting down so the expire is
> ignored... The current procedure is left in its suspended state. This prevents
> the Master going down. So we time out.
> 2018-03-05 11:29:22,507 INFO [PEWorker-13]
> assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING,
> location=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN [PEWorker-13]
> assignment.RegionTransitionProcedure(187): Remote call failed pid=311,
> ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING,
> location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN [PEWorker-13]
> assignment.UnassignProcedure(276): Expiring server pid=311, ppid=309,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING,
> location=1cfd208ff882,40584,1520249102524,
> exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
> pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
> UnassignProcedure table=Group_ns:testKillRS,
> region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN [PEWorker-13] master.ServerManager(580):
> Expiration of 1cfd208ff882,40584,1520249102524 but server shutdown already in
> progress
> I need to cater for the case where the server expire is rejected.
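
The hang described above can be sketched as a toy model. This is not HBase
code: ToyServerManager, Outcome, and ExpireRejectedSketch are hypothetical
names invented for illustration, and the boolean return from expireServer is
an assumption about how a rejected expire might be signalled. The point is
only that the caller has to look at whether the expire was accepted before it
suspends. What the procedure should do on the rejected branch is still open
in this thread, so the sketch just flags that branch.

```java
// Toy stand-in for the master's server manager (hypothetical, not the HBase API).
final class ToyServerManager {
    private final boolean shutdownInProgress;

    ToyServerManager(boolean shutdownInProgress) {
        this.shutdownInProgress = shutdownInProgress;
    }

    // Assumption: returns true when the expire is accepted (a crash procedure
    // will run and can later wake waiters), false when it is rejected.
    boolean expireServer(String serverName) {
        return !shutdownInProgress;
    }
}

// Hypothetical outcomes for the procedure after a failed remote call.
enum Outcome {
    // Safe to suspend: an accepted expire means something will wake us up.
    SUSPEND_AND_WAIT,
    // Expire rejected: nothing will wake us, so suspending would hang forever.
    NEEDS_OTHER_HANDLING
}

final class ExpireRejectedSketch {
    // Decide what to do after the RPC to the region server failed.
    static Outcome onRemoteCallFailed(ToyServerManager serverManager, String serverName) {
        if (serverManager.expireServer(serverName)) {
            return Outcome.SUSPEND_AND_WAIT;
        }
        // The failure in this issue: the expire was ignored, but the
        // procedure suspended anyway and was never woken.
        return Outcome.NEEDS_OTHER_HANDLING;
    }

    public static void main(String[] args) {
        String server = "1cfd208ff882,40584,1520249102524";
        System.out.println(onRemoteCallFailed(new ToyServerManager(false), server));
        System.out.println(onRemoteCallFailed(new ToyServerManager(true), server));
    }
}
```

With the expire rejected, the second call does not return the suspend
outcome; that is the branch the timeout in the logs above corresponds to.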