[
https://issues.apache.org/jira/browse/HBASE-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525979#comment-16525979
]
stack commented on HBASE-20796:
-------------------------------
Just saw another instance so might be worth taking this patch back to
branch-2.0.
Problem looks like this....
* ve1037.X.Y.Z.com gets really sick... goes dead, falls off the net.
* Master doesn't realize and keeps assigning it regions though all dispatches
to this server are hanging.
* Eventually its zk ephemeral node timesout and we expire the server, run the
SCP and assign the regions elsewhere.
* Later (3minutes is our rpc timeout), the rpcs timeout and they call
cleanup/handleFailure which 'wakes' finished Procedures. These Procedures have
fallen out of the executor because they completed so the Procedures are woken
up into a limbo, cycling, complaining about being STUCK though their region is
happily assigned.
Here is Master noticing the expiration: 2018-06-27 16:26:24,022 INFO
org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node
deleted, processing expiration [ve1037.X.Y.Z.com,22101,1530130338696]
Here is the failed RPC a few minutes later:
2018-06-27 16:27:28,984 WARN
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher: Failed dispatch
to server=ve1037.X.Y.Z.com,22101,1530130338696 try=0
java.io.IOException: Call to ve1037.X.Y.Z.com/10.10.10.10:22101 failed on local
exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=2054883,
waitTime=180008, rpcTimeout=180000
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:180)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
at org.apache.hadoop.hbase.ipc.Call.setTimeout(Call.java:96)
at
org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:199)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:663)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:738)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:466)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=2054883,
waitTime=180008, rpcTimeout=180000
at
org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)
... 4 more
> STUCK RIT though region successfully assigned
> ---------------------------------------------
>
> Key: HBASE-20796
> URL: https://issues.apache.org/jira/browse/HBASE-20796
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Reporter: stack
> Assignee: stack
> Priority: Major
> Fix For: 3.0.0, 2.1.0, 2.0.2
>
> Attachments: HBASE-20796.branch-2.0.001.patch
>
>
> This is a good one. We keep logging messages like this:
> {code}
> 2018-06-26 12:32:24,859 WARN
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
> Region-In-Transition rit=OPENING,
> location=vd0410.X.Y.com,22101,1529611445046,
> table=IntegrationTestBigLinkedList_20180525080406,
> region=e10b35d49528e2453a04c7038e3393d7
> {code}
> ...though the region is successfully assigned.
> Story:
> * Dispatch an assign 2018-06-26 12:31:27,390 INFO
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Dispatch
> pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
> AssignProcedure table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING,
> location=vd0410.X.Y.Z,22101,1529611445046
> * It gets stuck 2018-06-26 12:32:29,860 WARN
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
> Region-In-Transition rit=OPENING, location=vd0410.X.Y.Z,22101,1529611445046,
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2 (Because the server was killed)
> * We stay STUCK for a while.
> * The Master notices the server as crashed and starts a SCP.
> * SCP kills ongoing assign: 2018-06-26 12:32:54,809 INFO
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=371105
> found RIT pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
> AssignProcedure table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING,
> location=vd0410.X.Y.Z,22101,1529611445046
> * The kill brings on a retry ... 2018-06-26 12:32:54,810 WARN
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote
> call failed pid=370829, ppid=370391,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING,
> location=vd0410.X.Y.Z,22101,1529611445046; exception=ServerCrashProcedure
> pid=371105, server=vd0410.X.Y.Z,22101,1529611445046
> * Which eventually succeeds..... Successfully deployed to new server
> 2018-06-26 12:32:55,429 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=370829,
> ppid=370391, state=SUCCESS; AssignProcedure
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2 in 1mins, 35.379sec
> * But then, it looks like the RPC was ongoing and it broke in following way
> 2018-06-26 12:33:06,378 WARN
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote
> call failed pid=370829, ppid=370391, state=SUCCESS; AssignProcedure
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN,
> location=vc0614.halxg.cloudera.com,22101,1529611443424; exception=Call to
> vd0410.X.Y.Z/10.10.10.10:22101 failed on local exception:
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
> syscall:read(..) failed: Connection reset by peer (Notice how state for
> region is OPEN and 'SUCCESS').
> * Then says 2018-06-26 12:33:06,380 INFO
> org.apache.hadoop.hbase.master.assignment.AssignProcedure: Retry=1 of max=10;
> pid=370829, ppid=370391, state=SUCCESS; AssignProcedure
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN,
> location=vc0614.X.Y.Z,22101,1529611443424
> * And finally... 2018-06-26 12:34:10,727 WARN
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
> Region-In-Transition rit=OFFLINE, location=null,
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2
> Restart of Master got rid of the STUCK complaints.
> This is interesting because the stuck rpc and the successful reassign are all
> riding on the same pid.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)