[ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000541#comment-17000541
 ] 

Michael Stack commented on HBASE-23597:
---------------------------------------

We check queued RPC size and send CallQueueTooBigException if in excess even if 
high-priority. I've been playing w/ trying to let high priority calls in 
anyways even when in excess of size limits... But then there are other checks 
deeper in the RPC handling that look at count of RPCs outstanding that cause us 
fail dispatch. These are a bit harder to figure...  Our queuing needs a bit of 
study. The CoDel stuff needs another look. 

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-23597
>                 URL: https://issues.apache.org/jira/browse/HBASE-23597
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 2.2.2
>            Reporter: Lijin Bin
>            Priority: Major
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=60000] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=60000, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
>  row 'hbase:rsgroup,,99999999999999' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, hostname=100.107.165.61,60020,1576553057082, 
> seqNum=-1
> 2019-12-18 16:33:04,715 ERROR [PEWorker-15] master.HMaster: Master server 
> abort: loaded coprocessors are: 
> [org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint]
> 2019-12-18 16:33:04,715 ERROR [PEWorker-15] master.HMaster: ***** ABORTING 
> master 100.107.176.225,60000,1576656778460: FAILED persisting 
> region=38d18fd824890c80cff972cbf2e4c174 state=OPENING *****
> java.net.SocketTimeoutException: callTimeout=1200000, callDuration=1286005: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2772)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42188)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
>  row 'table1w,user22322,1575810347321.38d18fd824890c80cff972cbf2e4c174.' on 
> table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=100.107.165.61,60020,1576553057082, seqNum=-1
>         at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:159)
>         at org.apache.hadoop.hbase.client.HTable.put(HTable.java:540)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:209)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateUserRegionLocation(RegionStateStore.java:203)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:141)
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.transitStateAndUpdate(AssignmentManager.java:1632)
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionOpening(AssignmentManager.java:1647)
>         at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.openRegion(TransitRegionStateProcedure.java:207)
>         at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.executeFromState(TransitRegionStateProcedure.java:339)
>         at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.executeFromState(TransitRegionStateProcedure.java:102)
>         at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:194)
>         at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.execute(TransitRegionStateProcedure.java:319)
>         at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.execute(TransitRegionStateProcedure.java:102)
>         at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
> Caused by: org.apache.hadoop.hbase.NotServingRegionException: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2772)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42188)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
>         at sun.reflect.GeneratedConstructorAccessor14.newInstance(Unknown 
> Source)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:99)
>         at 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:89)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:282)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:269)
>         at 
> org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:129)
>         at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
>         ... 17 more
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException):
>  org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2772)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42188)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:389)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:97)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:423)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:419)
>         at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
>         at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
>         at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:162)
>         at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:192)
>         at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
> 2019-12-18 16:33:04,728 DEBUG [PEWorker-15] 
> procedure.MasterProcedureScheduler: Remove 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=0) from run queue because: queue is empty after polling out pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:33:04,728 DEBUG [PEWorker-15] 
> procedure.MasterProcedureScheduler: Remove 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=true (69619) 
> sharedLock=0 size=0) from run queue because: pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true held 
> exclusive lock
> 2019-12-18 16:33:06,353 INFO  [PEWorker-13] 
> assignment.TransitRegionStateProcedure: Starting pid=69626, ppid=69619, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.22,60020,1576656877749; 
> loc=100.107.165.22,60020,1576656877749
> 2019-12-18 16:33:06,353 INFO  [PEWorker-13] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 
> 100.107.165.22,60020,1576656877749
> {code}
> Looks at this logs, we can see the ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082(carry meta) queued for quit a while 
> blocked by other procedure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to