[ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000625#comment-17000625
 ] 

Michael Stack commented on HBASE-23597:
---------------------------------------

I also upped the two configs below from theiir default of 10 to 100000.  We 
will fail an RPC dispatch (which will throw a CallQueueTooB..Exception) if the 
count of outstanding RPCs is handler count * the configs below.  In my case it 
is about 1000.... Which should be enough and less the default queue size of 1G 
but hard to tell. Trying it out.

public static final String IPC_SERVER_MAX_CALLQUEUE_LENGTH = 
"hbase.ipc.server.max.callqueue.length";
public static final String IPC_SERVER_PRIORITY_MAX_CALLQUEUE_LENGTH = 
"hbase.ipc.server.priority.max.callqueue.length";

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-23597
>                 URL: https://issues.apache.org/jira/browse/HBASE-23597
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 2.2.2
>            Reporter: Lijin Bin
>            Priority: Major
>         Attachments: priority.rpc.patch
>
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=60000] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=60000, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
>  row 'hbase:rsgroup,,99999999999999' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, hostname=100.107.165.61,60020,1576553057082, 
> seqNum=-1
> 2019-12-18 16:33:04,715 ERROR [PEWorker-15] master.HMaster: Master server 
> abort: loaded coprocessors are: 
> [org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint]
> 2019-12-18 16:33:04,715 ERROR [PEWorker-15] master.HMaster: ***** ABORTING 
> master 100.107.176.225,60000,1576656778460: FAILED persisting 
> region=38d18fd824890c80cff972cbf2e4c174 state=OPENING *****
> java.net.SocketTimeoutException: callTimeout=1200000, callDuration=1286005: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2772)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42188)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
>  row 'table1w,user22322,1575810347321.38d18fd824890c80cff972cbf2e4c174.' on 
> table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=100.107.165.61,60020,1576553057082, seqNum=-1
>         at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:159)
>         at org.apache.hadoop.hbase.client.HTable.put(HTable.java:540)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:209)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateUserRegionLocation(RegionStateStore.java:203)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:141)
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.transitStateAndUpdate(AssignmentManager.java:1632)
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionOpening(AssignmentManager.java:1647)
>         at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.openRegion(TransitRegionStateProcedure.java:207)
>         at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.executeFromState(TransitRegionStateProcedure.java:339)
>         at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.executeFromState(TransitRegionStateProcedure.java:102)
>         at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:194)
>         at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.execute(TransitRegionStateProcedure.java:319)
>         at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.execute(TransitRegionStateProcedure.java:102)
>         at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
> Caused by: org.apache.hadoop.hbase.NotServingRegionException: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2772)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42188)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
>         at sun.reflect.GeneratedConstructorAccessor14.newInstance(Unknown 
> Source)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:99)
>         at 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:89)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:282)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:269)
>         at 
> org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:129)
>         at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
>         ... 17 more
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException):
>  org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2772)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42188)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:389)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:97)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:423)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:419)
>         at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
>         at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
>         at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:162)
>         at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:192)
>         at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
> 2019-12-18 16:33:04,728 DEBUG [PEWorker-15] 
> procedure.MasterProcedureScheduler: Remove 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=0) from run queue because: queue is empty after polling out pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:33:04,728 DEBUG [PEWorker-15] 
> procedure.MasterProcedureScheduler: Remove 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=true (69619) 
> sharedLock=0 size=0) from run queue because: pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true held 
> exclusive lock
> 2019-12-18 16:33:06,353 INFO  [PEWorker-13] 
> assignment.TransitRegionStateProcedure: Starting pid=69626, ppid=69619, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.22,60020,1576656877749; 
> loc=100.107.165.22,60020,1576656877749
> 2019-12-18 16:33:06,353 INFO  [PEWorker-13] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 
> 100.107.165.22,60020,1576656877749
> {code}
> Looks at this logs, we can see the ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082(carry meta) queued for quit a while 
> blocked by other procedure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to