[ 
https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119138#comment-17119138
 ] 

Michael Stack commented on HBASE-22287:
---------------------------------------

Here are logs showing retries 525 and 526, with 100ms between attempts, from the 
trace-level log attached to HBASE-22041:
{code}
 2020-05-21 17:29:49,267 TRACE [RSProcedureDispatcher-pool3-t44] 
procedure.RSProcedureDispatcher: Building request with operations count=1
 2020-05-21 17:29:49,268 DEBUG [RSProcedureDispatcher-pool3-t44] 
ipc.AbstractRpcClient: Not trying to connect to 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 this server is 
in the failed servers list
 2020-05-21 17:29:49,268 TRACE [RSProcedureDispatcher-pool3-t44] 
ipc.AbstractRpcClient: Call: ExecuteProcedures, callTime: 0ms
 2020-05-21 17:29:49,268 DEBUG [RSProcedureDispatcher-pool3-t44] 
procedure.RSProcedureDispatcher: request to 
regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059 failed, try=525
 org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 failed on local 
exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in 
the failed servers list: regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020
   at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
   at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
   at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:220)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:392)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:97)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:423)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:419)
   at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:117)
   at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:132)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:436)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:330)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:97)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:585)
   at 
org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$BlockingStub.executeProcedures(AdminProtos.java:31006)
   at 
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.sendRequest(RSProcedureDispatcher.java:349)
   at 
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.run(RSProcedureDispatcher.java:314)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
 Caused by: org.apache.hadoop.hbase.ipc.FailedServerException: This server is 
in the failed servers list: 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.getConnection(AbstractRpcClient.java:354)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:433)
   ... 9 more
 2020-05-21 17:29:49,268 WARN  [RSProcedureDispatcher-pool3-t44] 
procedure.RSProcedureDispatcher: request to server 
regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059 failed due to 
org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 failed on 
local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server 
is in the failed servers list: 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020, try=525, 
retrying...
 2020-05-21 17:29:49,368 TRACE [RSProcedureDispatcher-pool3-t45] 
procedure.RSProcedureDispatcher: Building request with operations count=1
 2020-05-21 17:29:49,369 DEBUG [RSProcedureDispatcher-pool3-t45] 
ipc.AbstractRpcClient: Not trying to connect to 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 this server is 
in the failed servers list
 2020-05-21 17:29:49,369 TRACE [RSProcedureDispatcher-pool3-t45] 
ipc.AbstractRpcClient: Call: ExecuteProcedures, callTime: 1ms
 2020-05-21 17:29:49,369 DEBUG [RSProcedureDispatcher-pool3-t45] 
procedure.RSProcedureDispatcher: request to 
regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059 failed, try=526
 org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 failed on local 
exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in 
the failed servers list: regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020
   at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
   at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
   at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:220)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:392)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:97)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:423)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:419)
   at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:117)
   at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:132)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:436)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:330)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:97)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:585)
   at 
org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$BlockingStub.executeProcedures(AdminProtos.java:31006)
   at 
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.sendRequest(RSProcedureDispatcher.java:349)
   at 
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.run(RSProcedureDispatcher.java:314)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
 Caused by: org.apache.hadoop.hbase.ipc.FailedServerException: This server is 
in the failed servers list: 
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.getConnection(AbstractRpcClient.java:354)
   at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:433)
   ... 9 more
{code}

We should at least have some backoff between retries.
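
A minimal sketch of the kind of backoff meant here, in Java. The class and 
constant names (RetryBackoff, BASE_DELAY_MS, MAX_DELAY_MS) are hypothetical and 
not part of HBase. The 100ms base matches the gap seen in the logs above; the 
60s cap and the 10% jitter are assumptions.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper, not part of HBase: capped exponential backoff for retries. */
final class RetryBackoff {
  // Matches the 100ms gap between attempts seen in the logs above.
  static final long BASE_DELAY_MS = 100L;
  // Assumed upper bound on the delay; tune for the cluster.
  static final long MAX_DELAY_MS = 60_000L;

  private RetryBackoff() {}

  /** Delay before retry number {@code attempt} (0-based), without jitter. */
  static long delayMillis(int attempt) {
    if (attempt < 0) {
      throw new IllegalArgumentException("attempt must be >= 0");
    }
    // Clamp the shift so BASE_DELAY_MS << shift cannot overflow a long.
    int shift = Math.min(attempt, 20);
    return Math.min(MAX_DELAY_MS, BASE_DELAY_MS << shift);
  }

  /** Same delay plus up to 10% random jitter, so retries do not line up. */
  static long delayMillisWithJitter(int attempt) {
    long delay = delayMillis(attempt);
    return delay + ThreadLocalRandom.current().nextLong(delay / 10 + 1);
  }
}
```

With these assumed values the delay doubles from 100ms until it reaches the 
cap after about ten attempts, so a try=525 as in the logs above would wait 
roughly a minute rather than 100ms.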

What should happen in this case is that a ServerCrashProcedure (SCP) gets 
scheduled for the failed server and cleans up this outstanding RPC attempt. 
In HBASE-22041 the SCP had been queued and all of the cleanup ran, but the DNS 
cache meant we had the wrong IP for the new server, so connection attempts 
could not succeed.

Even in this case, we shouldn't fill the logs.
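
One way to keep the logs quiet during a long run of retries, sketched in Java. 
RetryLogThrottle and its thresholds are hypothetical, not existing HBase code: 
the idea is to log the first few failures in full and then only every 100th 
attempt.

```java
/** Hypothetical helper, not part of HBase: decides which retry attempts get logged. */
final class RetryLogThrottle {
  // Assumed thresholds: log the first 10 attempts, then every 100th.
  static final int ALWAYS_LOG_UNTIL = 10;
  static final int LOG_EVERY = 100;

  private RetryLogThrottle() {}

  /** Returns true if the failure for this attempt number should be logged. */
  static boolean shouldLog(int attempt) {
    return attempt <= ALWAYS_LOG_UNTIL || attempt % LOG_EVERY == 0;
  }
}
```

Under these assumed thresholds, the try=525 and try=526 failures above would 
be skipped and try=600 would be the next one logged.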

> inifinite retries on failed server in RSProcedureDispatcher
> -----------------------------------------------------------
>
>                 Key: HBASE-22287
>                 URL: https://issues.apache.org/jira/browse/HBASE-22287
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> We observed this recently on a cluster. I'm still investigating the root 
> cause, but it seems the retries should have special handling for this 
> exception and, separately, probably a cap on the number of retries.
> {noformat}
> 2019-04-20 04:24:27,093 WARN  [RSProcedureDispatcher-pool4-t1285] 
> procedure.RSProcedureDispatcher: request to server ,17020,1555742560432 
> failed due to java.io.IOException: Call to :17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: :17020, try=26603, retrying...
> {noformat}
> The corresponding worker is stuck



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
