[ https://issues.apache.org/jira/browse/HBASE-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119138#comment-17119138 ]
Michael Stack commented on HBASE-22287:
---------------------------------------
Here are logs showing retries 525 and 526, only 100ms apart, from the
trace-level log attached to HBASE-22041:
{code}
2020-05-21 17:29:49,267 TRACE [RSProcedureDispatcher-pool3-t44]
procedure.RSProcedureDispatcher: Building request with operations count=1
2020-05-21 17:29:49,268 DEBUG [RSProcedureDispatcher-pool3-t44]
ipc.AbstractRpcClient: Not trying to connect to
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 this server is
in the failed servers list
2020-05-21 17:29:49,268 TRACE [RSProcedureDispatcher-pool3-t44]
ipc.AbstractRpcClient: Call: ExecuteProcedures, callTime: 0ms
2020-05-21 17:29:49,268 DEBUG [RSProcedureDispatcher-pool3-t44]
procedure.RSProcedureDispatcher: request to
regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059 failed, try=525
org.apache.hadoop.hbase.ipc.FailedServerException: Call to
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 failed on local
exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in
the failed servers list: regionserver-2.hbase.hbase.svc.cluster.local/
10.128.14.39:16020
at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:220)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:392)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:97)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:423)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:419)
at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:117)
at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:132)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:436)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:330)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:97)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:585)
at
org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$BlockingStub.executeProcedures(AdminProtos.java:31006)
at
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.sendRequest(RSProcedureDispatcher.java:349)
at
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.run(RSProcedureDispatcher.java:314)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hbase.ipc.FailedServerException: This server is
in the failed servers list:
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.getConnection(AbstractRpcClient.java:354)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:433)
... 9 more
2020-05-21 17:29:49,268 WARN [RSProcedureDispatcher-pool3-t44]
procedure.RSProcedureDispatcher: request to server
regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059 failed due to
org.apache.hadoop.hbase.ipc.FailedServerException: Call to
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 failed on
local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server
is in the failed servers list:
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020, try=525,
retrying...
2020-05-21 17:29:49,368 TRACE [RSProcedureDispatcher-pool3-t45]
procedure.RSProcedureDispatcher: Building request with operations count=1
2020-05-21 17:29:49,369 DEBUG [RSProcedureDispatcher-pool3-t45]
ipc.AbstractRpcClient: Not trying to connect to
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 this server is
in the failed servers list
2020-05-21 17:29:49,369 TRACE [RSProcedureDispatcher-pool3-t45]
ipc.AbstractRpcClient: Call: ExecuteProcedures, callTime: 1ms
2020-05-21 17:29:49,369 DEBUG [RSProcedureDispatcher-pool3-t45]
procedure.RSProcedureDispatcher: request to
regionserver-2.hbase.hbase.svc.cluster.local,16020,1590082132059 failed, try=526
org.apache.hadoop.hbase.ipc.FailedServerException: Call to
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020 failed on local
exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in
the failed servers list: regionserver-2.hbase.hbase.svc.cluster.local/
10.128.14.39:16020
at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:220)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:392)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:97)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:423)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:419)
at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:117)
at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:132)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:436)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:330)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:97)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:585)
at
org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$BlockingStub.executeProcedures(AdminProtos.java:31006)
at
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.sendRequest(RSProcedureDispatcher.java:349)
at
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$ExecuteProceduresRemoteCall.run(RSProcedureDispatcher.java:314)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hbase.ipc.FailedServerException: This server is
in the failed servers list:
regionserver-2.hbase.hbase.svc.cluster.local/10.128.14.39:16020
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.getConnection(AbstractRpcClient.java:354)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:433)
... 9 more
{code}
The dispatcher should at least back off between retries.
What should happen in this case is that a ServerCrashProcedure gets scheduled
for the failed server and cleans up this outstanding RPC attempt.
In HBASE-22041 the SCP was queued and all the cleanup ran, but the DNS cache
meant we had the wrong IP for the new server, so the connect attempts could
not succeed.
We shouldn't fill the logs even in this case.
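To make the backoff and log-volume points concrete, here is a rough sketch of what a capped exponential backoff with jitter, plus throttled logging, could look like. This is not the RSProcedureDispatcher code and not a proposed patch: the class name, method names, and the thresholds (first 5 attempts, then every 100th) are all hypothetical, chosen only for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper for illustration only; not part of HBase. */
public final class RetryBackoffSketch {
  private RetryBackoffSketch() {}

  /** Returns baseMillis * 2^attempt, capped at maxMillis (no jitter). */
  public static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
    if (attempt < 0 || baseMillis < 0 || maxMillis < 0) {
      throw new IllegalArgumentException("arguments must be non-negative");
    }
    // Clamp the exponent so large attempt counts (e.g. try=26603) stay safe.
    int shift = Math.min(attempt, 30);
    // If the shifted value would exceed the cap, return the cap directly.
    // This check also guarantees the shift below cannot overflow.
    if (baseMillis > (maxMillis >> shift)) {
      return maxMillis;
    }
    return baseMillis << shift;
  }

  /** Picks a random delay in [delayMillis / 2, delayMillis] to spread out retries. */
  public static long withJitter(long delayMillis) {
    return ThreadLocalRandom.current().nextLong(delayMillis / 2, delayMillis + 1);
  }

  /** Logs the first few attempts, then only every 100th, to avoid filling the logs. */
  public static boolean shouldLog(int attempt) {
    return attempt <= 5 || attempt % 100 == 0;
  }

  /** A separate cap on the number of retries, as suggested in the issue description. */
  public static boolean shouldGiveUp(int attempt, int maxAttempts) {
    return attempt >= maxAttempts;
  }
}
```

With a 100ms base and a 60s cap (both assumed values), try=525 would wait up to 60s instead of 100ms, and under this throttle it would not be logged at all; try=600 would be.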
> infinite retries on failed server in RSProcedureDispatcher
> ----------------------------------------------------------
>
> Key: HBASE-22287
> URL: https://issues.apache.org/jira/browse/HBASE-22287
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Priority: Major
>
> We observed this recently on a cluster. I'm still investigating the root
> cause, but it seems the retries should have special handling for this
> exception and, separately, probably a cap on the number of retries.
> {noformat}
> 2019-04-20 04:24:27,093 WARN [RSProcedureDispatcher-pool4-t1285]
> procedure.RSProcedureDispatcher: request to server ,17020,1555742560432
> failed due to java.io.IOException: Call to :17020 failed on local exception:
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: :17020, try=26603, retrying...
> {noformat}
> The corresponding worker is stuck.