David Manning created HBASE-28422:
-------------------------------------

             Summary: SplitWalProcedure will attempt SplitWalRemoteProcedure on the same target RegionServer indefinitely
                 Key: HBASE-28422
                 URL: https://issues.apache.org/jira/browse/HBASE-28422
             Project: HBase
          Issue Type: Bug
          Components: master, proc-v2, wal
    Affects Versions: 2.5.5
            Reporter: David Manning

Similar to HBASE-28050. If HMaster selects a RegionServer for SplitWALRemoteProcedure, it will retry that server for as long as the server is alive. I believe this is because, even though {{RSProcedureDispatcher.ExecuteProceduresRemoteCall.run}} calls {{remoteCallFailed}}, there is no logic after that call to select a new target server.

For {{TransitRegionStateProcedure}}, there is logic to select a new server for opening a region, using {{forceNewPlan}}. SplitWALRemoteProcedure, however, only tries another server if it receives a {{DoNotRetryIOException}} in SplitWALRemoteProcedure#complete:
[https://github.com/apache/hbase/blob/780ff56b3f23e7041ef1b705b7d3d0a53fdd05ae/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/SplitWALRemoteProcedure.java#L104-L110]

If we receive any other IOException, we just retry the target server forever.

Just like in HBASE-28050, a SaslException never leads to retrying the SplitWALRemoteProcedure on a new server, so the ServerCrashProcedure may never finish until the target server for the SplitWALRemoteProcedure is restarted.

The following log is seen repeatedly, always sending to the same host.

{code:java}
2024-01-31 15:59:43,616 WARN [RSProcedureDispatcher-pool-72846] procedure.SplitWALRemoteProcedure - Failed split of hdfs://<ns>/hbase/WALs/<host>,1704984571464-splitting/<host>1704984571464.1706710908543, retry...
java.io.IOException: Call to address=<host> failed on local exception: java.io.IOException: Can not send request because relogin is in progress.
	at sun.reflect.GeneratedConstructorAccessor363.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:239)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:92)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:420)
	at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:114)
	at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:129)
	at org.apache.hadoop.hbase.ipc.NettyRpcConnection.lambda$sendRequest$4(NettyRpcConnection.java:365)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Can not send request because relogin is in progress.
	at org.apache.hadoop.hbase.ipc.NettyRpcConnection.sendRequest0(NettyRpcConnection.java:321)
	at org.apache.hadoop.hbase.ipc.NettyRpcConnection.lambda$sendRequest$4(NettyRpcConnection.java:363)
	... 8 more
{code}
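
The retry decision described in this report can be modeled in a few lines. This is only an illustrative sketch, not HBase code: {{SplitWalRetrySketch}}, {{Decision}}, {{describedBehavior}}, {{boundedBehavior}} and {{maxAttemptsOnTarget}} are hypothetical names, the nested {{DoNotRetryIOException}} is a local stand-in so the example compiles on its own, and the case where the target server dies is left out. The first method mirrors the behavior as reported (only a {{DoNotRetryIOException}} moves the work to another server); the second shows one possible mitigation, assuming a cap on attempts against a single target would be acceptable.

```java
import java.io.IOException;

class SplitWalRetrySketch {

  /** Local stand-in for HBase's DoNotRetryIOException (assumption: only its type matters here). */
  static class DoNotRetryIOException extends IOException {
    DoNotRetryIOException(String msg) {
      super(msg);
    }
  }

  /** Hypothetical outcomes of handling the result of one remote split attempt. */
  enum Decision { DONE, RETRY_SAME_SERVER, PICK_NEW_SERVER }

  /** Behavior as described in the report: only DoNotRetryIOException triggers a new target. */
  static Decision describedBehavior(Throwable error) {
    if (error == null) {
      return Decision.DONE;
    }
    if (error instanceof DoNotRetryIOException) {
      return Decision.PICK_NEW_SERVER;
    }
    // Any other failure (for example the relogin IOException) retries the same server.
    return Decision.RETRY_SAME_SERVER;
  }

  /** Hypothetical mitigation: give up on a target after maxAttemptsOnTarget failed attempts. */
  static Decision boundedBehavior(Throwable error, int attemptsOnTarget, int maxAttemptsOnTarget) {
    Decision decision = describedBehavior(error);
    if (decision == Decision.RETRY_SAME_SERVER && attemptsOnTarget >= maxAttemptsOnTarget) {
      return Decision.PICK_NEW_SERVER;
    }
    return decision;
  }

  public static void main(String[] args) {
    IOException relogin =
        new IOException("Can not send request because relogin is in progress.");
    for (int attempt = 1; attempt <= 4; attempt++) {
      System.out.println("attempt " + attempt
          + ": described=" + describedBehavior(relogin)
          + ", bounded=" + boundedBehavior(relogin, attempt, 3));
    }
  }
}
```

With the described behavior, the relogin IOException yields {{RETRY_SAME_SERVER}} on every attempt, which matches the repeated log above; the bounded variant switches to {{PICK_NEW_SERVER}} once the attempt count reaches the cap.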