Viraj Jasani created HBASE-27955:
------------------------------------
Summary: RefreshPeerProcedure should be resilient to replication
endpoint failures
Key: HBASE-27955
URL: https://issues.apache.org/jira/browse/HBASE-27955
Project: HBase
Issue Type: Improvement
Reporter: Viraj Jasani
UpdatePeerConfigProcedure gets stuck when RefreshPeerProcedure fails. The
only way to move forward is to either restart the active master or bypass
the stuck procedure.
For instance,
{code:java}
2023-06-26 17:22:08,375 WARN [,queue=24,port=61000] replication.RefreshPeerProcedure - Refresh peer core1.hbase1a_aws.prod5.uswest2 for UPDATE_CONFIG on {host},{port},1687053857180 failed
java.lang.NullPointerException via {host},{port},1687053857180:java.lang.NullPointerException:
    at org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
    at org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
    at org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
Caused by: java.lang.NullPointerException:
    at xyz(Abc.java:89) <========= replication endpoint failure example
    at xyz(Abc.java:79) <========= replication endpoint failure example
    at org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at org.apache.hadoop.hbase.replication.ReplicationPeerImpl.setPeerConfig(ReplicationPeerImpl.java:63)
    at org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.updatePeerConfig(PeerProcedureHandlerImpl.java:131)
    at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:70)
    at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
    at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}
RefreshPeerProcedure should report this failure to its parent procedure so
that the parent can roll back instead of getting stuck.
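
To illustrate the requested behavior, here is a minimal sketch. It is not HBase code: RefreshPeerSketch, Outcome, refreshPeer and updatePeerConfig are hypothetical names, and a plain Consumer<String> stands in for a replication endpoint that reacts to a peer config change. The idea is only that the child step turns an endpoint exception into a reported failure, and the parent step reacts by restoring the previous config rather than waiting forever.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not the HBase procedure framework.
final class RefreshPeerSketch {

  enum Outcome { SUCCESS, ROLLED_BACK }

  // Child step: notify each listener of the config. An exception thrown by a
  // listener (e.g. a replication endpoint) is returned to the caller instead
  // of being allowed to escape. Returns null when every listener succeeded.
  static RuntimeException refreshPeer(String config, List<Consumer<String>> listeners) {
    for (Consumer<String> listener : listeners) {
      try {
        listener.accept(config);
      } catch (RuntimeException e) {
        return e;
      }
    }
    return null;
  }

  // Parent step: apply the new config; if the child reports a failure,
  // re-apply the old config and finish as ROLLED_BACK.
  static Outcome updatePeerConfig(String oldConfig, String newConfig,
      List<Consumer<String>> listeners) {
    RuntimeException failure = refreshPeer(newConfig, listeners);
    if (failure == null) {
      return Outcome.SUCCESS;
    }
    // Assumption for this sketch: a failure during rollback is ignored.
    refreshPeer(oldConfig, listeners);
    return Outcome.ROLLED_BACK;
  }
}
```

In this sketch the parent always reaches a terminal outcome, so no master restart or procedure bypass would be needed after an endpoint failure.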