[jira] [Comment Edited] (HBASE-27955) RefreshPeerProcedure should be resilient to replication endpoint failures

Viraj Jasani (Jira) Thu, 29 Jun 2023 14:33:04 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-27955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738226#comment-17738226
 ]


Viraj Jasani edited comment on HBASE-27955 at 6/29/23 9:31 PM:
---------------------------------------------------------------

That is correct, the NPE is a code bug in the custom replication endpoint. 
However, the point I am trying to make is: as soon as this NPE gets reported, 
RefreshPeerProcedure gets completed but not rolled back (rollback is not 
supported). The next step in the parent procedure, i.e. 
POST_PEER_MODIFICATION, then stays stuck and never even gets executed. The 
only clue I have is that the previous step of the procedure had the above NPE 
reported and got completed (the succ flag is set to false):
{code:java}
@Override
protected void complete(MasterProcedureEnv env, Throwable error) {
  if (error != null) {
    LOG.warn("Refresh peer {} for {} on {} failed", peerId, type, targetServer, error);
    this.succ = false;
  } else {
    LOG.info("Refresh peer {} for {} on {} suceeded", peerId, type, targetServer);
    this.succ = true;
  }
}
{code}
Thread dumps showed nothing that could indicate why POST_PEER_MODIFICATION 
was stuck. There were no INFO logs from the POST_PEER_MODIFICATION step 
execution either.

 

Hence, if we could introduce rollback in RefreshPeerProcedure, that would at 
least help complete the procedure with rollback rather than letting it stay 
stuck at the next step (POST_PEER_MODIFICATION).
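
A minimal standalone sketch of that idea follows. It is not HBase code: 
PeerRefreshSketch, RefreshStep, UpdateConfigParent and Outcome are hypothetical 
names invented for illustration, and the real procedure framework has more 
states and persistence than shown here. The only point is that the parent 
checks the failure recorded by the refresh step and rolls back instead of 
moving on to POST_PEER_MODIFICATION.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not HBase code: every class and method name is invented.
class PeerRefreshSketch {

  enum Outcome { COMPLETED, ROLLED_BACK }

  /** Stand-in for the child step: it keeps the remote error instead of only logging it. */
  static final class RefreshStep {
    private Throwable failure; // null means the refresh succeeded

    void complete(Throwable error) {
      this.failure = error;
    }

    boolean failed() {
      return failure != null;
    }
  }

  /** Stand-in for the parent procedure that owns the refresh step. */
  static final class UpdateConfigParent {
    final List<String> steps = new ArrayList<>();

    Outcome execute(RefreshStep refresh, Throwable remoteError) {
      steps.add("REFRESH_PEER");
      refresh.complete(remoteError);
      if (refresh.failed()) {
        // Finish with a rollback rather than waiting on the next step.
        steps.add("ROLLBACK");
        return Outcome.ROLLED_BACK;
      }
      steps.add("POST_PEER_MODIFICATION");
      return Outcome.COMPLETED;
    }
  }

  /** Runs the parent once, optionally simulating a failing replication endpoint. */
  static Outcome run(boolean endpointThrows) {
    Throwable error = endpointThrows ? new NullPointerException("endpoint bug") : null;
    return new UpdateConfigParent().execute(new RefreshStep(), error);
  }

  public static void main(String[] args) {
    System.out.println(run(true));
    System.out.println(run(false));
  }
}
```

With a simulated endpoint failure, run(true) ends in ROLLED_BACK; without one, 
run(false) reaches POST_PEER_MODIFICATION and ends in COMPLETED.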


was (Author: vjasani):
That is correct, the NPE is a code bug in the custom replication endpoint. 
However, the point I am trying to make is: as soon as this NPE gets reported, 
RefreshPeerProcedure gets completed but not rolled back (rollback is not 
supported). The next step in the parent procedure, i.e. 
POST_PEER_MODIFICATION, then stays stuck and never even gets executed. The 
only clue I have is that the previous step of the procedure had the above NPE 
reported and got completed (the succ flag is set to false):

 
{code:java}
@Override
protected void complete(MasterProcedureEnv env, Throwable error) {
  if (error != null) {
    LOG.warn("Refresh peer {} for {} on {} failed", peerId, type, targetServer, error);
    this.succ = false;
  } else {
    LOG.info("Refresh peer {} for {} on {} suceeded", peerId, type, targetServer);
    this.succ = true;
  }
}
{code}
 

 

Thread dumps showed nothing that could indicate why POST_PEER_MODIFICATION 
was stuck.

 

If we could introduce rollback in RefreshPeerProcedure, that could at least 
help complete the procedure with rollback rather than letting it stay stuck at 
the next step (POST_PEER_MODIFICATION).

> RefreshPeerProcedure should be resilient to replication endpoint failures
> -------------------------------------------------------------------------
>
>                 Key: HBASE-27955
>                 URL: https://issues.apache.org/jira/browse/HBASE-27955
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 2.4.14
>            Reporter: Viraj Jasani
>            Priority: Major
>
> UpdatePeerConfigProcedure gets stuck when we see some failures in 
> RefreshPeerProcedure. The only way to move forward is to either restart the 
> active master or bypass the stuck procedure.
>  
> For instance,
> {code:java}
> 2023-06-26 17:22:08,375 WARN  [,queue=24,port=61000] replication.RefreshPeerProcedure - Refresh peer peer0 for UPDATE_CONFIG on {host},{port},1687053857180 failed
> java.lang.NullPointerException via {host},{port},1687053857180:java.lang.NullPointerException: 
>     at org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
>     at org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
>     at org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
>     at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
> Caused by: java.lang.NullPointerException: 
>     at xyz(Abc.java:89)     <========= replication endpoint failure example
>     at xyz(Abc.java:79)     <========= replication endpoint failure example
>     at org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at org.apache.hadoop.hbase.replication.ReplicationPeerImpl.setPeerConfig(ReplicationPeerImpl.java:63)
>     at org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.updatePeerConfig(PeerProcedureHandlerImpl.java:131)
>     at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:70)
>     at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
>     at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {code}
> RefreshPeerProcedure should support reporting this failure and rollback of 
> the parent procedure.
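
The trace above suggests the NPE escapes from a callback invoked inside a 
forEach in ReplicationPeerImpl.setPeerConfig. One possible reading of 
"resilient" is to isolate each callback so that a failure is collected and 
reported rather than thrown mid-iteration. The sketch below is an assumption 
about that approach, not the actual HBase implementation or the fix adopted for 
this issue; ListenerIsolationSketch and notifyEach are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not HBase code: names are invented for illustration.
class ListenerIsolationSketch {

  /**
   * Calls every listener with the event. A listener that throws does not stop
   * the remaining listeners; its exception is collected and returned so the
   * caller can report the failure.
   */
  static <T> List<RuntimeException> notifyEach(List<Consumer<T>> listeners, T event) {
    List<RuntimeException> failures = new ArrayList<>();
    for (Consumer<T> listener : listeners) {
      try {
        listener.accept(event);
      } catch (RuntimeException e) {
        failures.add(e);
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    List<String> seen = new ArrayList<>();
    List<Consumer<String>> listeners = new ArrayList<>();
    // Simulates a buggy custom endpoint callback.
    listeners.add(cfg -> { throw new NullPointerException("custom endpoint bug"); });
    // A well-behaved callback that should still receive the update.
    listeners.add(seen::add);

    List<RuntimeException> failures = notifyEach(listeners, "new-peer-config");
    System.out.println(failures.size() + " failure(s), delivered to " + seen.size() + " listener(s)");
  }
}
```

The returned list lets the caller decide whether to report the first failure 
back to the master or to log and continue; that policy is outside this sketch.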



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
