[ 
https://issues.apache.org/jira/browse/HBASE-29381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huginn updated HBASE-29381:
---------------------------
    Description: 
When a subProcedure has completed but procedure has not yet completed, if a new 
different procedure is submitted during this window, it may lead to the 
repeated submission of the current subProcedure. As a result, duplicated 
subProcedure remain active when procedure has completed, which will cause 
concurrent conflicts.

Here are some logs related to an issue that occurred during taking snapshot.

*Master*
{code:java}
2025-04-29 02:03:24,716 INFO [proc-coordinator-pool-25] procedure.Procedure: 
Starting procedure 'snapshotName__1745863344'

acquire
2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25] procedure.Procedure: 
Procedure 'snapshotName__1745863344' starting 'acquire'
2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25] 
procedure.ZKProcedureCoordinator: Creating acquire 
znode:/hbase/online-snapshot/acquired/snapshotName__1745863344
2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25] 
procedure.ZKProcedureCoordinator: Watching for acquire 
node:/hbase/online-snapshot/acquired/snapshotName__1745863344/rs1
2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25] 
procedure.ZKProcedureCoordinator: Watching for acquire 
node:/hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
2025-04-29 02:03:27,670 DEBUG [zk-event-processor-pool-0] 
procedure.ZKProcedureCoordinator: Node created: 
/hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
2025-04-29 02:03:27,908 DEBUG [zk-event-processor-pool-0] 
procedure.ZKProcedureCoordinator: Node created: 
/hbase/online-snapshot/acquired/snapshotName__1745863344/rs1

reach
2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25] procedure.Procedure: 
Procedure 'snapshotName__1745863344' starting 'in-barrier' execution.
2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25] 
procedure.ZKProcedureCoordinator: Creating reached barrier zk 
node:/hbase/online-snapshot/reached/snapshotName__1745863344
2025-04-29 02:03:30,477 DEBUG [zk-event-processor-pool-0] 
procedure.ZKProcedureCoordinator: Node created: 
/hbase/online-snapshot/reached/snapshotName__1745863344/rs1

2025-04-29 02:03:30,652 DEBUG [zk-event-processor-pool-0] 
procedure.ZKProcedureCoordinator: Node created: 
/hbase/online-snapshot/reached/snapshotName__1745863344/rs2

complete
2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25] procedure.Procedure: 
Procedure 'snapshotName__1745863344' execution completed
2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25] 
procedure.ZKProcedureUtil: Clearing all znodes for procedure 
snapshotName__1745863344including nodes /hbase/online-snapshot/acquired 
/hbase/online-snapshot/reached /hbase/online-snapshot/abort

2025-04-29 02:03:30,906 INFO [MASTER_SNAPSHOT_OPERATIONS-master] 
snapshot.EnabledTableSnapshotHandler: Done waiting - online snapshot for 
snapshotName__1745863344
2025-04-29 02:03:31,750 INFO [MASTER_SNAPSHOT_OPERATIONS-master] 
snapshot.TakeSnapshotHandler: Snapshot snapshotName__1745863344 of table 
Test:table1 completed
 
abort
2025-04-29 02:03:31,186 DEBUG [zk-event-processor-pool-0] 
procedure.ZKProcedureCoordinator: Node created: 
/hbase/online-snapshot/abort/snapshotName__1745863344
2025-04-29 02:03:31,218 DEBUG [zk-event-processor-pool-0] 
procedure.ProcedureCoordinator: abort procedure snapshotName__1745863344
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via 
rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: 
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
{code}

*rs2*
{code:java}
acquire
2025-04-29 02:03:26,069 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
event: /hbase/online-snapshot/acquired
2025-04-29 02:03:26,083 ERROR [zk-event-processor-pool-0] 
procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already 
running. Bailing out
2025-04-29 02:03:26,522 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
event: /hbase/online-snapshot/acquired
2025-04-29 02:03:26,540 ERROR [zk-event-processor-pool-0] 
procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already 
running. Bailing out

reach
2025-04-29 02:03:28,134 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received created 
event:/hbase/online-snapshot/reached/snapshotName__1745863344
2025-04-29 02:03:29,910 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
event: /hbase/online-snapshot/acquired
2025-04-29 02:03:29,927 WARN [zk-event-processor-pool-0] 
procedure.ProcedureMember: A completed old subproc snapshotName__1745863344 is 
still present, removing
2025-04-29 02:03:30,208 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
event: /hbase/online-snapshot/acquired
2025-04-29 02:03:30,215 ERROR [zk-event-processor-pool-0] 
procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already 
running. Bailing out
2025-04-29 02:03:30,610 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
event: /hbase/online-snapshot/acquired
2025-04-29 02:03:30,616 WARN [zk-event-processor-pool-0] 
procedure.ProcedureMember: A completed old subproc snapshotName__1745863344 is 
still present, removing

clean
2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received created 
event:/hbase/online-snapshot/abort/snapshotName__1745863344
2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received procedure abort children changed 
event: /hbase/online-snapshot/abort

abort
2025-04-29 02:03:31,037 ERROR [zk-event-processor-pool-0] 
procedure.ProcedureMember: Propagating foreign exception to subprocedure 
snapshotName__1745863344
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via 
rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: 
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
{code}
 

*Client*
{code:java}
2025-04-29 02:03:36,289 ERROR c.s.d.h.c.CommonUtils [RoutineTaskExecutor-7] 
Error message org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot 
\{ ss=snapshotName__1745863344 table=table1 type=SKIPFLUSH ttl=0 } had an 
error.  Procedure snapshotName__1745863344 \{ waiting=[] done=[] }
        at 
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:402)
        at 
org.apache.hadoop.hbase.master.MasterRpcServices.isSnapshotDone(MasterRpcServices.java:1266)
        at 
org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
        at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:105)
        at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:85)
Caused by: 
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via 
rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: 
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3250)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:718)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:704)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3294)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:1130)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:667)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:637)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:605)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1146)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1249)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1147)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3395)

        at 
org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:82)
        at 
org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:330)
        at 
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:391)
        ... 6 more
Caused by: 
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: 
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
{code}

We can see that during subProcedure has completed but procedure has not yet 
completed, this completed old subProcedure is removed and a duplicated 
subProcedure is submitted. After procedure has completed, the duplicated 
subProcedures is active and generating region manifest in the working 
directory.  When taking snapshot has completed, master cleans out the working 
directory which causes duplicated subProcedure to fail in generating the region 
manifest and throw an exception. In the end,  the snapshot is successfully 
created, but the subProcedure reportes an exception. The root cause of all this 
is that after the members have reached the barrier, the coordinator does not 
clean up the corresponding acquired barrier, which results in the repeated 
submission of subProcedure.

In addition, I find that during the entire process, there is no watch set on 
node /hbase/online-snapshot/acquired/snapshotName__1745863344 , so it is 
unnecessary to create this node to trigger its watch event before cleaning it 
up.

  was:
When a subProcedure has completed but procedure has not yet completed, if a new 
different procedure is submitted during this window, it may lead to the 
repeated submission of the current subProcedure. As a result, duplicated 
subProcedure remain active when procedure has completed, which will cause 
concurrent conflicts.

Here are some logs related to an issue that occurred during taking snapshot.

*Master*
{code:java}
2025-04-29 02:03:24,716 INFO [proc-coordinator-pool-25] procedure.Procedure: 
Starting procedure 'snapshotName__1745863344'

acquire
2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25] procedure.Procedure: 
Procedure 'snapshotName__1745863344' starting 'acquire'
2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25] 
procedure.ZKProcedureCoordinator: Creating acquire 
znode:/shopee-hbase/online-snapshot/acquired/snapshotName__1745863344
2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25] 
procedure.ZKProcedureCoordinator: Watching for acquire 
node:/shopee-hbase/online-snapshot/acquired/snapshotName__1745863344/rs1
2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25] 
procedure.ZKProcedureCoordinator: Watching for acquire 
node:/shopee-hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
2025-04-29 02:03:27,670 DEBUG [zk-event-processor-pool-0] 
procedure.ZKProcedureCoordinator: Node created: 
/shopee-hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
2025-04-29 02:03:27,908 DEBUG [zk-event-processor-pool-0] 
procedure.ZKProcedureCoordinator: Node created: 
/shopee-hbase/online-snapshot/acquired/snapshotName__1745863344/rs1

reach
2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25] procedure.Procedure: 
Procedure 'snapshotName__1745863344' starting 'in-barrier' execution.
2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25] 
procedure.ZKProcedureCoordinator: Creating reached barrier zk 
node:/shopee-hbase/online-snapshot/reached/snapshotName__1745863344
2025-04-29 02:03:30,477 DEBUG [zk-event-processor-pool-0] 
procedure.ZKProcedureCoordinator: Node created: 
/shopee-hbase/online-snapshot/reached/snapshotName__1745863344/rs1

2025-04-29 02:03:30,652 DEBUG [zk-event-processor-pool-0] 
procedure.ZKProcedureCoordinator: Node created: 
/shopee-hbase/online-snapshot/reached/snapshotName__1745863344/rs2

complete
2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25] procedure.Procedure: 
Procedure 'snapshotName__1745863344' execution completed
2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25] 
procedure.ZKProcedureUtil: Clearing all znodes for procedure 
snapshotName__1745863344including nodes /shopee-hbase/online-snapshot/acquired 
/shopee-hbase/online-snapshot/reached /shopee-hbase/online-snapshot/abort

2025-04-29 02:03:30,906 INFO [MASTER_SNAPSHOT_OPERATIONS-master] 
snapshot.EnabledTableSnapshotHandler: Done waiting - online snapshot for 
snapshotName__1745863344
2025-04-29 02:03:31,750 INFO [MASTER_SNAPSHOT_OPERATIONS-master] 
snapshot.TakeSnapshotHandler: Snapshot snapshotName__1745863344 of table 
Test:table1 completed
 
abort
2025-04-29 02:03:31,186 DEBUG [zk-event-processor-pool-0] 
procedure.ZKProcedureCoordinator: Node created: 
/shopee-hbase/online-snapshot/abort/snapshotName__1745863344
2025-04-29 02:03:31,218 DEBUG [zk-event-processor-pool-0] 
procedure.ProcedureCoordinator: abort procedure snapshotName__1745863344
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via 
rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: 
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
{code}

*rs2*
{code:java}
acquire
2025-04-29 02:03:26,069 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
event: /shopee-hbase/online-snapshot/acquired
2025-04-29 02:03:26,083 ERROR [zk-event-processor-pool-0] 
procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already 
running. Bailing out
2025-04-29 02:03:26,522 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
event: /shopee-hbase/online-snapshot/acquired
2025-04-29 02:03:26,540 ERROR [zk-event-processor-pool-0] 
procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already 
running. Bailing out

reach
2025-04-29 02:03:28,134 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received created 
event:/shopee-hbase/online-snapshot/reached/snapshotName__1745863344
2025-04-29 02:03:29,910 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
event: /shopee-hbase/online-snapshot/acquired
2025-04-29 02:03:29,927 WARN [zk-event-processor-pool-0] 
procedure.ProcedureMember: A completed old subproc snapshotName__1745863344 is 
still present, removing
2025-04-29 02:03:30,208 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
event: /shopee-hbase/online-snapshot/acquired
2025-04-29 02:03:30,215 ERROR [zk-event-processor-pool-0] 
procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already 
running. Bailing out
2025-04-29 02:03:30,610 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
event: /shopee-hbase/online-snapshot/acquired
2025-04-29 02:03:30,616 WARN [zk-event-processor-pool-0] 
procedure.ProcedureMember: A completed old subproc snapshotName__1745863344 is 
still present, removing

clean
2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received created 
event:/shopee-hbase/online-snapshot/abort/snapshotName__1745863344
2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0] 
procedure.ZKProcedureMemberRpcs: Received procedure abort children changed 
event: /shopee-hbase/online-snapshot/abort

abort
2025-04-29 02:03:31,037 ERROR [zk-event-processor-pool-0] 
procedure.ProcedureMember: Propagating foreign exception to subprocedure 
snapshotName__1745863344
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via 
rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: 
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
{code}
 

*Client*
{code:java}
2025-04-29 02:03:36,289 ERROR c.s.d.h.c.CommonUtils [RoutineTaskExecutor-7] 
Error message org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot 
\{ ss=snapshotName__1745863344 table=table1 type=SKIPFLUSH ttl=0 } had an 
error.  Procedure snapshotName__1745863344 \{ waiting=[] done=[] }
        at 
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:402)
        at 
org.apache.hadoop.hbase.master.MasterRpcServices.isSnapshotDone(MasterRpcServices.java:1266)
        at 
org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
        at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:105)
        at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:85)
Caused by: 
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via 
rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: 
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3250)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:718)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:704)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3294)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:1130)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:667)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:637)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:605)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1146)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1249)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1147)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3395)

        at 
org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:82)
        at 
org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:330)
        at 
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:391)
        ... 6 more
Caused by: 
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: 
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
{code}

We can see that during subProcedure has completed but procedure has not yet 
completed, this completed old subProcedure is removed and a duplicated 
subProcedure is submitted. After procedure has completed, the duplicated 
subProcedures is active and generating region manifest in the working 
directory.  When taking snapshot has completed, master cleans out the working 
directory which causes duplicated subProcedure to fail in generating the region 
manifest and throw an exception. In the end,  the snapshot is successfully 
created, but the subProcedure reportes an exception. The root cause of all this 
is that after the members have reached the barrier, the coordinator does not 
clean up the corresponding acquired barrier, which results in the repeated 
submission of subProcedure.

In addition, I find that during the entire process, there is no watch set on 
node /shopee-hbase/online-snapshot/acquired/snapshotName__1745863344 , so it is 
unnecessary to create this node to trigger its watch event before cleaning it 
up.


> Repeated submission of subProcedures leads to concurrency issues
> ----------------------------------------------------------------
>
>                 Key: HBASE-29381
>                 URL: https://issues.apache.org/jira/browse/HBASE-29381
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>    Affects Versions: 2.4.11
>            Reporter: huginn
>            Priority: Minor
>              Labels: pull-request-available
>
> When a subProcedure has completed but procedure has not yet completed, if a 
> new different procedure is submitted during this window, it may lead to the 
> repeated submission of the current subProcedure. As a result, duplicated 
> subProcedure remain active when procedure has completed, which will cause 
> concurrent conflicts.
> Here are some logs related to an issue that occurred during taking snapshot.
> *Master*
> {code:java}
> 2025-04-29 02:03:24,716 INFO [proc-coordinator-pool-25] procedure.Procedure: 
> Starting procedure 'snapshotName__1745863344'
> acquire
> 2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25] procedure.Procedure: 
> Procedure 'snapshotName__1745863344' starting 'acquire'
> 2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25] 
> procedure.ZKProcedureCoordinator: Creating acquire 
> znode:/hbase/online-snapshot/acquired/snapshotName__1745863344
> 2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25] 
> procedure.ZKProcedureCoordinator: Watching for acquire 
> node:/hbase/online-snapshot/acquired/snapshotName__1745863344/rs1
> 2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25] 
> procedure.ZKProcedureCoordinator: Watching for acquire 
> node:/hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
> 2025-04-29 02:03:27,670 DEBUG [zk-event-processor-pool-0] 
> procedure.ZKProcedureCoordinator: Node created: 
> /hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
> 2025-04-29 02:03:27,908 DEBUG [zk-event-processor-pool-0] 
> procedure.ZKProcedureCoordinator: Node created: 
> /hbase/online-snapshot/acquired/snapshotName__1745863344/rs1
> reach
> 2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25] procedure.Procedure: 
> Procedure 'snapshotName__1745863344' starting 'in-barrier' execution.
> 2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25] 
> procedure.ZKProcedureCoordinator: Creating reached barrier zk 
> node:/hbase/online-snapshot/reached/snapshotName__1745863344
> 2025-04-29 02:03:30,477 DEBUG [zk-event-processor-pool-0] 
> procedure.ZKProcedureCoordinator: Node created: 
> /hbase/online-snapshot/reached/snapshotName__1745863344/rs1
> 2025-04-29 02:03:30,652 DEBUG [zk-event-processor-pool-0] 
> procedure.ZKProcedureCoordinator: Node created: 
> /hbase/online-snapshot/reached/snapshotName__1745863344/rs2
> complete
> 2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25] procedure.Procedure: 
> Procedure 'snapshotName__1745863344' execution completed
> 2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25] 
> procedure.ZKProcedureUtil: Clearing all znodes for procedure 
> snapshotName__1745863344including nodes /hbase/online-snapshot/acquired 
> /hbase/online-snapshot/reached /hbase/online-snapshot/abort
> 2025-04-29 02:03:30,906 INFO [MASTER_SNAPSHOT_OPERATIONS-master] 
> snapshot.EnabledTableSnapshotHandler: Done waiting - online snapshot for 
> snapshotName__1745863344
> 2025-04-29 02:03:31,750 INFO [MASTER_SNAPSHOT_OPERATIONS-master] 
> snapshot.TakeSnapshotHandler: Snapshot snapshotName__1745863344 of table 
> Test:table1 completed
>  
> abort
> 2025-04-29 02:03:31,186 DEBUG [zk-event-processor-pool-0] 
> procedure.ZKProcedureCoordinator: Node created: 
> /hbase/online-snapshot/abort/snapshotName__1745863344
> 2025-04-29 02:03:31,218 DEBUG [zk-event-processor-pool-0] 
> procedure.ProcedureCoordinator: abort procedure snapshotName__1745863344
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via 
> rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
> {code}
> *rs2*
> {code:java}
> acquire
> 2025-04-29 02:03:26,069 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:26,083 ERROR [zk-event-processor-pool-0] 
> procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already 
> running. Bailing out
> 2025-04-29 02:03:26,522 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:26,540 ERROR [zk-event-processor-pool-0] 
> procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already 
> running. Bailing out
> reach
> 2025-04-29 02:03:28,134 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received created 
> event:/hbase/online-snapshot/reached/snapshotName__1745863344
> 2025-04-29 02:03:29,910 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:29,927 WARN [zk-event-processor-pool-0] 
> procedure.ProcedureMember: A completed old subproc snapshotName__1745863344 
> is still present, removing
> 2025-04-29 02:03:30,208 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:30,215 ERROR [zk-event-processor-pool-0] 
> procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already 
> running. Bailing out
> 2025-04-29 02:03:30,610 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:30,616 WARN [zk-event-processor-pool-0] 
> procedure.ProcedureMember: A completed old subproc snapshotName__1745863344 
> is still present, removing
> clean
> 2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received created 
> event:/hbase/online-snapshot/abort/snapshotName__1745863344
> 2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received procedure abort children changed 
> event: /hbase/online-snapshot/abort
> abort
> 2025-04-29 02:03:31,037 ERROR [zk-event-processor-pool-0] 
> procedure.ProcedureMember: Propagating foreign exception to subprocedure 
> snapshotName__1745863344
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via 
> rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
> {code}
>  
> *Client*
> {code:java}
> 2025-04-29 02:03:36,289 ERROR c.s.d.h.c.CommonUtils [RoutineTaskExecutor-7] 
> Error message org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: 
> Snapshot \{ ss=snapshotName__1745863344 table=table1 type=SKIPFLUSH ttl=0 } 
> had an error.  Procedure snapshotName__1745863344 \{ waiting=[] done=[] }
>         at 
> org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:402)
>         at 
> org.apache.hadoop.hbase.master.MasterRpcServices.isSnapshotDone(MasterRpcServices.java:1266)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:105)
>         at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:85)
> Caused by: 
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via 
> rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3250)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:718)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:704)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3294)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:1130)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:667)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:637)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:605)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1146)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1249)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1147)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3395)
>         at 
> org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:82)
>         at 
> org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:330)
>         at 
> org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:391)
>         ... 6 more
> Caused by: 
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
> {code}
> We can see that during subProcedure has completed but procedure has not yet 
> completed, this completed old subProcedure is removed and a duplicated 
> subProcedure is submitted. After procedure has completed, the duplicated 
> subProcedures is active and generating region manifest in the working 
> directory.  When taking snapshot has completed, master cleans out the working 
> directory which causes duplicated subProcedure to fail in generating the 
> region manifest and throw an exception. In the end,  the snapshot is 
> successfully created, but the subProcedure reportes an exception. The root 
> cause of all this is that after the members have reached the barrier, the 
> coordinator does not clean up the corresponding acquired barrier, which 
> results in the repeated submission of subProcedure.
> In addition, I find that during the entire process, there is no watch set on 
> node /hbase/online-snapshot/acquired/snapshotName__1745863344 , so it is 
> unnecessary to create this node to trigger its watch event before cleaning it 
> up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to