[
https://issues.apache.org/jira/browse/HBASE-29381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
huginn updated HBASE-29381:
---------------------------
Description:
When a subProcedure has completed but procedure has not yet completed, if a new
different procedure is submitted during this window, it may lead to the
repeated submission of the current subProcedure. As a result, duplicated
subProcedure remain active when procedure has completed, which will cause
concurrent conflicts.
Here are some logs related to an issue that occurred during taking snapshot.
*Master*
{code:java}
2025-04-29 02:03:24,716 INFO [proc-coordinator-pool-25] procedure.Procedure:
Starting procedure 'snapshotName__1745863344'
acquire
2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25] procedure.Procedure:
Procedure 'snapshotName__1745863344' starting 'acquire'
2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25]
procedure.ZKProcedureCoordinator: Creating acquire
znode:/hbase/online-snapshot/acquired/snapshotName__1745863344
2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25]
procedure.ZKProcedureCoordinator: Watching for acquire
node:/hbase/online-snapshot/acquired/snapshotName__1745863344/rs1
2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25]
procedure.ZKProcedureCoordinator: Watching for acquire
node:/hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
2025-04-29 02:03:27,670 DEBUG [zk-event-processor-pool-0]
procedure.ZKProcedureCoordinator: Node created:
/hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
2025-04-29 02:03:27,908 DEBUG [zk-event-processor-pool-0]
procedure.ZKProcedureCoordinator: Node created:
/hbase/online-snapshot/acquired/snapshotName__1745863344/rs1
reach
2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25] procedure.Procedure:
Procedure 'snapshotName__1745863344' starting 'in-barrier' execution.
2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25]
procedure.ZKProcedureCoordinator: Creating reached barrier zk
node:/hbase/online-snapshot/reached/snapshotName__1745863344
2025-04-29 02:03:30,477 DEBUG [zk-event-processor-pool-0]
procedure.ZKProcedureCoordinator: Node created:
/hbase/online-snapshot/reached/snapshotName__1745863344/rs1
2025-04-29 02:03:30,652 DEBUG [zk-event-processor-pool-0]
procedure.ZKProcedureCoordinator: Node created:
/hbase/online-snapshot/reached/snapshotName__1745863344/rs2
complete
2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25] procedure.Procedure:
Procedure 'snapshotName__1745863344' execution completed
2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25]
procedure.ZKProcedureUtil: Clearing all znodes for procedure
snapshotName__1745863344including nodes /hbase/online-snapshot/acquired
/hbase/online-snapshot/reached /hbase/online-snapshot/abort
2025-04-29 02:03:30,906 INFO [MASTER_SNAPSHOT_OPERATIONS-master]
snapshot.EnabledTableSnapshotHandler: Done waiting - online snapshot for
snapshotName__1745863344
2025-04-29 02:03:31,750 INFO [MASTER_SNAPSHOT_OPERATIONS-master]
snapshot.TakeSnapshotHandler: Snapshot snapshotName__1745863344 of table
Test:table1 completed
abort
2025-04-29 02:03:31,186 DEBUG [zk-event-processor-pool-0]
procedure.ZKProcedureCoordinator: Node created:
/hbase/online-snapshot/abort/snapshotName__1745863344
2025-04-29 02:03:31,218 DEBUG [zk-event-processor-pool-0]
procedure.ProcedureCoordinator: abort procedure snapshotName__1745863344
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does
not exist:
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
{code}
*rs2*
{code:java}
acquire
2025-04-29 02:03:26,069 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received procedure start children changed
event: /hbase/online-snapshot/acquired
2025-04-29 02:03:26,083 ERROR [zk-event-processor-pool-0]
procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already
running. Bailing out
2025-04-29 02:03:26,522 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received procedure start children changed
event: /hbase/online-snapshot/acquired
2025-04-29 02:03:26,540 ERROR [zk-event-processor-pool-0]
procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already
running. Bailing out
reach
2025-04-29 02:03:28,134 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received created
event:/hbase/online-snapshot/reached/snapshotName__1745863344
2025-04-29 02:03:29,910 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received procedure start children changed
event: /hbase/online-snapshot/acquired
2025-04-29 02:03:29,927 WARN [zk-event-processor-pool-0]
procedure.ProcedureMember: A completed old subproc snapshotName__1745863344 is
still present, removing
2025-04-29 02:03:30,208 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received procedure start children changed
event: /hbase/online-snapshot/acquired
2025-04-29 02:03:30,215 ERROR [zk-event-processor-pool-0]
procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already
running. Bailing out
2025-04-29 02:03:30,610 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received procedure start children changed
event: /hbase/online-snapshot/acquired
2025-04-29 02:03:30,616 WARN [zk-event-processor-pool-0]
procedure.ProcedureMember: A completed old subproc snapshotName__1745863344 is
still present, removing
clean
2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received created
event:/hbase/online-snapshot/abort/snapshotName__1745863344
2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received procedure abort children changed
event: /hbase/online-snapshot/abort
abort
2025-04-29 02:03:31,037 ERROR [zk-event-processor-pool-0]
procedure.ProcedureMember: Propagating foreign exception to subprocedure
snapshotName__1745863344
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does
not exist:
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
{code}
*Client*
{code:java}
2025-04-29 02:03:36,289 ERROR c.s.d.h.c.CommonUtils [RoutineTaskExecutor-7]
Error message org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot
\{ ss=snapshotName__1745863344 table=table1 type=SKIPFLUSH ttl=0 } had an
error. Procedure snapshotName__1745863344 \{ waiting=[] done=[] }
at
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:402)
at
org.apache.hadoop.hbase.master.MasterRpcServices.isSnapshotDone(MasterRpcServices.java:1266)
at
org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:105)
at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:85)
Caused by:
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does
not exist:
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3250)
at
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:718)
at
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:704)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3294)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:1130)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:667)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:637)
at
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:605)
at
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1146)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1249)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3395)
at
org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:82)
at
org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:330)
at
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:391)
... 6 more
Caused by:
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does
not exist:
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
{code}
We can see that during subProcedure has completed but procedure has not yet
completed, this completed old subProcedure is removed and a duplicated
subProcedure is submitted. After procedure has completed, the duplicated
subProcedures is active and generating region manifest in the working
directory. When taking snapshot has completed, master cleans out the working
directory which causes duplicated subProcedure to fail in generating the region
manifest and throw an exception. In the end, the snapshot is successfully
created, but the subProcedure reportes an exception. The root cause of all this
is that after the members have reached the barrier, the coordinator does not
clean up the corresponding acquired barrier, which results in the repeated
submission of subProcedure.
In addition, I find that during the entire process, there is no watch set on
node /hbase/online-snapshot/acquired/snapshotName__1745863344 , so it is
unnecessary to create this node to trigger its watch event before cleaning it
up.
was:
When a subProcedure has completed but procedure has not yet completed, if a new
different procedure is submitted during this window, it may lead to the
repeated submission of the current subProcedure. As a result, duplicated
subProcedure remain active when procedure has completed, which will cause
concurrent conflicts.
Here are some logs related to an issue that occurred during taking snapshot.
*Master*
{code:java}
2025-04-29 02:03:24,716 INFO [proc-coordinator-pool-25] procedure.Procedure:
Starting procedure 'snapshotName__1745863344'
acquire
2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25] procedure.Procedure:
Procedure 'snapshotName__1745863344' starting 'acquire'
2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25]
procedure.ZKProcedureCoordinator: Creating acquire
znode:/shopee-hbase/online-snapshot/acquired/snapshotName__1745863344
2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25]
procedure.ZKProcedureCoordinator: Watching for acquire
node:/shopee-hbase/online-snapshot/acquired/snapshotName__1745863344/rs1
2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25]
procedure.ZKProcedureCoordinator: Watching for acquire
node:/shopee-hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
2025-04-29 02:03:27,670 DEBUG [zk-event-processor-pool-0]
procedure.ZKProcedureCoordinator: Node created:
/shopee-hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
2025-04-29 02:03:27,908 DEBUG [zk-event-processor-pool-0]
procedure.ZKProcedureCoordinator: Node created:
/shopee-hbase/online-snapshot/acquired/snapshotName__1745863344/rs1
reach
2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25] procedure.Procedure:
Procedure 'snapshotName__1745863344' starting 'in-barrier' execution.
2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25]
procedure.ZKProcedureCoordinator: Creating reached barrier zk
node:/shopee-hbase/online-snapshot/reached/snapshotName__1745863344
2025-04-29 02:03:30,477 DEBUG [zk-event-processor-pool-0]
procedure.ZKProcedureCoordinator: Node created:
/shopee-hbase/online-snapshot/reached/snapshotName__1745863344/rs1
2025-04-29 02:03:30,652 DEBUG [zk-event-processor-pool-0]
procedure.ZKProcedureCoordinator: Node created:
/shopee-hbase/online-snapshot/reached/snapshotName__1745863344/rs2
complete
2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25] procedure.Procedure:
Procedure 'snapshotName__1745863344' execution completed
2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25]
procedure.ZKProcedureUtil: Clearing all znodes for procedure
snapshotName__1745863344including nodes /shopee-hbase/online-snapshot/acquired
/shopee-hbase/online-snapshot/reached /shopee-hbase/online-snapshot/abort
2025-04-29 02:03:30,906 INFO [MASTER_SNAPSHOT_OPERATIONS-master]
snapshot.EnabledTableSnapshotHandler: Done waiting - online snapshot for
snapshotName__1745863344
2025-04-29 02:03:31,750 INFO [MASTER_SNAPSHOT_OPERATIONS-master]
snapshot.TakeSnapshotHandler: Snapshot snapshotName__1745863344 of table
Test:table1 completed
abort
2025-04-29 02:03:31,186 DEBUG [zk-event-processor-pool-0]
procedure.ZKProcedureCoordinator: Node created:
/shopee-hbase/online-snapshot/abort/snapshotName__1745863344
2025-04-29 02:03:31,218 DEBUG [zk-event-processor-pool-0]
procedure.ProcedureCoordinator: abort procedure snapshotName__1745863344
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does
not exist:
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
{code}
*rs2*
{code:java}
acquire
2025-04-29 02:03:26,069 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received procedure start children changed
event: /shopee-hbase/online-snapshot/acquired
2025-04-29 02:03:26,083 ERROR [zk-event-processor-pool-0]
procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already
running. Bailing out
2025-04-29 02:03:26,522 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received procedure start children changed
event: /shopee-hbase/online-snapshot/acquired
2025-04-29 02:03:26,540 ERROR [zk-event-processor-pool-0]
procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already
running. Bailing out
reach
2025-04-29 02:03:28,134 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received created
event:/shopee-hbase/online-snapshot/reached/snapshotName__1745863344
2025-04-29 02:03:29,910 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received procedure start children changed
event: /shopee-hbase/online-snapshot/acquired
2025-04-29 02:03:29,927 WARN [zk-event-processor-pool-0]
procedure.ProcedureMember: A completed old subproc snapshotName__1745863344 is
still present, removing
2025-04-29 02:03:30,208 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received procedure start children changed
event: /shopee-hbase/online-snapshot/acquired
2025-04-29 02:03:30,215 ERROR [zk-event-processor-pool-0]
procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already
running. Bailing out
2025-04-29 02:03:30,610 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received procedure start children changed
event: /shopee-hbase/online-snapshot/acquired
2025-04-29 02:03:30,616 WARN [zk-event-processor-pool-0]
procedure.ProcedureMember: A completed old subproc snapshotName__1745863344 is
still present, removing
clean
2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received created
event:/shopee-hbase/online-snapshot/abort/snapshotName__1745863344
2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0]
procedure.ZKProcedureMemberRpcs: Received procedure abort children changed
event: /shopee-hbase/online-snapshot/abort
abort
2025-04-29 02:03:31,037 ERROR [zk-event-processor-pool-0]
procedure.ProcedureMember: Propagating foreign exception to subprocedure
snapshotName__1745863344
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does
not exist:
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
{code}
*Client*
{code:java}
2025-04-29 02:03:36,289 ERROR c.s.d.h.c.CommonUtils [RoutineTaskExecutor-7]
Error message org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot
\{ ss=snapshotName__1745863344 table=table1 type=SKIPFLUSH ttl=0 } had an
error. Procedure snapshotName__1745863344 \{ waiting=[] done=[] }
at
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:402)
at
org.apache.hadoop.hbase.master.MasterRpcServices.isSnapshotDone(MasterRpcServices.java:1266)
at
org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:105)
at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:85)
Caused by:
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does
not exist:
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3250)
at
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:718)
at
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:704)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3294)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:1130)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:667)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:637)
at
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:605)
at
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1146)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1249)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3395)
at
org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:82)
at
org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:330)
at
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:391)
... 6 more
Caused by:
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does
not exist:
/hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
{code}
We can see that during subProcedure has completed but procedure has not yet
completed, this completed old subProcedure is removed and a duplicated
subProcedure is submitted. After procedure has completed, the duplicated
subProcedures is active and generating region manifest in the working
directory. When taking snapshot has completed, master cleans out the working
directory which causes duplicated subProcedure to fail in generating the region
manifest and throw an exception. In the end, the snapshot is successfully
created, but the subProcedure reportes an exception. The root cause of all this
is that after the members have reached the barrier, the coordinator does not
clean up the corresponding acquired barrier, which results in the repeated
submission of subProcedure.
In addition, I find that during the entire process, there is no watch set on
node /shopee-hbase/online-snapshot/acquired/snapshotName__1745863344 , so it is
unnecessary to create this node to trigger its watch event before cleaning it
up.
> Repeated submission of subProcedures leads to concurrency issues
> ----------------------------------------------------------------
>
> Key: HBASE-29381
> URL: https://issues.apache.org/jira/browse/HBASE-29381
> Project: HBase
> Issue Type: Bug
> Components: snapshots
> Affects Versions: 2.4.11
> Reporter: huginn
> Priority: Minor
> Labels: pull-request-available
>
> When a subProcedure has completed but procedure has not yet completed, if a
> new different procedure is submitted during this window, it may lead to the
> repeated submission of the current subProcedure. As a result, duplicated
> subProcedure remain active when procedure has completed, which will cause
> concurrent conflicts.
> Here are some logs related to an issue that occurred during taking snapshot.
> *Master*
> {code:java}
> 2025-04-29 02:03:24,716 INFO [proc-coordinator-pool-25] procedure.Procedure:
> Starting procedure 'snapshotName__1745863344'
> acquire
> 2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25] procedure.Procedure:
> Procedure 'snapshotName__1745863344' starting 'acquire'
> 2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25]
> procedure.ZKProcedureCoordinator: Creating acquire
> znode:/hbase/online-snapshot/acquired/snapshotName__1745863344
> 2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25]
> procedure.ZKProcedureCoordinator: Watching for acquire
> node:/hbase/online-snapshot/acquired/snapshotName__1745863344/rs1
> 2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25]
> procedure.ZKProcedureCoordinator: Watching for acquire
> node:/hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
> 2025-04-29 02:03:27,670 DEBUG [zk-event-processor-pool-0]
> procedure.ZKProcedureCoordinator: Node created:
> /hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
> 2025-04-29 02:03:27,908 DEBUG [zk-event-processor-pool-0]
> procedure.ZKProcedureCoordinator: Node created:
> /hbase/online-snapshot/acquired/snapshotName__1745863344/rs1
> reach
> 2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25] procedure.Procedure:
> Procedure 'snapshotName__1745863344' starting 'in-barrier' execution.
> 2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25]
> procedure.ZKProcedureCoordinator: Creating reached barrier zk
> node:/hbase/online-snapshot/reached/snapshotName__1745863344
> 2025-04-29 02:03:30,477 DEBUG [zk-event-processor-pool-0]
> procedure.ZKProcedureCoordinator: Node created:
> /hbase/online-snapshot/reached/snapshotName__1745863344/rs1
> 2025-04-29 02:03:30,652 DEBUG [zk-event-processor-pool-0]
> procedure.ZKProcedureCoordinator: Node created:
> /hbase/online-snapshot/reached/snapshotName__1745863344/rs2
> complete
> 2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25] procedure.Procedure:
> Procedure 'snapshotName__1745863344' execution completed
> 2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25]
> procedure.ZKProcedureUtil: Clearing all znodes for procedure
> snapshotName__1745863344including nodes /hbase/online-snapshot/acquired
> /hbase/online-snapshot/reached /hbase/online-snapshot/abort
> 2025-04-29 02:03:30,906 INFO [MASTER_SNAPSHOT_OPERATIONS-master]
> snapshot.EnabledTableSnapshotHandler: Done waiting - online snapshot for
> snapshotName__1745863344
> 2025-04-29 02:03:31,750 INFO [MASTER_SNAPSHOT_OPERATIONS-master]
> snapshot.TakeSnapshotHandler: Snapshot snapshotName__1745863344 of table
> Test:table1 completed
>
> abort
> 2025-04-29 02:03:31,186 DEBUG [zk-event-processor-pool-0]
> procedure.ZKProcedureCoordinator: Node created:
> /hbase/online-snapshot/abort/snapshotName__1745863344
> 2025-04-29 02:03:31,218 DEBUG [zk-event-processor-pool-0]
> procedure.ProcedureCoordinator: abort procedure snapshotName__1745863344
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
> rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File
> does not exist:
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
> {code}
> *rs2*
> {code:java}
> acquire
> 2025-04-29 02:03:26,069 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:26,083 ERROR [zk-event-processor-pool-0]
> procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already
> running. Bailing out
> 2025-04-29 02:03:26,522 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:26,540 ERROR [zk-event-processor-pool-0]
> procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already
> running. Bailing out
> reach
> 2025-04-29 02:03:28,134 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received created
> event:/hbase/online-snapshot/reached/snapshotName__1745863344
> 2025-04-29 02:03:29,910 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:29,927 WARN [zk-event-processor-pool-0]
> procedure.ProcedureMember: A completed old subproc snapshotName__1745863344
> is still present, removing
> 2025-04-29 02:03:30,208 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:30,215 ERROR [zk-event-processor-pool-0]
> procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already
> running. Bailing out
> 2025-04-29 02:03:30,610 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:30,616 WARN [zk-event-processor-pool-0]
> procedure.ProcedureMember: A completed old subproc snapshotName__1745863344
> is still present, removing
> clean
> 2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received created
> event:/hbase/online-snapshot/abort/snapshotName__1745863344
> 2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received procedure abort children changed
> event: /hbase/online-snapshot/abort
> abort
> 2025-04-29 02:03:31,037 ERROR [zk-event-processor-pool-0]
> procedure.ProcedureMember: Propagating foreign exception to subprocedure
> snapshotName__1745863344
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
> rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File
> does not exist:
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
> {code}
>
> *Client*
> {code:java}
> 2025-04-29 02:03:36,289 ERROR c.s.d.h.c.CommonUtils [RoutineTaskExecutor-7]
> Error message org.apache.hadoop.hbase.snapshot.HBaseSnapshotException:
> Snapshot \{ ss=snapshotName__1745863344 table=table1 type=SKIPFLUSH ttl=0 }
> had an error. Procedure snapshotName__1745863344 \{ waiting=[] done=[] }
> at
> org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:402)
> at
> org.apache.hadoop.hbase.master.MasterRpcServices.isSnapshotDone(MasterRpcServices.java:1266)
> at
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:105)
> at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:85)
> Caused by:
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
> rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File
> does not exist:
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3250)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:718)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:704)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3294)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:1130)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:667)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:637)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:605)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1146)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1249)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3395)
> at
> org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:82)
> at
> org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:330)
> at
> org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:391)
> ... 6 more
> Caused by:
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File
> does not exist:
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
> {code}
> We can see that during subProcedure has completed but procedure has not yet
> completed, this completed old subProcedure is removed and a duplicated
> subProcedure is submitted. After procedure has completed, the duplicated
> subProcedures is active and generating region manifest in the working
> directory. When taking snapshot has completed, master cleans out the working
> directory which causes duplicated subProcedure to fail in generating the
> region manifest and throw an exception. In the end, the snapshot is
> successfully created, but the subProcedure reportes an exception. The root
> cause of all this is that after the members have reached the barrier, the
> coordinator does not clean up the corresponding acquired barrier, which
> results in the repeated submission of subProcedure.
> In addition, I find that during the entire process, there is no watch set on
> node /hbase/online-snapshot/acquired/snapshotName__1745863344 , so it is
> unnecessary to create this node to trigger its watch event before cleaning it
> up.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)