[
https://issues.apache.org/jira/browse/HBASE-29381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
huginn updated HBASE-29381:
---------------------------
Priority: Major (was: Minor)
> Repeated submission of subProcedures leads to concurrency issues
> ----------------------------------------------------------------
>
> Key: HBASE-29381
> URL: https://issues.apache.org/jira/browse/HBASE-29381
> Project: HBase
> Issue Type: Bug
> Components: snapshots
> Affects Versions: 2.4.11
> Reporter: huginn
> Priority: Major
> Labels: pull-request-available
> Attachments: HBASE-29381.patch
>
>
> When a subProcedure has completed but procedure has not yet completed, if a
> new different procedure is submitted during this window, it may lead to the
> repeated submission of the current subProcedure. As a result, duplicated
> subProcedure remain active when procedure has completed, which will cause
> concurrent conflicts.
> Here are some logs related to an issue that occurred during taking snapshot.
> *Master*
> {code:java}
> 2025-04-29 02:03:24,716 INFO [proc-coordinator-pool-25] procedure.Procedure:
> Starting procedure 'snapshotName__1745863344'
> acquire
> 2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25] procedure.Procedure:
> Procedure 'snapshotName__1745863344' starting 'acquire'
> 2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25]
> procedure.ZKProcedureCoordinator: Creating acquire
> znode:/hbase/online-snapshot/acquired/snapshotName__1745863344
> 2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25]
> procedure.ZKProcedureCoordinator: Watching for acquire
> node:/hbase/online-snapshot/acquired/snapshotName__1745863344/rs1
> 2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25]
> procedure.ZKProcedureCoordinator: Watching for acquire
> node:/hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
> 2025-04-29 02:03:27,670 DEBUG [zk-event-processor-pool-0]
> procedure.ZKProcedureCoordinator: Node created:
> /hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
> 2025-04-29 02:03:27,908 DEBUG [zk-event-processor-pool-0]
> procedure.ZKProcedureCoordinator: Node created:
> /hbase/online-snapshot/acquired/snapshotName__1745863344/rs1
> reach
> 2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25] procedure.Procedure:
> Procedure 'snapshotName__1745863344' starting 'in-barrier' execution.
> 2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25]
> procedure.ZKProcedureCoordinator: Creating reached barrier zk
> node:/hbase/online-snapshot/reached/snapshotName__1745863344
> 2025-04-29 02:03:30,477 DEBUG [zk-event-processor-pool-0]
> procedure.ZKProcedureCoordinator: Node created:
> /hbase/online-snapshot/reached/snapshotName__1745863344/rs1
> 2025-04-29 02:03:30,652 DEBUG [zk-event-processor-pool-0]
> procedure.ZKProcedureCoordinator: Node created:
> /hbase/online-snapshot/reached/snapshotName__1745863344/rs2
> complete
> 2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25] procedure.Procedure:
> Procedure 'snapshotName__1745863344' execution completed
> 2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25]
> procedure.ZKProcedureUtil: Clearing all znodes for procedure
> snapshotName__1745863344including nodes /hbase/online-snapshot/acquired
> /hbase/online-snapshot/reached /hbase/online-snapshot/abort
> 2025-04-29 02:03:30,906 INFO [MASTER_SNAPSHOT_OPERATIONS-master]
> snapshot.EnabledTableSnapshotHandler: Done waiting - online snapshot for
> snapshotName__1745863344
> 2025-04-29 02:03:31,750 INFO [MASTER_SNAPSHOT_OPERATIONS-master]
> snapshot.TakeSnapshotHandler: Snapshot snapshotName__1745863344 of table
> Test:table1 completed
>
> abort
> 2025-04-29 02:03:31,186 DEBUG [zk-event-processor-pool-0]
> procedure.ZKProcedureCoordinator: Node created:
> /hbase/online-snapshot/abort/snapshotName__1745863344
> 2025-04-29 02:03:31,218 DEBUG [zk-event-processor-pool-0]
> procedure.ProcedureCoordinator: abort procedure snapshotName__1745863344
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
> rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File
> does not exist:
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
> {code}
> *rs2*
> {code:java}
> acquire
> 2025-04-29 02:03:26,069 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:26,083 ERROR [zk-event-processor-pool-0]
> procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already
> running. Bailing out
> 2025-04-29 02:03:26,522 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:26,540 ERROR [zk-event-processor-pool-0]
> procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already
> running. Bailing out
> reach
> 2025-04-29 02:03:28,134 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received created
> event:/hbase/online-snapshot/reached/snapshotName__1745863344
> 2025-04-29 02:03:29,910 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:29,927 WARN [zk-event-processor-pool-0]
> procedure.ProcedureMember: A completed old subproc snapshotName__1745863344
> is still present, removing
> 2025-04-29 02:03:30,208 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:30,215 ERROR [zk-event-processor-pool-0]
> procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already
> running. Bailing out
> 2025-04-29 02:03:30,610 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:30,616 WARN [zk-event-processor-pool-0]
> procedure.ProcedureMember: A completed old subproc snapshotName__1745863344
> is still present, removing
> clean
> 2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received created
> event:/hbase/online-snapshot/abort/snapshotName__1745863344
> 2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0]
> procedure.ZKProcedureMemberRpcs: Received procedure abort children changed
> event: /hbase/online-snapshot/abort
> abort
> 2025-04-29 02:03:31,037 ERROR [zk-event-processor-pool-0]
> procedure.ProcedureMember: Propagating foreign exception to subprocedure
> snapshotName__1745863344
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
> rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File
> does not exist:
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
> {code}
>
> *Client*
> {code:java}
> 2025-04-29 02:03:36,289 ERROR c.s.d.h.c.CommonUtils [RoutineTaskExecutor-7]
> Error message org.apache.hadoop.hbase.snapshot.HBaseSnapshotException:
> Snapshot \{ ss=snapshotName__1745863344 table=table1 type=SKIPFLUSH ttl=0 }
> had an error. Procedure snapshotName__1745863344 \{ waiting=[] done=[] }
> at
> org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:402)
> at
> org.apache.hadoop.hbase.master.MasterRpcServices.isSnapshotDone(MasterRpcServices.java:1266)
> at
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:105)
> at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:85)
> Caused by:
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
> rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File
> does not exist:
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3250)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:718)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:704)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3294)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:1130)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:667)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:637)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:605)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1146)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1249)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3395)
> at
> org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:82)
> at
> org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:330)
> at
> org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:391)
> ... 6 more
> Caused by:
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File
> does not exist:
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
> {code}
> We can see that during subProcedure has completed but procedure has not yet
> completed, this completed old subProcedure is removed and a duplicated
> subProcedure is submitted. After procedure has completed, the duplicated
> subProcedures is active and generating region manifest in the working
> directory. When taking snapshot has completed, master cleans out the working
> directory which causes duplicated subProcedure to fail in generating the
> region manifest and throw an exception. In the end, the snapshot is
> successfully created, but the subProcedure reportes an exception. The root
> cause of all this is that after the members have reached the barrier, the
> coordinator does not clean up the corresponding acquired barrier, which
> results in the repeated submission of subProcedure.
> In addition, I find that during the entire process, there is no watch set on
> node /hbase/online-snapshot/acquired/snapshotName__1745863344 , so it is
> unnecessary to create this node to trigger its watch event before cleaning it
> up.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)