wardlican opened a new issue, #4001: URL: https://github.com/apache/amoro/issues/4001
### What happened?

`TaskRuntime.stateLock` and `TableOptimizingProcess.lock` can deadlock against each other. When this happens, all tables in the affected resource group stop optimizing, and OptimizerKeeper also fails to function properly.

Thread 1 — holding `0x000000059a732858`, waiting for `0x000000059a739970`:

```
stackTrace:
java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x000000059a739970> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:48)
    at org.apache.amoro.server.optimizing.TaskRuntime.tryCanceling(TaskRuntime.java:148)
    at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess$$Lambda$2143/1314391570.accept(Unknown Source)
    at java.util.HashMap$Values.forEach(HashMap.java:982)
    at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.cancelTasks(OptimizingQueue.java:768)
    at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.lambda$persistAndSetCompleted$13(OptimizingQueue.java:748)
    at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess$$Lambda$2139/44375492.run(Unknown Source)
    at org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown Source)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
    at org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
    at org.apache.amoro.server.optimizing.OptimizingQueue.access$500(OptimizingQueue.java:74)
    at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.persistAndSetCompleted(OptimizingQueue.java:745)
    at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.acceptResult(OptimizingQueue.java:539)
    at org.apache.amoro.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:104)
    at org.apache.amoro.server.optimizing.TaskRuntime$$Lambda$2004/1538729933.run(Unknown Source)
    at org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown Source)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
    at org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
    at org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:51)
    at org.apache.amoro.server.optimizing.TaskRuntime.complete(TaskRuntime.java:77)
    at org.apache.amoro.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:266)
    at sun.reflect.GeneratedMethodAccessor354.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.amoro.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56)
    at com.sun.proxy.$Proxy60.completeTask(Unknown Source)
    at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:629)
    at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:605)
    at org.apache.amoro.shade.thrift.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:40)
    at org.apache.amoro.shade.thrift.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:40)
    at org.apache.amoro.shade.thrift.org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:147)
    at org.apache.amoro.shade.thrift.org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:492)
    at org.apache.amoro.shade.thrift.org.apache.thrift.server.Invocation.run(Invocation.java:19)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

   Locked ownable synchronizers:
    - <0x000000059a732858> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    - <0x000000059a79be88> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    - <0x000000059bbf71e8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
```

Thread 2 — holding `0x000000059a739970`, waiting for `0x000000059a732858`:

```
stackTrace:
java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x000000059a732858> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.acceptResult(OptimizingQueue.java:487)
    at org.apache.amoro.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:104)
    at org.apache.amoro.server.optimizing.TaskRuntime$$Lambda$2004/1538729933.run(Unknown Source)
    at org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown Source)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
    at org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
    at org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:51)
    at org.apache.amoro.server.optimizing.TaskRuntime.complete(TaskRuntime.java:77)
    at org.apache.amoro.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:266)
    at sun.reflect.GeneratedMethodAccessor354.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.amoro.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56)
    at com.sun.proxy.$Proxy60.completeTask(Unknown Source)
    at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:629)
    at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:605)
    at org.apache.amoro.shade.thrift.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:40)
    at org.apache.amoro.shade.thrift.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:40)
    at org.apache.amoro.shade.thrift.org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:147)
    at org.apache.amoro.shade.thrift.org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:492)
    at org.apache.amoro.shade.thrift.org.apache.thrift.server.Invocation.run(Invocation.java:19)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

   Locked ownable synchronizers:
    - <0x000000059a1e0228> (a java.util.concurrent.ThreadPoolExecutor$Worker)
    - <0x000000059a739970> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
```

## The root cause of the deadlock

### Nested lock acquisition leading to a circular wait

- Thread 1 holds a `TaskRuntime.stateLock` and is waiting for `TableOptimizingProcess.lock`.
- Thread 2 holds `TableOptimizingProcess.lock` and is waiting for that `TaskRuntime.stateLock`.

### Specific problem

In the `acceptResult()` method, `persistAndSetCompleted(false)` is called while `TableOptimizingProcess.lock` is held. That method calls `cancelTasks()` inside `doAsTransaction()`, and `cancelTasks()` iterates over all tasks and calls `TaskRuntime.tryCanceling()`, which must acquire each task's `stateLock`. If another thread (Thread 1) already holds one of those `stateLock`s and is itself waiting for `TableOptimizingProcess.lock`, the two threads deadlock.

### Affects Versions

0.7.1

### What table formats are you seeing the problem on?

Iceberg

### What engines are you seeing the problem on?

AMS

### How to reproduce

_No response_

### Relevant log output

```shell
```

### Anything else

_No response_

### Are you willing to submit a PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's Code of Conduct

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
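The circular wait reported above boils down to two threads taking the same pair of `ReentrantLock`s in opposite orders. The following is a minimal, self-contained Java sketch of that pattern, not the actual Amoro code: the lock fields only stand in for `TaskRuntime.stateLock` and `TableOptimizingProcess.lock`, and `tryLock` with a timeout replaces the real `lock()` calls so the demo terminates and reports the circular wait instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockSketch {
    // Hypothetical stand-ins for TaskRuntime.stateLock and
    // TableOptimizingProcess.lock (names are illustrative only).
    static final ReentrantLock stateLock = new ReentrantLock();
    static final ReentrantLock processLock = new ReentrantLock();
    // Ensures both threads hold their first lock before trying the second,
    // so the circular wait is reproduced deterministically.
    static final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

    public static void main(String[] args) throws InterruptedException {
        // Thread 1 mirrors complete(): stateLock first, then processLock.
        Thread t1 = new Thread(() -> acquireInOrder(stateLock, processLock, "Thread-1"));
        // Thread 2 mirrors acceptResult(): processLock first, then stateLock.
        Thread t2 = new Thread(() -> acquireInOrder(processLock, stateLock, "Thread-2"));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }

    static void acquireInOrder(ReentrantLock first, ReentrantLock second, String name) {
        first.lock();
        try {
            bothHoldFirstLock.countDown();
            bothHoldFirstLock.await();
            // The real code calls lock() here, which would block forever;
            // the timeout lets this sketch detect the cycle and exit.
            if (second.tryLock(500, TimeUnit.MILLISECONDS)) {
                try {
                    System.out.println(name + ": acquired both locks");
                } finally {
                    second.unlock();
                }
            } else {
                System.out.println(name + ": circular wait, lock() would deadlock here");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }
}
```

The usual remedies map directly onto this sketch: either enforce one global acquisition order for the two locks, or have `persistAndSetCompleted()` cancel tasks outside the critical section so `stateLock` is never requested while `TableOptimizingProcess.lock` is held.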
