wardlican opened a new issue, #4001:
URL: https://github.com/apache/amoro/issues/4001

   ### What happened?
   
   A deadlock can occur between `TaskRuntime.stateLock` and `TableOptimizingProcess.lock`. When it happens, tables in the affected resource group stop optimizing, and OptimizerKeeper also fails to function properly.
   
   Thread1: Holding: 0x000000059a732858 Waiting: 0x000000059a739970
   ```
   stackTrace:
   java.lang.Thread.State: WAITING (parking)
   at sun.misc.Unsafe.park(Native Method)
   - parking to wait for <0x000000059a739970> (a 
java.util.concurrent.locks.ReentrantLock$NonfairSync)
   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
   at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
   at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
   at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
   at 
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
   at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
   at 
org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:48)
   at 
org.apache.amoro.server.optimizing.TaskRuntime.tryCanceling(TaskRuntime.java:148)
   at 
org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess$$Lambda$2143/1314391570.accept(Unknown
 Source)
   at java.util.HashMap$Values.forEach(HashMap.java:982)
   at 
org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.cancelTasks(OptimizingQueue.java:768)
   at 
org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.lambda$persistAndSetCompleted$13(OptimizingQueue.java:748)
   at 
org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess$$Lambda$2139/44375492.run(Unknown
 Source)
   at 
org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown
 Source)
   at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
   at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
   at 
org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
   at 
org.apache.amoro.server.optimizing.OptimizingQueue.access$500(OptimizingQueue.java:74)
   at 
org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.persistAndSetCompleted(OptimizingQueue.java:745)
   at 
org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.acceptResult(OptimizingQueue.java:539)
   at 
org.apache.amoro.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:104)
   at 
org.apache.amoro.server.optimizing.TaskRuntime$$Lambda$2004/1538729933.run(Unknown
 Source)
   at 
org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown
 Source)
   at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
   at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
   at 
org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
   at 
org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:51)
   at 
org.apache.amoro.server.optimizing.TaskRuntime.complete(TaskRuntime.java:77)
   at 
org.apache.amoro.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:266)
   at sun.reflect.GeneratedMethodAccessor354.invoke(Unknown Source)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at 
org.apache.amoro.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56)
   at com.sun.proxy.$Proxy60.completeTask(Unknown Source)
   at 
org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:629)
   at 
org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:605)
   at 
org.apache.amoro.shade.thrift.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:40)
   at 
org.apache.amoro.shade.thrift.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:40)
   at 
org.apache.amoro.shade.thrift.org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:147)
   at 
org.apache.amoro.shade.thrift.org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:492)
   at 
org.apache.amoro.shade.thrift.org.apache.thrift.server.Invocation.run(Invocation.java:19)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:750)
   Locked ownable synchronizers:
   - <0x000000059a732858> (a 
java.util.concurrent.locks.ReentrantLock$NonfairSync)
   - <0x000000059a79be88> (a 
java.util.concurrent.locks.ReentrantLock$NonfairSync)
   - <0x000000059bbf71e8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
   ```
   
   Thread2: Holding: 0x000000059a739970 Waiting: 0x000000059a732858
   ```
   stackTrace:
   java.lang.Thread.State: WAITING (parking)
   at sun.misc.Unsafe.park(Native Method)
   - parking to wait for <0x000000059a732858> (a 
java.util.concurrent.locks.ReentrantLock$NonfairSync)
   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
   at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
   at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
   at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
   at 
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
   at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
   at 
org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.acceptResult(OptimizingQueue.java:487)
   at 
org.apache.amoro.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:104)
   at 
org.apache.amoro.server.optimizing.TaskRuntime$$Lambda$2004/1538729933.run(Unknown
 Source)
   at 
org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown
 Source)
   at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
   at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
   at 
org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
   at 
org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:51)
   at 
org.apache.amoro.server.optimizing.TaskRuntime.complete(TaskRuntime.java:77)
   at 
org.apache.amoro.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:266)
   at sun.reflect.GeneratedMethodAccessor354.invoke(Unknown Source)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at 
org.apache.amoro.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56)
   at com.sun.proxy.$Proxy60.completeTask(Unknown Source)
   at 
org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:629)
   at 
org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:605)
   at 
org.apache.amoro.shade.thrift.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:40)
   at 
org.apache.amoro.shade.thrift.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:40)
   at 
org.apache.amoro.shade.thrift.org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:147)
   at 
org.apache.amoro.shade.thrift.org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:492)
   at 
org.apache.amoro.shade.thrift.org.apache.thrift.server.Invocation.run(Invocation.java:19)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:750)
   Locked ownable synchronizers:
   - <0x000000059a1e0228> (a java.util.concurrent.ThreadPoolExecutor$Worker)
   - <0x000000059a739970> (a 
java.util.concurrent.locks.ReentrantLock$NonfairSync)
   ```
   
   ### The root cause of the deadlock

   #### Nested lock acquisition leading to a circular wait
   
   - Thread 1 holds a `TaskRuntime.stateLock` (0x000000059a732858) while waiting for `TableOptimizingProcess.lock` (0x000000059a739970).
   - Thread 2 holds `TableOptimizingProcess.lock` (0x000000059a739970) while waiting for a `TaskRuntime.stateLock` (0x000000059a732858).
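
   This circular wait can be reproduced in isolation. The sketch below is illustrative only: the two plain `ReentrantLock`s are hypothetical stand-ins for the two Amoro locks, and the class and helper names are invented. It uses `tryLock` with a timeout so the demo terminates instead of hanging the way the real threads do:

   ```java
   import java.util.concurrent.CountDownLatch;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.locks.ReentrantLock;

   public class CircularWaitDemo {
       // Hypothetical stand-ins for TaskRuntime.stateLock and TableOptimizingProcess.lock
       static final ReentrantLock stateLock = new ReentrantLock();
       static final ReentrantLock processLock = new ReentrantLock();

       static final CountDownLatch bothHolding = new CountDownLatch(2);
       static final CountDownLatch bothTimedOut = new CountDownLatch(2);
       static final boolean[] acquired = new boolean[2];

       // Each thread locks its "own" lock, waits until the other thread holds its
       // lock too, then tries to take the other lock -- the crossed order from the dump.
       static Runnable crossedPath(int id, ReentrantLock own, ReentrantLock other) {
           return () -> {
               own.lock();
               try {
                   bothHolding.countDown();
                   await(bothHolding);              // both locks are now held
                   acquired[id] = tryLock(other);   // must time out: the other thread holds it
                   bothTimedOut.countDown();
                   await(bothTimedOut);             // keep holding until both have timed out
               } finally {
                   own.unlock();
               }
           };
       }

       static void await(CountDownLatch latch) {
           try { latch.await(); } catch (InterruptedException e) { throw new RuntimeException(e); }
       }

       static boolean tryLock(ReentrantLock lock) {
           try {
               if (lock.tryLock(500, TimeUnit.MILLISECONDS)) { lock.unlock(); return true; }
               return false;
           } catch (InterruptedException e) { throw new RuntimeException(e); }
       }

       public static void main(String[] args) throws InterruptedException {
           // Thread 1 mirrors complete(): stateLock first, then processLock.
           Thread t1 = new Thread(crossedPath(0, stateLock, processLock));
           // Thread 2 mirrors cancelTasks(): processLock first, then stateLock.
           Thread t2 = new Thread(crossedPath(1, processLock, stateLock));
           t1.start(); t2.start();
           t1.join(); t2.join();
           // With plain lock() instead of tryLock(), both threads would hang forever.
           System.out.println("circular wait demonstrated: " + (!acquired[0] && !acquired[1]));
       }
   }
   ```

   Replacing both `tryLock` calls with plain `lock()` reproduces the permanent hang seen in the thread dump above.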
   
   #### Specific problem
   
   `acceptResult()` calls `persistAndSetCompleted(false)` while holding `TableOptimizingProcess.lock`. That method invokes `cancelTasks()` inside `doAsTransaction()`, and `cancelTasks()` iterates over all tasks, calling `TaskRuntime.tryCanceling()` on each one; `tryCanceling()` must acquire that `TaskRuntime`'s `stateLock`.

   If another thread (Thread 1 in the dump above) already holds a `TaskRuntime`'s `stateLock` and is waiting for `TableOptimizingProcess.lock`, the two threads wait on each other indefinitely and a deadlock occurs.
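
   A conventional remedy for this kind of circular wait (a sketch only, not necessarily the right fix for Amoro's code paths) is to impose a single global lock order, so that every path acquires the process-level lock before any task-level `stateLock`. The names below are hypothetical stand-ins:

   ```java
   import java.util.concurrent.locks.ReentrantLock;

   public class LockOrderingSketch {
       // Hypothetical stand-ins for TableOptimizingProcess.lock and TaskRuntime.stateLock
       static final ReentrantLock processLock = new ReentrantLock();
       static final ReentrantLock stateLock = new ReentrantLock();

       // Rule: every path takes processLock first, stateLock second.
       static void lockedInOrder(Runnable body) {
           processLock.lock();
           try {
               stateLock.lock();
               try {
                   body.run();
               } finally {
                   stateLock.unlock();
               }
           } finally {
               processLock.unlock();
           }
       }

       public static void main(String[] args) throws InterruptedException {
           // One thread plays the complete()/acceptResult() path,
           // the other plays the cancelTasks() path; both use the same lock order.
           Thread t1 = new Thread(() -> lockedInOrder(() -> { /* accept result */ }));
           Thread t2 = new Thread(() -> lockedInOrder(() -> { /* cancel task */ }));
           t1.start(); t2.start();
           t1.join(5000); t2.join(5000);
           System.out.println("both threads finished: " + (!t1.isAlive() && !t2.isAlive()));
       }
   }
   ```

   With a consistent order, one thread may briefly block behind the other, but a circular wait is impossible.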
   
   ### Affects Versions
   
   0.7.1
   
   ### What table formats are you seeing the problem on?
   
   Iceberg
   
   ### What engines are you seeing the problem on?
   
   AMS
   
   ### How to reproduce
   
   _No response_
   
   ### Relevant log output
   
   ```shell
   
   ```
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's Code of Conduct

