zuston commented on PR #1958: URL: https://github.com/apache/incubator-uniffle/pull/1958#issuecomment-2251830818
> Just out of curiosity, why does disk corruption cause threads to hang? What exactly is the jstack trace like?

jstack

```
"Grpc-6" #541 daemon prio=5 os_prio=0 cpu=81467.27ms elapsed=122315.90s tid=0x00007f32cf1cb000 nid=0x236b2 waiting for monitor entry [0x00007f22ef8c4000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.uniffle.server.buffer.ShuffleBufferManager.requireMemory(ShuffleBufferManager.java:336)
    - waiting to lock <0x00007f242ac01a00> (a org.apache.uniffle.server.buffer.ShuffleBufferManager)
    at org.apache.uniffle.server.ShuffleTaskManager.requireBuffer(ShuffleTaskManager.java:512)
    at org.apache.uniffle.server.ShuffleTaskManager.requireBuffer(ShuffleTaskManager.java:507)
    at org.apache.uniffle.server.ShuffleServerGrpcService.requireBuffer(ShuffleServerGrpcService.java:428)
    at org.apache.uniffle.proto.ShuffleServerGrpc$MethodHandlers.invoke(ShuffleServerGrpc.java:1150)
    at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
    at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
    at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
    at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:352)
    at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:866)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run([email protected]/Thread.java:829)

"Grpc-886" #1424 daemon prio=5 os_prio=0 cpu=77330.55ms elapsed=122300.97s tid=0x00007f2318b82000 nid=0x24a20 waiting on condition [0x00007f22ac5c6000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
    - parking to wait for <0x00007f24c71ecc38> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared([email protected]/AbstractQueuedSynchronizer.java:1009)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared([email protected]/AbstractQueuedSynchronizer.java:1324)
    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock([email protected]/ReentrantReadWriteLock.java:738)
    at org.apache.uniffle.server.buffer.ShuffleBufferManager.flushBuffer(ShuffleBufferManager.java:293)
    at org.apache.uniffle.server.buffer.ShuffleBufferManager.flush(ShuffleBufferManager.java:460)
    - locked <0x00007f242ac01a00> (a org.apache.uniffle.server.buffer.ShuffleBufferManager)
    at org.apache.uniffle.server.buffer.ShuffleBufferManager.flushIfNecessary(ShuffleBufferManager.java:266)
    at org.apache.uniffle.server.buffer.ShuffleBufferManager.cacheShuffleData(ShuffleBufferManager.java:185)
    - locked <0x00007f242ac01a00> (a org.apache.uniffle.server.buffer.ShuffleBufferManager)
    at org.apache.uniffle.server.ShuffleTaskManager.cacheShuffleData(ShuffleTaskManager.java:301)
    at org.apache.uniffle.server.ShuffleServerGrpcService.sendShuffleData(ShuffleServerGrpcService.java:267)
    at org.apache.uniffle.proto.ShuffleServerGrpc$MethodHandlers.invoke(ShuffleServerGrpc.java:1114)
    at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
    at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
    at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
    at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:352)
    at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:866)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run([email protected]/Thread.java:829)
```

From the jstack, all the grpc threads end up waiting, directly or indirectly, for the app's reentrant read/write lock: `Grpc-886` holds the `ShuffleBufferManager` monitor while parked on the read lock in `flushBuffer`, and `Grpc-6` is blocked on that same monitor in `requireMemory`. The read/write lock itself was acquired by the clear thread, like this.
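
To make the lock chain in the trace easier to follow, here is a minimal, self-contained sketch that reproduces the same two thread states. It is not Uniffle code: `LockChainDemo`, `appLock`, `cacheAndFlush`, and the thread names are hypothetical stand-ins, and the "clear" thread just holds the write lock until it is told to release it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockChainDemo {
    // Hypothetical stand-in for the per-app read/write lock.
    private final ReentrantReadWriteLock appLock = new ReentrantReadWriteLock();

    // Takes the object monitor, then needs the read lock (like cacheShuffleData -> flushBuffer).
    synchronized void cacheAndFlush() {
        appLock.readLock().lock();
        try {
            // The flush would happen here.
        } finally {
            appLock.readLock().unlock();
        }
    }

    // Only needs the object monitor (like requireMemory).
    synchronized void requireMemory() { }

    // Poll until the thread reaches the expected state, or give up after 5 seconds.
    static Thread.State waitFor(Thread t, Thread.State expected) throws InterruptedException {
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (t.getState() != expected && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        return t.getState();
    }

    public static Thread.State[] run() throws InterruptedException {
        LockChainDemo m = new LockChainDemo();
        CountDownLatch writeHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // "clear" thread: holds the write lock until released.
        Thread clear = new Thread(() -> {
            m.appLock.writeLock().lock();
            try {
                writeHeld.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                m.appLock.writeLock().unlock();
            }
        }, "clear");
        clear.start();
        writeHeld.await();

        // First grpc-like thread: owns the monitor, parks on the read lock.
        Thread sender = new Thread(m::cacheAndFlush, "Grpc-send");
        sender.start();
        Thread.State senderState = waitFor(sender, Thread.State.WAITING);

        // Second grpc-like thread: cannot even enter the monitor.
        Thread requester = new Thread(m::requireMemory, "Grpc-require");
        requester.start();
        Thread.State requesterState = waitFor(requester, Thread.State.BLOCKED);

        // Once the write lock is released, the whole chain drains.
        release.countDown();
        clear.join();
        sender.join();
        requester.join();
        return new Thread.State[] {senderState, requesterState};
    }

    public static void main(String[] args) throws InterruptedException {
        Thread.State[] s = run();
        System.out.println("sender=" + s[0] + " requester=" + s[1]);
    }
}
```

In this sketch one long-held write lock is enough to stall every caller of a `synchronized` method on the same object, which matches the `WAITING (parking)` and `BLOCKED (on object monitor)` pair seen in the dump.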
