[ https://issues.apache.org/jira/browse/FLINK-23466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444279#comment-17444279 ]

Yingjie Cao commented on FLINK-23466:
-------------------------------------

Nov 10 16:13:03 Starting 
org.apache.flink.test.checkpointing.UnalignedCheckpointITCase#execute[pipeline 
with mixed channels, p = 20, timeout = 0, buffersPerChannel = 1].

From the log, we can see this case hangs. This seems to be a new issue. From 
the stack, it seems there is something wrong with the checkpoint coordinator; 
the following thread locked 0x0000000087db4fb8:
{code:java}
2021-11-10T17:14:21.0899474Z Nov 10 17:14:21 "jobmanager-io-thread-2" #12984 
daemon prio=5 os_prio=0 tid=0x00007f12e000b800 nid=0x3fb6 runnable 
[0x00007f0fcd6d4000]
2021-11-10T17:14:21.0899924Z Nov 10 17:14:21    java.lang.Thread.State: RUNNABLE
2021-11-10T17:14:21.0900300Z Nov 10 17:14:21    at 
java.util.HashMap$TreeNode.balanceDeletion(HashMap.java:2338)
2021-11-10T17:14:21.0900745Z Nov 10 17:14:21    at 
java.util.HashMap$TreeNode.removeTreeNode(HashMap.java:2112)
2021-11-10T17:14:21.0901146Z Nov 10 17:14:21    at 
java.util.HashMap.removeNode(HashMap.java:840)
2021-11-10T17:14:21.0901577Z Nov 10 17:14:21    at 
java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:301)
2021-11-10T17:14:21.0902002Z Nov 10 17:14:21    at 
java.util.HashMap.putVal(HashMap.java:664)
2021-11-10T17:14:21.0902531Z Nov 10 17:14:21    at 
java.util.HashMap.putMapEntries(HashMap.java:515)
2021-11-10T17:14:21.0902931Z Nov 10 17:14:21    at 
java.util.HashMap.putAll(HashMap.java:785)
2021-11-10T17:14:21.0903429Z Nov 10 17:14:21    at 
org.apache.flink.runtime.checkpoint.ExecutionAttemptMappingProvider.getVertex(ExecutionAttemptMappingProvider.java:60)
2021-11-10T17:14:21.0904060Z Nov 10 17:14:21    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.reportStats(CheckpointCoordinator.java:1867)
2021-11-10T17:14:21.0904686Z Nov 10 17:14:21    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1152)
2021-11-10T17:14:21.0905372Z Nov 10 17:14:21    - locked <0x0000000087db4fb8> 
(a java.lang.Object)
2021-11-10T17:14:21.0905895Z Nov 10 17:14:21    at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
2021-11-10T17:14:21.0906493Z Nov 10 17:14:21    at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler$$Lambda$1368/705813936.accept(Unknown
 Source)
2021-11-10T17:14:21.0907086Z Nov 10 17:14:21    at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
2021-11-10T17:14:21.0907698Z Nov 10 17:14:21    at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler$$Lambda$1369/1447418658.run(Unknown
 Source)
2021-11-10T17:14:21.0908210Z Nov 10 17:14:21    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2021-11-10T17:14:21.0908735Z Nov 10 17:14:21    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2021-11-10T17:14:21.0909333Z Nov 10 17:14:21    at 
java.lang.Thread.run(Thread.java:748) {code}
Other threads are waiting for this lock. I am not familiar with this logic 
and am not sure whether this is the expected state. Could someone who is 
familiar with this code take a look?
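As a general pattern (a hypothetical sketch with made-up names, not the actual CheckpointCoordinator code), doing only the cheap state transition under the shared lock and running the expensive stats/bookkeeping outside of it keeps other acknowledging threads from piling up behind the lock, as seen in the dump above:
{code:java}
// Hypothetical sketch: minimize the critical section so that one slow
// acknowledge (e.g. an expensive stats update) does not block the others.
public class CoordinatorLockSketch {
    private final Object coordinatorLock = new Object();
    private long acknowledgedCount = 0;

    public void receiveAcknowledge(Runnable expensiveStatsUpdate) {
        synchronized (coordinatorLock) {
            acknowledgedCount++;          // cheap state update under the lock
        }
        expensiveStatsUpdate.run();       // heavy work runs outside the lock
    }

    public long getAcknowledgedCount() {
        synchronized (coordinatorLock) {
            return acknowledgedCount;
        }
    }
}
{code}
Whether the real reportStats call can safely be moved outside the coordinator lock is exactly the question for someone familiar with that code.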

 

BTW, concurrent access to a HashMap may cause an infinite loop. I see in the 
stack that multiple threads are accessing a HashMap, though I am not sure 
whether they are the same instance.
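For reference, if those threads do share one map instance, switching to a ConcurrentHashMap avoids the corruption/infinite-loop hazard of a plain HashMap (a generic sketch, not tied to the specific Flink classes above):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Generic sketch: a plain HashMap mutated by several threads can corrupt
// its internal bucket/tree structure (and even spin forever during a
// resize or tree rebalance). ConcurrentHashMap tolerates concurrent puts.
public class ConcurrentMapDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Integer, Integer> map = new ConcurrentHashMap<>();
        int threads = 4;
        int perThread = 10_000;
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;  // distinct key range per thread
            new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    map.put(base + i, i);
                }
                done.countDown();
            }).start();
        }
        done.await();
        System.out.println(map.size()); // 40000: no lost or corrupted entries
    }
}
{code}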

> UnalignedCheckpointITCase hangs on Azure
> ----------------------------------------
>
>                 Key: FLINK-23466
>                 URL: https://issues.apache.org/jira/browse/FLINK-23466
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.0
>            Reporter: Dawid Wysakowicz
>            Priority: Blocker
>              Labels: pull-request-available, test-stability
>             Fix For: 1.14.1
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20813&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=16016
> The problem is the buffer listener will be removed from the listener queue 
> when notified and then it will be added to the listener queue again if it 
> needs more buffers. However, if some buffers are recycled meanwhile, the 
> buffer listener will not be notified of the available buffers. For example:
>     1. Thread 1 calls LocalBufferPool#recycle().
>     2. Thread 1 reaches LocalBufferPool#fireBufferAvailableNotification() and 
> listener.notifyBufferAvailable() is invoked, but Thread 1 sleeps before 
> acquiring the lock to registeredListeners.add(listener).
>     3. Thread 2 is being woken up as a result of notifyBufferAvailable() 
> call. It takes the buffer, but it needs more buffers.
>     4. Other threads return all buffers, including the one that has been 
> recycled. None are taken; all of them are now back in the LocalBufferPool.
>     5. Thread 1 wakes up, and continues fireBufferAvailableNotification() 
> invocation.
>     6. Thread 1 re-adds listener that's waiting for more buffer 
> registeredListeners.add(listener).
>     7. Thread 1 exits the loop inside LocalBufferPool#recycle(MemorySegment, 
> int), as the original memory segment has been used.
> At the end we have a state where all buffers are in the LocalBufferPool, so 
> no new recycle() calls will happen, but there is still one listener waiting 
> for a buffer (despite buffers being available).
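A minimal sketch of the fix direction (hypothetical names and a simplified pool, not the actual LocalBufferPool code): when a listener that still needs buffers re-registers itself, it should first re-check, under the same lock, for any buffer recycled in the meantime, so the window described in steps 2-6 above cannot lose a notification:
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified hypothetical pool illustrating the race described above.
// The key idea: re-checking availability and re-registering the listener
// happen atomically under one lock, so a buffer recycled between the
// notification and the re-registration cannot be missed.
public class ListenerPoolSketch {
    interface BufferListener { void notifyBufferAvailable(Object buffer); }

    private final Object lock = new Object();
    private final Deque<Object> availableBuffers = new ArrayDeque<>();
    private final Deque<BufferListener> registeredListeners = new ArrayDeque<>();

    void recycle(Object buffer) {
        BufferListener listener;
        synchronized (lock) {
            listener = registeredListeners.poll();
            if (listener == null) {
                availableBuffers.add(buffer);  // nobody waiting: keep buffer
                return;
            }
        }
        listener.notifyBufferAvailable(buffer); // notify outside the lock
    }

    // Called by a listener that still needs buffers: instead of blindly
    // re-adding itself, it first drains any buffer recycled meanwhile.
    Object registerOrTake(BufferListener listener) {
        synchronized (lock) {
            Object buffer = availableBuffers.poll();
            if (buffer != null) {
                return buffer;                  // arrived in the gap: no wait
            }
            registeredListeners.add(listener);  // nothing available: wait
            return null;
        }
    }
}
{code}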



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
