[ 
https://issues.apache.org/jira/browse/FLINK-23466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405844#comment-17405844
 ] 

Yingjie Cao commented on FLINK-23466:
-------------------------------------

Previously, the buffer listener will be removed from the listener queue when 
notified and then it will be added to the listener queue again if it needs more 
buffers. However, if some buffers are recycled meanwhile, the buffer listener 
will not be notified of the available buffers. For example:

    

    1. Thread 1 calls LocalBufferPool#recycle().

    2. Thread 1 reaches LocalBufferPool#fireBufferAvailableNotification() and 
listener.notifyBufferAvailable() is invoked, but Thread 1 sleeps before 
acquiring the lock to registeredListeners.add(listener).

    3. Thread 2 is being woken up as a result of notifyBufferAvailable() call. 
It takes the buffer, but it needs more buffers.

    4. Other threads, return all buffers, including this one that has been 
recycled. None are taken. Are all in the LocalBufferPool.

    5. Thread 1 wakes up, and continues fireBufferAvailableNotification() 
invocation.

    6. Thread 1 re-adds listener that's waiting for more buffer 
registeredListeners.add(listener).

    7. Thread 1 exits loop LocalBufferPool#recycle(MemorySegment, int) inside, 
as the original memory segment has been used.

    

    At the end we have a state where all buffers are in the LocalBufferPool, so 
no new recycle() calls will happen, but there is still one listener waiting for 
a buffer (despite buffers being available).

    

    This change fixes the issue by letting the buffer listener request multiple 
buffers one after another without having to enqueue BufferListener to the 
registeredListener queue.

> UnalignedCheckpointITCase hangs on Azure
> ----------------------------------------
>
>                 Key: FLINK-23466
>                 URL: https://issues.apache.org/jira/browse/FLINK-23466
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.0
>            Reporter: Dawid Wysakowicz
>            Assignee: Yingjie Cao
>            Priority: Blocker
>              Labels: pull-request-available, test-stability
>             Fix For: 1.14.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20813&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=16016



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to