[ 
https://issues.apache.org/jira/browse/FLINK-31293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weijie Guo updated FLINK-31293:
-------------------------------
    Description: 
In our TPC-DS test, we found that under heavy contention for network memory, 
some tasks may hang forever.

From the thread dump, we can see that the task is waiting for the 
{{LocalBufferPool}} to become available. This is strange, because other tasks 
have already finished and released their network memory. This is unexpected 
behavior and suggests a bug in the {{LocalBufferPool}} or 
{{NetworkBufferPool}}.

!image-2023-03-02-12-23-50-572.png|width=650,height=153!

A heap dump shows something strange: there are available buffers in the 
{{LocalBufferPool}}, but the pool is considered unavailable. Note also that it 
currently holds an overdraft buffer.

!image-2023-03-02-12-28-48-437.png|width=520,height=200!

!image-2023-03-02-12-29-03-003.png|width=438,height=84!

TL;DR: This problem is caused by a multi-threaded race related to the 
introduction of the overdraft buffer.

Suppose we have two threads, called A and B. For simplicity, 
{{LocalBufferPool}} is called {{LocalPool}} and {{NetworkBufferPool}} is called 
{{GlobalPool}}.

Thread A continuously requests buffers from the {{LocalPool}} in blocking mode.
Thread B continuously returns buffers to the {{GlobalPool}}.


1. Thread A takes the last available buffer of {{LocalPool}}. {{GlobalPool}} 
has no buffer at this time, so {{LocalPool}} registers a callback function 
with {{GlobalPool}}.
2. Thread B returns one buffer to {{GlobalPool}}, but has not yet started to 
trigger the callback.
3. Thread A requests another buffer. Because the {{availableMemorySegments}} 
of {{LocalPool}} is empty, it requests an overdraft buffer instead. Since 
there is now a buffer in {{GlobalPool}}, the request succeeds.
4. Thread B triggers the callback. Since there is no buffer left in 
{{GlobalPool}}, the callback is re-registered.
5. Thread B returns another buffer and triggers the re-registered callback. 
{{LocalPool}} puts the buffer into {{availableMemorySegments}}. However, the 
current logic of the {{shouldBeAvailable}} method is: if there is an overdraft 
buffer, {{LocalPool}} is not available. As a result, {{LocalPool}} stays 
unavailable even though it holds an available buffer, and thread A keeps 
waiting.
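
To make the interleaving above concrete, here is a minimal single-threaded 
sketch that replays the five steps against a toy model. It is not the Flink 
implementation: the classes ToyGlobalPool and ToyLocalPool, and all of their 
fields and methods, are hypothetical names invented for this illustration. 
Only the availability rule (the pool is unavailable while it holds an 
overdraft buffer) is taken from the description above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy replay of the race. NOT the real Flink classes; all names are hypothetical. */
public class OverdraftRaceSketch {

    /** Stand-in for NetworkBufferPool ("GlobalPool"). */
    static final class ToyGlobalPool {
        int freeBuffers = 0;
        final Deque<ToyLocalPool> callbacks = new ArrayDeque<>();

        boolean tryTake() {
            if (freeBuffers == 0) {
                return false;
            }
            freeBuffers--;
            return true;
        }

        /** Returns a buffer without firing the callback yet (the race window). */
        void recycle() {
            freeBuffers++;
        }

        /** Notifies the oldest registered listener, if there is one. */
        void fireCallback() {
            ToyLocalPool listener = callbacks.poll();
            if (listener != null) {
                listener.onGlobalBufferAvailable();
            }
        }
    }

    /** Stand-in for LocalBufferPool ("LocalPool"). */
    static final class ToyLocalPool {
        final ToyGlobalPool global;
        int availableSegments;
        int overdraftBuffers = 0;
        boolean available;

        ToyLocalPool(ToyGlobalPool global, int initialSegments) {
            this.global = global;
            this.availableSegments = initialSegments;
            this.available = shouldBeAvailable();
        }

        /** One request by thread A; returns true if a buffer was handed out. */
        boolean request() {
            boolean granted = false;
            if (availableSegments > 0) {
                availableSegments--;
                granted = true;
                if (availableSegments == 0) {
                    refillOrRegister();
                }
            } else if (global.tryTake()) {
                overdraftBuffers++; // taken as an overdraft buffer
                granted = true;
            }
            available = shouldBeAvailable();
            return granted;
        }

        /** Callback fired by the global pool on behalf of thread B. */
        void onGlobalBufferAvailable() {
            refillOrRegister();
            available = shouldBeAvailable();
        }

        private void refillOrRegister() {
            if (global.tryTake()) {
                availableSegments++;
            } else {
                global.callbacks.add(this); // (re-)register the callback
            }
        }

        /** The rule described in step 5: any overdraft buffer means "not available". */
        boolean shouldBeAvailable() {
            return availableSegments > 0 && overdraftBuffers == 0;
        }
    }

    /** Replays steps 1 to 5 in a fixed order and returns the resulting local pool. */
    static ToyLocalPool replay() {
        ToyGlobalPool global = new ToyGlobalPool();
        ToyLocalPool local = new ToyLocalPool(global, 1);
        local.request();       // 1. A takes the last buffer; callback registered
        global.recycle();      // 2. B returns a buffer, callback not fired yet
        local.request();       // 3. A gets that buffer as an overdraft buffer
        global.fireCallback(); // 4. B fires the callback; nothing left, re-registered
        global.recycle();      // 5. B returns another buffer ...
        global.fireCallback(); //    ... and fires the re-registered callback
        return local;
    }

    public static void main(String[] args) {
        ToyLocalPool local = replay();
        System.out.println("availableSegments=" + local.availableSegments
                + " overdraftBuffers=" + local.overdraftBuffers
                + " available=" + local.available);
    }
}
```

In this toy model the replay ends with one segment in availableSegments and 
one overdraft buffer held, while the pool is still flagged as unavailable and 
no callback remains registered. This matches the state seen in the heap dump.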

  was:
In our TPC-DS test, we found that under heavy contention for network memory, 
some tasks may hang forever.

From the thread dump, we can see that the task is waiting for the 
{{LocalBufferPool}} to become available. This is strange, because other tasks 
have already finished and released their network memory. This is unexpected 
behavior and suggests a bug in the {{LocalBufferPool}} or 
{{NetworkBufferPool}}.

!image-2023-03-02-12-23-50-572.png|width=650,height=153!

A heap dump shows something strange: there are available buffers in the 
{{LocalBufferPool}}, but the pool is considered unavailable. Note also that it 
currently holds an overdraft buffer.

!image-2023-03-02-12-28-48-437.png|width=520,height=200!

!image-2023-03-02-12-29-03-003.png|width=438,height=84!

 


> Request memory segment from LocalBufferPool may hanging forever.
> ----------------------------------------------------------------
>
>                 Key: FLINK-31293
>                 URL: https://issues.apache.org/jira/browse/FLINK-31293
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.17.0
>            Reporter: Weijie Guo
>            Priority: Major
>         Attachments: image-2023-03-02-12-23-50-572.png, 
> image-2023-03-02-12-28-48-437.png, image-2023-03-02-12-29-03-003.png



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
