[
https://issues.apache.org/jira/browse/FLINK-31293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weijie Guo updated FLINK-31293:
-------------------------------
Description:
In our TPC-DS test, we found that under fierce competition for network
memory, some tasks may hang forever.
From the thread dump information, we can see that the task is waiting for the
{{LocalBufferPool}} to become available. This is strange, because other tasks
have already finished and released their network memory. This is clearly
unexpected behavior, which implies that there must be a bug in the
{{LocalBufferPool}} or {{NetworkBufferPool}}.
!image-2023-03-02-12-23-50-572.png|width=650,height=153!
By dumping the heap memory, we found something odd: there are available
buffers in the {{LocalBufferPool}}, yet the pool is considered unavailable.
Another thing to note is that the pool currently holds an overdraft buffer.
!image-2023-03-02-12-28-48-437.png|width=520,height=200!
!image-2023-03-02-12-29-03-003.png|width=438,height=84!
TL;DR: This problem is caused by a multi-thread race related to the
introduction of the overdraft buffer.
Suppose we have two threads, called A and B. For simplicity,
{{LocalBufferPool}} is called {{LocalPool}} and {{NetworkBufferPool}} is called
{{GlobalPool}}.
Thread A continuously requests buffers from the {{LocalPool}} in blocking mode.
Thread B continuously returns buffers to the {{GlobalPool}}.
# Thread A takes the last available buffer of {{LocalPool}}. {{GlobalPool}}
has no buffer at this time, so {{LocalPool}} registers a callback function
with {{GlobalPool}}.
# Thread B returns one buffer to {{GlobalPool}}, but has not yet triggered
the callback.
# Thread A requests another buffer. Because the {{availableMemorySegments}} of
{{LocalPool}} is empty, it requests an overdraft buffer instead. Since
{{GlobalPool}} now holds the buffer returned in step 2, the request succeeds.
# Thread B triggers the callback. Since there is no buffer in {{GlobalPool}}
now, the callback is re-registered.
# Thread A requests another buffer. Because there is no buffer in
{{GlobalPool}}, it blocks on {{CompletableFuture#get}}.
# Thread B returns another buffer and triggers the recently registered
callback. As a result, {{LocalPool}} puts the buffer into
{{availableMemorySegments}}. Unfortunately, the current logic of the
{{shouldBeAvailable}} method is: if there is an overdraft buffer, {{LocalPool}}
is considered unavailable.
# Thread A blocks forever.
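The interleaving above can be replayed deterministically on a single thread. The following is only a minimal sketch under stated assumptions, not Flink's actual implementation: the classes {{GlobalPool}} and {{LocalPool}} and the methods {{tryTake}}, {{returnBuffer}}, {{fireCallback}}, {{refillOrRegister}}, {{request}} and {{simulate}} are hypothetical stand-ins, and {{shouldBeAvailable}} encodes only the rule described in step 6.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical toy model: names mirror the description, not Flink's real classes.
final class OverdraftRaceSketch {

    /** Toy GlobalPool: a buffer counter plus at most one registered callback. */
    static final class GlobalPool {
        int buffers = 0;
        Runnable callback = null;

        boolean tryTake() {
            if (buffers == 0) {
                return false;
            }
            buffers--;
            return true;
        }

        // Returning a buffer and firing the callback are separate calls so the
        // window between them (steps 2 to 4) can be replayed explicitly.
        void returnBuffer() {
            buffers++;
        }

        void fireCallback() {
            Runnable registered = callback;
            callback = null;
            if (registered != null) {
                registered.run();
            }
        }
    }

    /** Toy LocalPool using the availability rule described in step 6. */
    static final class LocalPool {
        final GlobalPool global;
        final Deque<Object> availableMemorySegments = new ArrayDeque<>();
        int overdraftBuffers = 0;

        LocalPool(GlobalPool global, int initialSegments) {
            this.global = global;
            for (int i = 0; i < initialSegments; i++) {
                availableMemorySegments.add(new Object());
            }
        }

        // Assumed rule: holding any overdraft buffer makes the pool unavailable.
        boolean shouldBeAvailable() {
            return !availableMemorySegments.isEmpty() && overdraftBuffers == 0;
        }

        // Refill from GlobalPool, or (re-)register the callback if it is empty.
        void refillOrRegister() {
            if (global.tryTake()) {
                availableMemorySegments.add(new Object());
            } else {
                global.callback = this::refillOrRegister;
            }
        }

        /** Returns false where the real caller would block waiting for availability. */
        boolean request() {
            if (!availableMemorySegments.isEmpty()) {
                availableMemorySegments.poll();
                if (availableMemorySegments.isEmpty()) {
                    refillOrRegister();
                }
                return true;
            }
            if (global.tryTake()) { // overdraft path
                overdraftBuffers++;
                return true;
            }
            return false;
        }
    }

    /** Replays steps 1 to 6 in order and returns the resulting LocalPool. */
    static LocalPool simulate() {
        GlobalPool global = new GlobalPool();
        LocalPool local = new LocalPool(global, 1);

        local.request();        // 1: A takes the last buffer, callback is registered
        global.returnBuffer();  // 2: B returns a buffer, callback not fired yet
        local.request();        // 3: A obtains that buffer as an overdraft buffer
        global.fireCallback();  // 4: callback finds GlobalPool empty, re-registers
        if (local.request()) {  // 5: nothing left, so A would block here
            throw new IllegalStateException("step 5 should not obtain a buffer");
        }
        global.returnBuffer();  // 6: B returns another buffer ...
        global.fireCallback();  //    ... which lands in availableMemorySegments
        return local;
    }

    public static void main(String[] args) {
        LocalPool local = simulate();
        System.out.println("available segments: " + local.availableMemorySegments.size());
        System.out.println("overdraft buffers:  " + local.overdraftBuffers);
        System.out.println("shouldBeAvailable:  " + local.shouldBeAvailable());
    }
}
```

In this model the final state matches the heap dump: {{availableMemorySegments}} is non-empty, one overdraft buffer is held, and {{shouldBeAvailable}} still reports the pool as unavailable, so a caller waiting on availability (step 7) is never woken up.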
> Request memory segment from LocalBufferPool may hanging forever.
> ----------------------------------------------------------------
>
> Key: FLINK-31293
> URL: https://issues.apache.org/jira/browse/FLINK-31293
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.17.0
> Reporter: Weijie Guo
> Priority: Major
> Attachments: image-2023-03-02-12-23-50-572.png,
> image-2023-03-02-12-28-48-437.png, image-2023-03-02-12-29-03-003.png
>