[ https://issues.apache.org/jira/browse/FLINK-31293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weijie Guo updated FLINK-31293:
-------------------------------
    Description:
In our TPC-DS test, we found that under heavy contention for network memory, some tasks may hang forever.

From the thread dump, we can see that the task is waiting for the {{LocalBufferPool}} to become available. This is strange, because other tasks have already finished and released their network memory. This is unexpected behavior and suggests a bug in {{LocalBufferPool}} or {{NetworkBufferPool}}.

!image-2023-03-02-12-23-50-572.png|width=650,height=153!

A heap dump shows something odd: there are available buffers in the {{LocalBufferPool}}, yet the pool is considered unavailable. Note also that the pool currently holds an overdraft buffer.

!image-2023-03-02-12-28-48-437.png|width=520,height=200!

!image-2023-03-02-12-29-03-003.png|width=438,height=84!

TL;DR: The problem is a multi-threaded race related to the introduction of the overdraft buffer.

Suppose we have two threads, A and B. For simplicity, {{LocalBufferPool}} is called {{LocalPool}} and {{NetworkBufferPool}} is called {{GlobalPool}}.

Thread A continuously makes blocking buffer requests to {{LocalPool}}.
Thread B continuously returns buffers to {{GlobalPool}}.

1. Thread A takes the last available buffer of {{LocalPool}}. {{GlobalPool}} has no buffer at this time, so a callback is registered with {{GlobalPool}}.
2. Thread B returns one buffer to {{GlobalPool}}, but has not yet triggered the callback.
3. Thread A requests another buffer. Because the {{availableMemorySegments}} of {{LocalPool}} is empty, it requests an overdraft buffer instead. Since {{GlobalPool}} now holds a buffer, the request succeeds.
4. Thread B triggers the callback. Since {{GlobalPool}} has no buffer left, the callback is re-registered.
5. Thread B returns another buffer and triggers the callback again. {{LocalPool}} puts the buffer into {{availableMemorySegments}}. However, the current logic of {{shouldBeAvailable}} treats {{LocalPool}} as unavailable whenever it holds an overdraft buffer, so the pool is not marked available and thread A keeps waiting even though {{availableMemorySegments}} is not empty.

> Request memory segment from LocalBufferPool may hanging forever.
> ----------------------------------------------------------------
>
>                 Key: FLINK-31293
>                 URL: https://issues.apache.org/jira/browse/FLINK-31293
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.17.0
>            Reporter: Weijie Guo
>            Priority: Major
>         Attachments: image-2023-03-02-12-23-50-572.png, image-2023-03-02-12-28-48-437.png, image-2023-03-02-12-29-03-003.png

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
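The five-step interleaving in the description can be replayed deterministically with a toy model. The following is only a sketch under stated assumptions: `OverdraftRaceSketch`, `GlobalPool`, `LocalPool`, `recycle`, `fireCallback`, `tryTake`, `requestBuffer` and `refillOrRegister` are hypothetical names invented for illustration, not Flink's actual classes or methods. The two threads are modelled as an explicit sequence of calls, and the availability rule is the one described in step 5 (any held overdraft buffer makes the pool unavailable).

```java
/**
 * Toy, single-threaded replay of the five-step interleaving from the issue.
 * All names are hypothetical stand-ins, not Flink's real classes or APIs.
 */
class OverdraftRaceSketch {

    /** Stand-in for NetworkBufferPool ("GlobalPool" in the description). */
    static final class GlobalPool {
        int freeSegments;
        Runnable pendingCallback;

        /** First half of a return: the segment becomes visible to all requesters. */
        void recycle() {
            freeSegments++;
        }

        /** Second half of a return: notify whoever registered a callback. */
        void fireCallback() {
            Runnable callback = pendingCallback;
            pendingCallback = null;
            if (callback != null) {
                callback.run();
            }
        }

        boolean tryTake() {
            if (freeSegments == 0) {
                return false;
            }
            freeSegments--;
            return true;
        }
    }

    /** Stand-in for LocalBufferPool ("LocalPool" in the description). */
    static final class LocalPool {
        final GlobalPool global;
        int availableSegments;  // models the size of availableMemorySegments
        int overdraftSegments;  // overdraft buffers currently held

        LocalPool(GlobalPool global, int availableSegments) {
            this.global = global;
            this.availableSegments = availableSegments;
        }

        /** Rule from step 5: holding any overdraft buffer means "not available". */
        boolean shouldBeAvailable() {
            return availableSegments > 0 && overdraftSegments == 0;
        }

        void requestBuffer() {
            if (availableSegments > 0) {
                availableSegments--;
                if (availableSegments == 0) {
                    refillOrRegister();
                }
            } else if (global.tryTake()) {
                overdraftSegments++;  // overdraft path, taken when the local queue is empty
            } else {
                throw new IllegalStateException("a real pool would block here");
            }
        }

        /** Take a segment from the global pool, or register to be called back later. */
        void refillOrRegister() {
            if (global.tryTake()) {
                availableSegments++;
            } else {
                global.pendingCallback = this::refillOrRegister;
            }
        }
    }

    /** Replays steps 1 to 5 in the order given in the description. */
    static LocalPool replay() {
        GlobalPool global = new GlobalPool();
        LocalPool local = new LocalPool(global, 1);

        local.requestBuffer();  // 1. A takes the last buffer; callback is registered
        global.recycle();       // 2. B returns a buffer, callback not yet triggered
        local.requestBuffer();  // 3. A gets that buffer as an overdraft buffer
        global.fireCallback();  // 4. callback finds nothing and re-registers
        global.recycle();       // 5. B returns another buffer ...
        global.fireCallback();  //    ... which lands in the local available queue
        return local;
    }

    public static void main(String[] args) {
        LocalPool local = replay();
        // Prints: availableSegments=1 overdraftSegments=1 shouldBeAvailable=false
        System.out.println("availableSegments=" + local.availableSegments
                + " overdraftSegments=" + local.overdraftSegments
                + " shouldBeAvailable=" + local.shouldBeAvailable());
    }
}
```

The final state matches the heap dump described above: one buffer sits in the available queue, one overdraft buffer is held, and the pool still reports itself as unavailable. The sketch only reproduces the state; it does not model blocking, locking, or how overdraft buffers are eventually recycled.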