reswqa commented on PR #21209: URL: https://github.com/apache/flink/pull/21209#issuecomment-1303943003
@xintongsong Thanks for the review.

> I tried the fix without the questioned hotfix commit. Unfortunately, it doesn't work.

As you said, there's a bug where we release buffers that should survive. So even if the third commit sets the `releaseBufferRatio` to 0, buffers that have been spilled will still be released, and the test may still fail.

> I don't quite get the differences between calculating the survive vs. release numbers. What's the purpose?

The change lies in the algorithm for computing surviving buffers. The previous algorithm made the number of actually released buffers smaller than the expected value, so the release operation did not free enough space in time. As a result, subsequent buffer requests frequently hit the threshold of `onMemoryUsageChanged`, causing a lot of overhead.

For the full spilling strategy, we set a ratio (0.4 by default) and expect that triggering the release drops 40% of the memory space, but the previous algorithm did not meet this requirement. We have a precondition that only buffers in the `spill` state can be released. This makes it easy for the total number of releasable buffers (e.g. `buffersInOrder.size()`) to be smaller than the expected number of surviving buffers, in which case no buffer is released at all.

The new version first calculates the number of buffers to be released: if `expectedSubpartitionReleaseNum` is not 0 and the current subpartition has buffers that can be released, then some buffers must be released. So I changed the algorithm here.
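To illustrate the difference, here is a minimal sketch (method and constant names are illustrative, not Flink's actual code) contrasting the two ways of deciding how many buffers a subpartition releases. The old way derives the release count from an expected number of survivors; when few buffers are in the `spill` state, the subtraction bottoms out at zero and nothing is released. The new way computes the release count directly and caps it by the number of releasable buffers.

```java
// Hypothetical sketch of the two release-count calculations described above.
// Assumes a subpartition with `totalBuffers` buffers, of which only
// `spilledBuffers` are in the spill state and thus releasable.
public class ReleaseCountSketch {
    static final double RELEASE_BUFFER_RATIO = 0.4;

    // Old algorithm: compute how many buffers should SURVIVE, then release
    // the rest of the releasable ones. If the releasable (spilled) count is
    // smaller than the expected survivor count, nothing gets released.
    static int releasedByOldAlgorithm(int totalBuffers, int spilledBuffers) {
        int expectedSurviveNum = (int) (totalBuffers * (1 - RELEASE_BUFFER_RATIO));
        return Math.max(0, spilledBuffers - expectedSurviveNum);
    }

    // New algorithm: compute how many buffers to RELEASE directly; as long as
    // this is non-zero and releasable buffers exist, something is released.
    static int releasedByNewAlgorithm(int totalBuffers, int spilledBuffers) {
        int expectedReleaseNum = (int) (totalBuffers * RELEASE_BUFFER_RATIO);
        return Math.min(expectedReleaseNum, spilledBuffers);
    }

    public static void main(String[] args) {
        // 100 buffers total, only 30 currently spilled (releasable).
        // Old: max(0, 30 - 60) = 0 buffers released -> memory is not freed.
        // New: min(40, 30) = 30 buffers released.
        System.out.println(releasedByOldAlgorithm(100, 30)); // prints 0
        System.out.println(releasedByNewAlgorithm(100, 30)); // prints 30
    }
}
```

With 100 buffers and only 30 spilled, the old calculation releases nothing while the new one frees all 30 releasable buffers, which is why the memory threshold kept being hit under the old scheme.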
