codenohup commented on PR #3475: URL: https://github.com/apache/celeborn/pull/3475#issuecomment-3281845617
Hi @SteNicholas, thanks for your contribution. This is a serious bug, but I think the fix has some issues. As far as I understand, Flink sizes its shuffle network memory based on the number of channels. The formula is documented at https://nightlies.apache.org/flink/flink-docs-release-2.1/docs/deployment/memory/network_mem_tuning/#network-buffer-lifecycle. The default number of buffers per channel is 2, presumably because a channel needs one buffer to read data while the deserializer holds another. By that logic, a channel occupies 2 * 32KB = 64KB of network memory by default, which is much smaller than the 5MB you set.

I think we should base the formula on Celeborn's actual implementation and adjust it accordingly. For example, Celeborn's `ResultPartition` caches a buffer in the `BufferPacker` when sending data, while Flink's Netty shuffle does not, so the network memory used for output data can differ between the two.
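
To make the size gap concrete, here is a minimal sketch of the arithmetic. It assumes the 5MB value is compared per channel, as in the comment above, and that a buffer is 32KB. `ChannelMemorySketch` and `perChannelBytes` are hypothetical names for illustration only, not Flink or Celeborn APIs.

```java
// Hypothetical helper for illustration; not part of Flink or Celeborn.
final class ChannelMemorySketch {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;

    // Memory one channel occupies: number of buffers times the size of each buffer.
    static long perChannelBytes(int buffersPerChannel, long bufferSizeBytes) {
        return buffersPerChannel * bufferSizeBytes;
    }

    public static void main(String[] args) {
        // Defaults quoted in the comment: 2 buffers per channel, 32KB per buffer.
        long flinkDefault = perChannelBytes(2, 32 * KB);
        // Value set in the PR, assumed here to apply per channel.
        long proposed = 5 * MB;

        System.out.println("Flink default per channel (bytes): " + flinkDefault);
        System.out.println("Proposed value / Flink default: " + (proposed / flinkDefault));
    }
}
```

Under these assumptions the proposed value is 80 times the default per-channel footprint. An extra cached buffer, such as the one in `BufferPacker`, would add to the default by one buffer size, not by megabytes.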