codenohup commented on PR #3475:
URL: https://github.com/apache/celeborn/pull/3475#issuecomment-3281845617

   Hi, @SteNicholas .
   Thanks for your contribution. This is a serious bug.
   
   However, I think your fix has some issues.
   
   My understanding is that Flink Shuffle's network memory is calculated from 
the number of channels. The formula is documented at 
https://nightlies.apache.org/flink/flink-docs-release-2.1/docs/deployment/memory/network_mem_tuning/#network-buffer-lifecycle.
 
   
   The default number of buffers per channel is 2, presumably because a channel 
needs one buffer to read data and the deserializer holds on to another, so a 
channel uses two buffers by default. By this logic, the default network memory 
occupied by a channel is 2 * 32KB = 64KB, which is much smaller than the 5MB 
you set.
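
   To make the arithmetic concrete, here is a rough sketch. The class and 
method names are made up for illustration; the constants are Flink's default 
segment size (32KB) and buffers per channel (2), and it only covers the 
exclusive per-channel buffers, not floating buffers or anything 
Celeborn-specific:

```java
public class ChannelBufferEstimate {
    // Flink defaults: taskmanager.memory.segment-size = 32kb,
    // taskmanager.network.memory.buffers-per-channel = 2.
    static final long SEGMENT_SIZE_BYTES = 32L * 1024;
    static final int BUFFERS_PER_CHANNEL = 2;

    // Exclusive network memory for a number of channels (hypothetical helper,
    // not part of Flink or Celeborn).
    static long exclusiveBytes(int numChannels, int buffersPerChannel, long segmentSizeBytes) {
        return (long) numChannels * buffersPerChannel * segmentSizeBytes;
    }

    public static void main(String[] args) {
        long perChannel = exclusiveBytes(1, BUFFERS_PER_CHANNEL, SEGMENT_SIZE_BYTES);
        long fixed = 5L * 1024 * 1024; // the 5MB value from the PR

        System.out.println("per channel: " + perChannel / 1024 + " KB");
        System.out.println("5MB / per channel: " + fixed / perChannel);
    }
}
```

   With the defaults, one channel accounts for 64KB, so 5MB is about 80 times 
the per-channel figure.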
   
   I think we should take the Celeborn implementation into account and adjust 
the calculation formula accordingly. For example, Celeborn's `ResultPartition` 
caches a buffer in the `BufferPacker` when sending data, while the Flink Netty 
Shuffle does not. As a result, the two can use different amounts of network 
memory for output data.


