Weihua Hu created FLINK-32201:
---------------------------------
Summary: Enable the distribution of shuffle descriptors via the
blob server by connection number
Key: FLINK-32201
URL: https://issues.apache.org/jira/browse/FLINK-32201
Project: Flink
Issue Type: Sub-task
Components: Runtime / Coordination
Reporter: Weihua Hu
Flink supports distributing shuffle descriptors via the blob server to reduce
JobManager overhead. However, the default size threshold for enabling this is 1MB,
which is never reached in practice. Users need to set a proper value themselves,
but doing so requires advanced knowledge.
I would like to enable this feature based on the number of connections in a group
of shuffle descriptors. For example, consider a simple streaming job with two
operators, each with a parallelism of 10,000 and connected via all-to-all
distribution. In this job, there is only one set of shuffle descriptors, and this
group has 10,000 * 10,000 connections. This means that the JobManager needs to
send this set of shuffle descriptors to 10,000 tasks.
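To make the arithmetic concrete, here is a minimal Python sketch of how the connection count of one shuffle descriptor group could be computed for an all-to-all edge. The helper name is hypothetical and chosen only for illustration; it is not taken from Flink's code.

```python
def all_to_all_connections(producer_parallelism: int, consumer_parallelism: int) -> int:
    """Connections in one all-to-all group (hypothetical helper, not a Flink API).

    Every consumer subtask reads from every producer subtask, so the count
    is simply the product of the two parallelisms.
    """
    return producer_parallelism * consumer_parallelism


# The example job: two operators, each with a parallelism of 10,000.
connections = all_to_all_connections(10_000, 10_000)
print(connections)
```

For the example job this comes to 100,000,000 connections in a single group.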
Since this is also difficult for users to configure, I would like to give it a
default value. The serialized shuffle descriptor sizes for different
parallelisms are shown below.
|| Producer parallelism || Serialized shuffle descriptor size || Consumer parallelism || Total data size that JM needs to send ||
| 5000 | 100KB | 5000 | 500MB |
| 10000 | 200KB | 10000 | 2GB |
| 20000 | 400KB | 20000 | 8GB |
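The totals in the last column can be reproduced with a short Python sketch. It assumes that the JobManager sends one full copy of the serialized descriptors to each consumer task and that sizes use decimal units (1MB = 1000KB); the function name is illustrative only.

```python
# (producer parallelism, serialized descriptor size in KB, consumer parallelism)
rows = [
    (5_000, 100, 5_000),
    (10_000, 200, 10_000),
    (20_000, 400, 20_000),
]


def total_sent_kb(descriptor_size_kb: int, consumer_parallelism: int) -> int:
    """Total KB sent, assuming one copy of the descriptors per consumer task."""
    return descriptor_size_kb * consumer_parallelism


for producers, size_kb, consumers in rows:
    total_mb = total_sent_kb(size_kb, consumers) / 1000  # decimal KB -> MB
    print(f"{producers} x {consumers}: {total_mb:.0f} MB")
```

Doubling both parallelisms doubles the descriptor size and the number of recipients, so the total grows by a factor of four from one row to the next.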
So, I would like to set the default value to 10,000 * 10,000 connections.
Any suggestions or concerns are appreciated.