[ https://issues.apache.org/jira/browse/FLINK-18203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129313#comment-17129313 ]
Jiayi Liao commented on FLINK-18203:
------------------------------------
[~yunta] You're right. To limit the overhead from serialization while
submitting tasks, we reduced the number of threads in {{jobmanager-future}} to
4 (this may hurt task-deployment performance, but we have done other things to
accelerate that process). That change alone already let us run jobs with
parallelism below 10k.
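For context, a minimal sketch of what capping that pool looks like; the creation site and thread factory below are illustrative stand-ins, not the actual Flink code:
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

public class FuturePoolSketch {
    public static void main(String[] args) {
        // Hypothetical stand-in for the creation of the "jobmanager-future"
        // executor: a fixed pool of 4 daemon threads instead of one sized
        // by the number of CPU cores.
        ScheduledExecutorService futureExecutor =
                Executors.newScheduledThreadPool(4, runnable -> {
                    Thread t = new Thread(runnable, "jobmanager-future");
                    t.setDaemon(true);
                    return t;
                });
        futureExecutor.shutdown();
    }
}
{code}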
Back to the problem we met recently: even though we improved the memory
footprint of the submission process, the sheer number of objects created in
{{repartitionUnionState}} still causes serious full GC pauses. The open
question is how much memory overhead the 10k * 10k objects actually produce.
From what I observed, each {{OperatorStreamStateHandle}} instance introduces
the overhead of the object itself plus a new HashMap. I'll dig deeper to see
exactly how much that amounts to.
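As a back-of-envelope estimate under assumed per-object sizes (typical for a 64-bit HotSpot JVM with compressed oops; the byte counts below are assumptions, not measurements):
{code:java}
public class UnionStateOverheadEstimate {
    public static void main(String[] args) {
        long m = 10_000L;            // job vertex parallelism
        long n = 10_000L;            // StreamStateHandles across all executions
        long instances = m * n;      // 1e8 OperatorStreamStateHandle instances

        // Assumed shallow sizes in bytes:
        long handleSize  = 32;       // header + two reference fields, padded
        long hashMapSize = 48;       // the fresh HashMap each instance carries
        long entrySize   = 48;       // one map entry (Node + table slot share)

        long perInstance = handleSize + hashMapSize + entrySize; // ~128 B
        long totalBytes  = instances * perInstance;

        System.out.printf("~%.1f GiB of short-lived objects (%d B each)%n",
                totalBytes / (1024.0 * 1024 * 1024), perInstance);
        // => roughly 12 GiB of garbage created during a single restore,
        //    which would be enough to explain long full GC pauses.
    }
}
{code}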
> Reduce objects usage in redistributing union states
> ---------------------------------------------------
>
> Key: FLINK-18203
> URL: https://issues.apache.org/jira/browse/FLINK-18203
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.10.1
> Reporter: Jiayi Liao
> Priority: Major
>
> {{RoundRobinOperatorStateRepartitioner#repartitionUnionState}} creates a
> new {{OperatorStreamStateHandle}} instance for every {{StreamStateHandle}}
> instance used in every execution, so the number of new
> {{OperatorStreamStateHandle}} instances can grow to m * n (job vertex
> parallelism * total count of the executions' {{StreamStateHandle}}s).
> But in fact, all executions can share the same collection, so the number of
> {{OperatorStreamStateHandle}} instances can be reduced to the count of the
> executions' {{StreamStateHandle}}s, as sketched below.
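> A minimal sketch of the difference, using simplified stand-in classes rather
> than the actual Flink types (all names below are illustrative):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> public class RedistributionSketch {
>     // Simplified stand-ins for StreamStateHandle / OperatorStreamStateHandle.
>     static final class Handle { }
>     static final class OperatorHandle {
>         final Handle delegate;
>         OperatorHandle(Handle delegate) { this.delegate = delegate; }
>     }
>
>     // Current behavior: a fresh wrapper per (subtask, handle) => m * n objects.
>     static List<List<OperatorHandle>> repartitionPerSubtask(
>             List<Handle> handles, int parallelism) {
>         List<List<OperatorHandle>> result = new ArrayList<>(parallelism);
>         for (int i = 0; i < parallelism; i++) {
>             List<OperatorHandle> copy = new ArrayList<>(handles.size());
>             for (Handle h : handles) {
>                 copy.add(new OperatorHandle(h)); // new instance every iteration
>             }
>             result.add(copy);
>         }
>         return result;
>     }
>
>     // Proposed behavior: wrap each handle once and let every subtask share
>     // the same collection => only n objects.
>     static List<List<OperatorHandle>> repartitionShared(
>             List<Handle> handles, int parallelism) {
>         List<OperatorHandle> wrapped = new ArrayList<>(handles.size());
>         for (Handle h : handles) {
>             wrapped.add(new OperatorHandle(h));
>         }
>         List<List<OperatorHandle>> result = new ArrayList<>(parallelism);
>         for (int i = 0; i < parallelism; i++) {
>             result.add(wrapped); // same list reference for every subtask
>         }
>         return result;
>     }
> }
> {code}
> Sharing is safe here because union state assigns the complete union of all
> handles to every subtask anyway.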
> I met this problem in production while testing a job with
> parallelism = 10k; the memory problem gets more serious when YARN
> containers die and the job starts failing over.