[
https://issues.apache.org/jira/browse/FLINK-17295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253275#comment-17253275
]
Yangze Guo commented on FLINK-17295:
------------------------------------
Hi there. Since 1.12 has been released, I'd like to revive this ticket.
Originally, this ticket proposed composing the ExecutionAttemptID of
(ExecutionVertexID, attemptNumber) to improve log readability.
In FLINK-19264, we found that this change broke the assumption that
ExecutionAttemptIDs are unique, because VertexIDs collide across graphs with
the same topology. We then decided to add the JobID to it.
However, in FLINK-19805, we found that it still has some bad cases.
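To make the FLINK-19264 collision concrete, here is a minimal sketch. The
AttemptId class and its String/int fields are hypothetical stand-ins, not the
actual Flink classes; the only point is that an ID built purely from
topology-derived parts is equal for two jobs with the same topology, and that
adding the JobID separates them.

```java
import java.util.Objects;

public class AttemptIdCollision {

    // Hypothetical stand-in for ExecutionAttemptID; the field layout is an assumption.
    static final class AttemptId {
        final String jobId;      // null models the variant without a JobID
        final String vertexId;   // assumed identical for graphs with the same topology
        final int subtaskIndex;
        final int attemptNumber;

        AttemptId(String jobId, String vertexId, int subtaskIndex, int attemptNumber) {
            this.jobId = jobId;
            this.vertexId = vertexId;
            this.subtaskIndex = subtaskIndex;
            this.attemptNumber = attemptNumber;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof AttemptId)) {
                return false;
            }
            AttemptId other = (AttemptId) o;
            return Objects.equals(jobId, other.jobId)
                    && vertexId.equals(other.vertexId)
                    && subtaskIndex == other.subtaskIndex
                    && attemptNumber == other.attemptNumber;
        }

        @Override
        public int hashCode() {
            return Objects.hash(jobId, vertexId, subtaskIndex, attemptNumber);
        }
    }

    public static void main(String[] args) {
        // Two different jobs with the same topology, IDs built without a JobID.
        AttemptId a = new AttemptId(null, "vertex-abc", 0, 0);
        AttemptId b = new AttemptId(null, "vertex-abc", 0, 0);
        System.out.println("without JobID, equal: " + a.equals(b)); // true

        // Same two jobs, with the JobID included.
        AttemptId c = new AttemptId("job-1", "vertex-abc", 0, 0);
        AttemptId d = new AttemptId("job-2", "vertex-abc", 0, 0);
        System.out.println("with JobID, equal: " + c.equals(d)); // false
    }
}
```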
To solve the problem in FLINK-19805, we can:
- Introduce a field to identify the leader session, or ensure the attempt
number is monotonically increasing across sessions.
- Introduce a truly random element. This seems to be the safest way to prevent
other rare cases.
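The two options above could be sketched roughly as follows. This is only an
illustration under assumptions: the AttemptId class, the "discriminator"
field, and the leaderSessionEpoch parameter are hypothetical names, and I am
assuming the bad case is the same (JobID, vertex, attemptNumber) tuple
reappearing in a later leader session.

```java
import java.security.SecureRandom;
import java.util.Objects;

public class AttemptIdOptions {

    // Hypothetical ID: (jobId, vertexId, attemptNumber) plus one extra discriminator.
    static final class AttemptId {
        final String jobId;
        final String vertexId;
        final int attemptNumber;
        final long discriminator;

        AttemptId(String jobId, String vertexId, int attemptNumber, long discriminator) {
            this.jobId = jobId;
            this.vertexId = vertexId;
            this.attemptNumber = attemptNumber;
            this.discriminator = discriminator;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof AttemptId)) {
                return false;
            }
            AttemptId other = (AttemptId) o;
            return jobId.equals(other.jobId)
                    && vertexId.equals(other.vertexId)
                    && attemptNumber == other.attemptNumber
                    && discriminator == other.discriminator;
        }

        @Override
        public int hashCode() {
            return Objects.hash(jobId, vertexId, attemptNumber, discriminator);
        }
    }

    // Option 1: the discriminator identifies the leader session (assumed to be a
    // counter that is persisted externally and bumped on every leader change).
    static AttemptId withSession(String jobId, String vertexId, int attempt, long leaderSessionEpoch) {
        return new AttemptId(jobId, vertexId, attempt, leaderSessionEpoch);
    }

    // Option 2: the discriminator is 16 random bits, i.e. a value in [0, 65536).
    static AttemptId withRandom(String jobId, String vertexId, int attempt, SecureRandom rnd) {
        return new AttemptId(jobId, vertexId, attempt, rnd.nextInt(1 << 16));
    }

    public static void main(String[] args) {
        // Attempt 0 of the same vertex in two different leader sessions.
        AttemptId first = withSession("job-1", "vertex-abc", 0, 1L);
        AttemptId second = withSession("job-1", "vertex-abc", 0, 2L);
        System.out.println("sessions differ, equal: " + first.equals(second)); // false

        AttemptId random = withRandom("job-1", "vertex-abc", 0, new SecureRandom());
        System.out.println("random discriminator: " + random.discriminator);
    }
}
```

Note that in this sketch option 1 is deterministic as long as the epoch really
changes on every leader switch, while 16 random bits only make a collision
unlikely (1 in 65536 for a given tuple), not impossible.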
Considering the serialization overhead, an attempt counter (stored in
ZK/ConfigMap) might be a better choice. Adding a truly random element (16 bits)
increased the TDD size by ~25% in my experiment (WordCount with parallelism
3000). However, we can't guarantee that no new bad cases will appear in the
future. If the increase in TDD size is affordable, I lean toward introducing a
truly random element.
WDYT?
> Refactor the ExecutionAttemptID to consist of ExecutionVertexID and
> attemptNumber
> ---------------------------------------------------------------------------------
>
> Key: FLINK-17295
> URL: https://issues.apache.org/jira/browse/FLINK-17295
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Reporter: Yangze Guo
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.13.0
>
>
> Make the ExecutionAttemptID be composed of (ExecutionVertexID,
> attemptNumber).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)