[
https://issues.apache.org/jira/browse/FLINK-33856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803201#comment-17803201
]
Piotr Nowojski edited comment on FLINK-33856 at 1/4/24 3:00 PM:
----------------------------------------------------------------
Hi, I second that implementing this as metrics doesn't sound right.
[~hejufang001], I wouldn't make this a subtask of FLIP-384 but, if needed,
a follow-up. There are two things worth noting/discussing:
* Please check the discussion about FLIP-384 on the dev mailing list regarding
the current limitations. Namely, we currently create a trace with only a
single span for the whole checkpoint, and that span is very sparsely
populated with metrics. There were discussions/plans (CC [~fanrui], if I
remember correctly you wanted to follow up on this?) about creating child
spans for each subtask/task, to mimic the existing `CheckpointingMetrics`
structure. This FLIP probably requires that change.
* Once we have per-subtask spans, or aggregated metrics as in [the recovery
spans from
FLIP-386|https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans],
we might not need some of the metrics that you are proposing here? For
example, `writeRate` should be easy to compute from the checkpointed state
size and the async duration?
Anyway, I think a FLIP will be required here.
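
To illustrate the `writeRate` point, here is a minimal sketch of how a consumer
of the span data could derive the rate itself. It assumes the span (or subtask
metrics) exposes the checkpointed state size in bytes and the async duration in
milliseconds. `WriteRateSketch` and `bytesPerSecond` are hypothetical names
used only for illustration; they are not part of Flink.

```java
/** Hypothetical helper, not part of Flink: derives a write rate from two reported values. */
final class WriteRateSketch {
    private WriteRateSketch() {}

    /**
     * Returns the average write rate in bytes per second.
     * Assumption: a non-positive duration is reported as a rate of 0.0.
     */
    static double bytesPerSecond(long stateSizeBytes, long asyncDurationMillis) {
        if (asyncDurationMillis <= 0) {
            return 0.0;
        }
        // Convert milliseconds to seconds while dividing.
        return stateSizeBytes * 1000.0 / asyncDurationMillis;
    }

    public static void main(String[] args) {
        // Example: 64 MiB of state written during a 2-second async phase.
        System.out.println(bytesPerSecond(64L * 1024 * 1024, 2_000L));
    }
}
```

Note that this is only an average over the whole async phase, so it would not
separate time spent writing from time spent closing the file.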
> Add metrics to monitor the interaction performance between task and external
> storage system in the process of checkpoint making
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-33856
> URL: https://issues.apache.org/jira/browse/FLINK-33856
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.18.0
> Reporter: Jufang He
> Assignee: Jufang He
> Priority: Major
> Labels: pull-request-available
>
> When Flink takes a checkpoint, the performance of its interaction with the
> external file system has a large impact on the overall checkpoint duration.
> Adding performance metrics at the point where the task interacts with the
> external file storage system would therefore make bottlenecks easier to
> spot. These metrics include: the file write rate, the latency of writing the
> file, and the latency of closing the file.
> Adding these metrics on the Flink side has the following advantages: it is
> convenient for collecting end-to-end timing statistics for different tasks,
> and there is no need to distinguish between types of external storage
> system, because the metrics can be unified in the FsCheckpointStreamFactory.
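
The three indicators listed in the description (write rate, write latency,
close latency) could be gathered by wrapping the output stream, roughly as
sketched below. This uses only plain `java.io` and is not the code from the
pull request. `TimedOutputStream` and its accessors are hypothetical names,
and how such a wrapper would be wired into `FsCheckpointStreamFactory` is left
open.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical wrapper, not part of Flink: times writes and close on a delegate stream. */
final class TimedOutputStream extends OutputStream {
    private final OutputStream delegate;
    private long bytesWritten;
    private long writeNanos;
    private long closeNanos;

    TimedOutputStream(OutputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(int b) throws IOException {
        long start = System.nanoTime();
        try {
            delegate.write(b);
        } finally {
            writeNanos += System.nanoTime() - start;
        }
        bytesWritten++;
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        long start = System.nanoTime();
        try {
            delegate.write(buf, off, len);
        } finally {
            writeNanos += System.nanoTime() - start;
        }
        bytesWritten += len;
    }

    @Override
    public void close() throws IOException {
        long start = System.nanoTime();
        try {
            delegate.close();
        } finally {
            closeNanos += System.nanoTime() - start;
        }
    }

    long getBytesWritten() { return bytesWritten; }

    long getWriteNanos() { return writeNanos; }

    long getCloseNanos() { return closeNanos; }

    public static void main(String[] args) throws IOException {
        TimedOutputStream out = new TimedOutputStream(new ByteArrayOutputStream());
        out.write(new byte[] {1, 2, 3, 4});
        out.close();
        System.out.println(out.getBytesWritten() + " bytes, write ns=" + out.getWriteNanos()
                + ", close ns=" + out.getCloseNanos());
    }
}
```

Because the wrapper only sees the generic `OutputStream` interface, it does not
need to know which storage system sits behind the delegate.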