[ https://issues.apache.org/jira/browse/FLINK-33856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803459#comment-17803459 ]
Jufang He edited comment on FLINK-33856 at 1/5/24 8:39 AM:
-----------------------------------------------------------
[~pnowojski] Thanks for your advice.
It seems that we need child spans for each subtask/task, so that we can
collect more detailed task-level statistics and more easily locate the
bottleneck in checkpoint creation, for example syncDuration, async duration,
the latency to write a file, and the latency to close a file. In that case
'writeRate' is no longer needed.
IMO, I would prefer to report metrics separately for each TM. Our production
environment has a large number of TMs and subtasks, and if the changelog
checkpoint is enabled, checkpoints may be frequent. I am worried that
aggregating a large amount of data on the JM may cause performance problems.
Maybe a new FLIP that supports a task-level trace reporter could be created?
I’m willing to participate in the development.
> Add metrics to monitor the interaction performance between task and external
> storage system in the process of checkpoint making
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-33856
> URL: https://issues.apache.org/jira/browse/FLINK-33856
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.18.0
> Reporter: Jufang He
> Assignee: Jufang He
> Priority: Major
> Labels: pull-request-available
>
> When Flink makes a checkpoint, the performance of its interaction with the
> external file system has a great impact on the overall checkpoint duration.
> Adding performance metrics at the point where the task interacts with the
> external file storage system would therefore make the bottleneck easier to
> observe. These metrics include: the file write rate, the latency to write
> the file, and the latency to close the file.
> Adding these metrics on the Flink side has the following advantages: it is
> convenient to collect the E2E time of different tasks, and there is no need
> to distinguish the type of external storage system, because the metrics can
> be unified in the FsCheckpointStreamFactory.
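
As a rough illustration of the three metrics named in the description (write rate, write latency, close latency), here is a minimal sketch of a timing wrapper around a plain java.io.OutputStream. The class TimedOutputStream and its getters are hypothetical names invented for this example. They are not Flink classes, and the actual change in FsCheckpointStreamFactory may look quite different.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical wrapper (not a Flink class): times write() and close() on any stream. */
class TimedOutputStream extends FilterOutputStream {
    private long bytesWritten; // total bytes passed to the wrapped stream
    private long writeNanos;   // accumulated time spent inside write calls
    private long closeNanos;   // time spent inside the first close call
    private boolean closed;

    TimedOutputStream(OutputStream delegate) {
        super(delegate);
    }

    @Override
    public void write(int b) throws IOException {
        long start = System.nanoTime();
        out.write(b);
        writeNanos += System.nanoTime() - start;
        bytesWritten++;
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        // Forward the whole buffer in one call so the timing covers one bulk write.
        long start = System.nanoTime();
        out.write(buf, off, len);
        writeNanos += System.nanoTime() - start;
        bytesWritten += len;
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return; // only the first close is measured
        }
        closed = true;
        long start = System.nanoTime();
        try {
            out.close();
        } finally {
            closeNanos = System.nanoTime() - start;
        }
    }

    long getBytesWritten() { return bytesWritten; }
    long getWriteNanos() { return writeNanos; }
    long getCloseNanos() { return closeNanos; }

    /** Bytes per second over the time spent in write calls, or 0 if no time was measured. */
    double getWriteRateBytesPerSecond() {
        return writeNanos == 0 ? 0.0 : bytesWritten / (writeNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for the external file system here.
        TimedOutputStream stream = new TimedOutputStream(new ByteArrayOutputStream());
        stream.write(new byte[1 << 20]);
        stream.close();
        System.out.println("bytes written: " + stream.getBytesWritten());
        System.out.println("write latency (ns): " + stream.getWriteNanos());
        System.out.println("close latency (ns): " + stream.getCloseNanos());
        System.out.println("write rate (B/s): " + stream.getWriteRateBytesPerSecond());
    }
}
```

Because the wrapper only depends on OutputStream, it does not need to know which storage system sits behind the stream, which matches the "no need to distinguish the type of external storage system" point above.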