[jira] [Comment Edited] (FLINK-33856) Add metrics to monitor the interaction performance between task and external storage system in the process of checkpoint making

Jufang He (Jira) Fri, 05 Jan 2024 00:40:15 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-33856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803459#comment-17803459
 ]


Jufang He edited comment on FLINK-33856 at 1/5/24 8:39 AM:
-----------------------------------------------------------

[~pnowojski] Thanks for your advice.

It seems that we need child spans for each subtask/task, so that we can 
collect more detailed task-level statistics and more conveniently locate 
the bottleneck in checkpoint creation, for example syncDuration, async 
duration, the latency to write a file, and the latency to close a file. 
Of course, 'writeRate' is then no longer needed.

IMO, I would prefer to report metrics separately for each TM. Our 
production environment has a large number of TMs and subtasks, and if 
changelog checkpointing is enabled, checkpoints may be frequent. I am worried 
that aggregating a large amount of data on the JM may cause performance problems.

Maybe a new FLIP that supports a task-level trace reporter could be created? 
I’m willing to participate in the development.
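
To make the task-level measurements above concrete, here is a minimal sketch of how the write and close latency could be timed around a checkpoint output stream. TimedOutputStream and its getters are hypothetical names used only for illustration. They are not existing Flink classes, and this is not how FsCheckpointStreamFactory is implemented today.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Hypothetical wrapper (not part of Flink): accumulates the time spent in
 * write() and close() calls on a delegate stream.
 */
final class TimedOutputStream extends OutputStream {
    private final OutputStream delegate;
    private long writeNanos;   // total time spent inside delegate.write(...)
    private long closeNanos;   // time spent inside delegate.close()
    private long bytesWritten; // bytes successfully handed to the delegate

    TimedOutputStream(OutputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(int b) throws IOException {
        long start = System.nanoTime();
        try {
            delegate.write(b);
        } finally {
            writeNanos += System.nanoTime() - start;
        }
        bytesWritten++;
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        long start = System.nanoTime();
        try {
            delegate.write(buf, off, len);
        } finally {
            writeNanos += System.nanoTime() - start;
        }
        bytesWritten += len;
    }

    @Override
    public void flush() throws IOException {
        delegate.flush();
    }

    @Override
    public void close() throws IOException {
        long start = System.nanoTime();
        try {
            delegate.close();
        } finally {
            closeNanos += System.nanoTime() - start;
        }
    }

    public long getWriteNanos() { return writeNanos; }
    public long getCloseNanos() { return closeNanos; }
    public long getBytesWritten() { return bytesWritten; }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for the external file system here.
        TimedOutputStream out = new TimedOutputStream(new ByteArrayOutputStream());
        out.write(new byte[] {1, 2, 3}, 0, 3);
        out.close();
        System.out.println("bytes written: " + out.getBytesWritten());
        System.out.println("write latency (ns): " + out.getWriteNanos());
        System.out.println("close latency (ns): " + out.getCloseNanos());
    }
}
```

Because the timing lives in the wrapper, the same measurement would apply regardless of which external storage system sits behind the delegate stream. The accumulated values could then be attached to a per-subtask span or reported per TM.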


was (Author: JIRAUSER302059):
[~pnowojski] Thanks for your advice.

It seems that we need child spans for each subtask/task, so that we can 
collect more detailed task-level statistics and more conveniently locate 
the bottleneck in checkpoint creation, for example syncDuration, async 
duration, the latency to write a file, and the latency to close a file. 
Of course, 'writeRate' is then no longer needed.

IMO, I would prefer to report metrics separately for each TM. Our 
production environment has a large number of TMs and subtasks, and if 
changelog checkpointing is enabled, checkpoints may be frequent. I am worried 
that aggregating a large amount of data on the JM may cause performance problems.

Maybe a new FLIP could be created that supports a task-level trace reporter? 
I’m willing to participate in the development.

> Add metrics to monitor the interaction performance between task and external 
> storage system in the process of checkpoint making
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33856
>                 URL: https://issues.apache.org/jira/browse/FLINK-33856
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.18.0
>            Reporter: Jufang He
>            Assignee: Jufang He
>            Priority: Major
>              Labels: pull-request-available
>
> When Flink takes a checkpoint, the performance of its interaction with the 
> external file system has a large impact on the overall checkpoint duration. 
> Adding performance metrics where the task interacts with the external file 
> storage system therefore makes the bottleneck easier to observe. These 
> include: the rate of file writes, the latency to write the file, and the 
> latency to close the file.
> Adding the above metrics on the Flink side has the following advantages: it 
> is convenient to collect the E2E time of different tasks, and there is no 
> need to distinguish the type of external storage system, since the metrics 
> can be unified in FsCheckpointStreamFactory.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
