[jira] [Comment Edited] (FLINK-33856) Add metrics to monitor the interaction performance between task and external storage system in the process of checkpoint making

Piotr Nowojski (Jira) Thu, 04 Jan 2024 07:01:14 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-33856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803201#comment-17803201
 ]


Piotr Nowojski edited comment on FLINK-33856 at 1/4/24 3:00 PM:
----------------------------------------------------------------

Hi, I agree that implementing this as metrics doesn't sound right.

 

[~hejufang001] , I wouldn't make this a subtask of FLIP-384, but a follow-up 
if needed. There are two things worth noting/discussing:
 * please check the discussion on the dev mailing list about FLIP-384 and its 
current limitations. Namely, we currently create a trace with only a single 
span for the whole checkpoint, and that span is very sparsely populated with 
metrics. There were discussions/plans (CC [~fanrui], if I remember correctly 
you wanted to follow up on this?) about creating child spans per 
subtask/task, to mimic the existing `CheckpointingMetrics` structure. This 
FLIP probably requires that change.
 * once we have per-subtask spans, or aggregated metrics as in [the recovery 
spans from 
FLIP-386|https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans]
 , we might not need some of the metrics that you are proposing here? For 
example, `writeRate` should be easily computed from the checkpointed state 
size and the async duration?
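
To illustrate the `writeRate` point, here is a minimal sketch of deriving an 
average write rate from the two values a per-subtask span would already 
carry. The class and method names (`WriteRateSketch`, `bytesPerSecond`) and 
the parameter units are assumptions made for illustration only; they are not 
existing Flink APIs.

```java
/** Hypothetical helper for illustration; not part of Flink. */
final class WriteRateSketch {

    private WriteRateSketch() {}

    /**
     * Derives an average write rate in bytes per second from the checkpointed
     * state size and the duration of the async phase.
     *
     * @param checkpointedStateSizeBytes bytes written for the checkpoint (assumed unit)
     * @param asyncDurationMillis duration of the async phase in milliseconds (assumed unit)
     * @return average bytes per second, or 0.0 if the duration is not positive
     */
    static double bytesPerSecond(long checkpointedStateSizeBytes, long asyncDurationMillis) {
        if (checkpointedStateSizeBytes < 0) {
            throw new IllegalArgumentException("state size must not be negative");
        }
        if (asyncDurationMillis <= 0) {
            // Avoid division by zero for checkpoints that finished within the clock resolution.
            return 0.0;
        }
        // Convert milliseconds to seconds while dividing.
        return checkpointedStateSizeBytes * 1000.0 / asyncDurationMillis;
    }
}
```

Note that this is only an average over the whole async phase, so it would not 
replace separate write/close latency measurements if those are still wanted.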

Anyway, I think a FLIP will be required here.


was (Author: pnowojski):
Hi, I agree that implementing this as metrics doesn't sound right.

 

[~hejufang001] , I wouldn't make this a subtask of FLIP-384, but a follow-up 
if needed. There are two things worth noting/discussing:
 * please check the discussion on the dev mailing list about FLIP-384 and its 
current limitations. Namely, we currently create a trace with only a single 
span for the whole checkpoint, and that span is very sparsely populated with 
metrics. There were discussions/plans about creating child spans per 
subtask/task, to mimic the existing `CheckpointingMetrics` structure. This 
FLIP probably requires that change.
 * once we have per-subtask spans, or aggregated metrics as in [the recovery 
spans from 
FLIP-386|https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans]
 , we might not need some of the metrics that you are proposing here? For 
example, `writeRate` should be easily computed from the checkpointed state 
size and the async duration?

Anyway, I think a FLIP will be required here.

> Add metrics to monitor the interaction performance between task and external 
> storage system in the process of checkpoint making
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33856
>                 URL: https://issues.apache.org/jira/browse/FLINK-33856
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.18.0
>            Reporter: Jufang He
>            Assignee: Jufang He
>            Priority: Major
>              Labels: pull-request-available
>
> When Flink takes a checkpoint, the performance of its interaction with the 
> external file system has a large impact on the overall checkpoint duration. 
> Adding performance metrics where the task interacts with the external file 
> storage system would therefore make bottlenecks easier to spot. These 
> include: the file write rate, the file write latency, and the file close 
> latency.
> Adding these metrics on the Flink side has the following advantages: it is 
> convenient for collecting statistics on the E2E time of different tasks, and 
> there is no need to distinguish between types of external storage system, 
> since the metrics can be implemented uniformly in 
> FsCheckpointStreamFactory.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
