[
https://issues.apache.org/jira/browse/FLINK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Piotr Nowojski reassigned FLINK-18662:
--------------------------------------
Assignee: Piotr Nowojski
> Provide more detailed metrics why unaligned checkpoint is taking long time
> --------------------------------------------------------------------------
>
> Key: FLINK-18662
> URL: https://issues.apache.org/jira/browse/FLINK-18662
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Metrics, Runtime / Network
> Affects Versions: 1.11.1
> Reporter: Piotr Nowojski
> Assignee: Piotr Nowojski
> Priority: Critical
> Fix For: 1.12.0
>
> Attachments: Screenshot 2020-07-21 at 11.50.02.png
>
>
> With unaligned checkpoints, a situation like the one in the attached
> screenshot can occur.
> The task reports a long end-to-end checkpoint time (~2h50min), ~0s sync time,
> ~2h50min async time, and ~0s start delay. This means that the task received
> the first checkpoint barrier from one of the channels very quickly (~0s) and
> the sync part was quick, but we do not know why the async part took so long.
> There are three possible causes:
> # long operator state IO writes
> # long spilling of in-flight data
> # long time to receive the final checkpoint barrier from the last lagging
> channel
> The first and second are probably indistinguishable, and the difference
> between them doesn't matter much for analysis. The last one, however, is
> quite different. It might be independent of the IO, and we are missing this
> information.
> Maybe we could report it as "alignment duration" and, while we are at it,
> also report the amount of spilled in-flight data for unaligned checkpoints
> as "alignment buffered"?
> Ideally we should report these as new metrics, but that leaves the question
> of how to display them in the UI, given the limited space available. Maybe
> they could be reported as:
> ||Alignment Buffered||Alignment Duration||
> |0 B (632 MB)|0ms (2h 49m 32s)|
> where the values in parentheses would come from unaligned checkpoints.
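The combined cell layout proposed above could be produced along these lines. This is only a sketch: `CheckpointCellFormat` and its methods are hypothetical names, binary units (1 MB = 1024 * 1024 B) and truncating division are assumptions, and the actual UI formatting may differ.

```java
// Hypothetical formatter for the "aligned (unaligned)" table cells.
final class CheckpointCellFormat {
    private static final String[] UNITS = {"B", "KB", "MB", "GB", "TB"};

    // Formats a byte count using binary units, truncating any fraction.
    static String bytes(long count) {
        long value = count;
        int unit = 0;
        while (value >= 1024 && unit < UNITS.length - 1) {
            value /= 1024;
            unit++;
        }
        return value + " " + UNITS[unit];
    }

    // Formats milliseconds as "Nms" below one second, else as "Xh Ym Zs".
    static String duration(long millis) {
        if (millis < 1000) {
            return millis + "ms";
        }
        long totalSeconds = millis / 1000;
        long hours = totalSeconds / 3600;
        long minutes = (totalSeconds % 3600) / 60;
        long seconds = totalSeconds % 60;
        StringBuilder sb = new StringBuilder();
        if (hours > 0) {
            sb.append(hours).append("h ");
        }
        if (hours > 0 || minutes > 0) {
            sb.append(minutes).append("m ");
        }
        return sb.append(seconds).append("s").toString();
    }

    // Combines the aligned value with the unaligned value in parentheses.
    static String cell(String aligned, String unaligned) {
        return aligned + " (" + unaligned + ")";
    }
}
```

With these assumptions, `cell(bytes(0), bytes(632L * 1024 * 1024))` yields `0 B (632 MB)` and `cell(duration(0), duration(10_172_000L))` yields `0ms (2h 49m 32s)`, matching the example row.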
--
This message was sent by Atlassian Jira
(v8.3.4#803005)