[
https://issues.apache.org/jira/browse/FLINK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165485#comment-17165485
]
Zhijiang commented on FLINK-18662:
----------------------------------
I agree that adding metrics for alignment duration and in-flight buffer size
during alignment is also valuable for analysis in UC mode.
Regarding the options, if possible I prefer refactoring and unifying the
metric names so that they have consistent semantics for both aligned and
unaligned mode:
* Unified semantics are easier for users to understand, since they do not have
to switch concepts between AC and UC modes, and easier to display in the Web
UI.
* `alignment duration` describes the time taken from seeing the first barrier
until the last barrier on the receiving side. So it only reflects the delay in
barrier transport from the upstream side, and it can be interpreted the same
way for both AC and UC modes.
* `in-flight buffer size during alignment` describes the size of the buffers
collected during the above `alignment duration`. The only difference is that
in AC mode these in-flight buffers need to be processed by the OP before
executing the CP, so the metric helps to analyze a slow CP caused by large
in-flight buffers with an OP bottleneck. In UC mode these in-flight buffers
need to be spilled during CP execution, so the metric helps to analyze a slow
CP caused by an IO bottleneck.
* The previous `alignment buffered` metric is meaningless for both AC and UC
modes since it is always 0, and we can eventually remove it once the
compatibility concern is resolved.
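To make the proposed semantics concrete, here is a minimal sketch of how the
two unified metrics could be derived on the receiving side. It is only an
illustration, not Flink's actual implementation: the class `AlignmentTracker`
and its methods are hypothetical names, timestamps are passed in by the
caller, and it assumes that "in-flight" means buffers arriving between the
first and the last barrier on channels whose barrier is still pending.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical helper (not a Flink class) that derives "alignment duration"
 * and "in-flight buffer size during alignment" for one checkpoint.
 * The same bookkeeping applies to AC and UC mode; only the later handling of
 * the counted buffers differs (processed by the OP vs. spilled).
 */
final class AlignmentTracker {
    private final int numChannels;
    private final Set<Integer> channelsWithBarrier = new HashSet<>();
    private long firstBarrierNanos = -1L;
    private long lastBarrierNanos = -1L;
    private long inFlightBytes = 0L;

    AlignmentTracker(int numChannels) {
        this.numChannels = numChannels;
    }

    /** Records a barrier arriving on the given channel at the given time. */
    void onBarrier(int channel, long nowNanos) {
        if (!channelsWithBarrier.add(channel)) {
            return; // ignore a duplicate barrier on the same channel
        }
        if (channelsWithBarrier.size() == 1) {
            firstBarrierNanos = nowNanos; // alignment starts
        }
        if (channelsWithBarrier.size() == numChannels) {
            lastBarrierNanos = nowNanos; // alignment ends
        }
    }

    /** Records a data buffer; counted only while alignment is in progress. */
    void onBuffer(int channel, int sizeBytes) {
        boolean aligning = firstBarrierNanos >= 0 && lastBarrierNanos < 0;
        // Assumption: only data ahead of a still-pending barrier is in-flight.
        if (aligning && !channelsWithBarrier.contains(channel)) {
            inFlightBytes += sizeBytes;
        }
    }

    /** Time from first to last barrier, or -1 if alignment has not finished. */
    long alignmentDurationNanos() {
        return lastBarrierNanos < 0 ? -1L : lastBarrierNanos - firstBarrierNanos;
    }

    /** Bytes counted as in-flight during alignment so far. */
    long inFlightBytes() {
        return inFlightBytes;
    }
}
```

With two input channels, a barrier on channel 0 at t=1000ns, a 300-byte
buffer on channel 1, and the barrier on channel 1 at t=5000ns, this sketch
reports an alignment duration of 4000ns and 300 in-flight bytes, regardless
of whether the checkpoint is aligned or unaligned.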
> Provide more detailed metrics why unaligned checkpoint is taking long time
> --------------------------------------------------------------------------
>
> Key: FLINK-18662
> URL: https://issues.apache.org/jira/browse/FLINK-18662
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Metrics, Runtime / Network
> Affects Versions: 1.11.1
> Reporter: Piotr Nowojski
> Priority: Critical
> Fix For: 1.12.0
>
> Attachments: Screenshot 2020-07-21 at 11.50.02.png
>
>
> With unaligned checkpoints, a situation like the one in the attached
> screenshot can happen.
> The task reports a long end-to-end checkpoint time (~2h50min), ~0s sync time,
> ~2h50min async time, and ~0s start delay. This means that the task received
> the first checkpoint barrier from one of the channels very quickly (~0s) and
> the sync part was quick, but we do not know why the async part took so long.
> It could be because of three things:
> # long operator state IO writes
> # long spilling of in-flight data
> # long time to receive the final checkpoint barrier from the last lagging
> channel
> The first and second are probably indistinguishable, and the difference
> between them doesn't matter much for analysis. However, the last one is quite
> different. It might be independent of the IO, and we are missing this
> information.
> Maybe we could report it as "alignment duration", and while we are at it, we
> could also report the amount of spilled in-flight data for unaligned
> checkpoints as "alignment buffered"?
> Ideally we should report these as new metrics, but that leaves the question
> of how to display them in the UI, with limited space available. Maybe they
> could be reported as:
> ||Alignment Buffered||Alignment Duration||
> |0 B (632 MB)|0ms (2h 49m 32s)|
> where the values in parentheses would come from unaligned checkpoints.