[
https://issues.apache.org/jira/browse/FLINK-36679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mingliang Liu updated FLINK-36679:
----------------------------------
Description:
Currently we expose multiple metrics for the checkpoint size. One particularly
interesting data point is the {{_metadata}} file size, which can also be
exposed as a metric. The {{_metadata}} file stores multiple types of data,
including operator states, coordinator states and properties. Its size should
stay within a reasonable range; otherwise a job may take too long to restore
from checkpoints and/or fail to start when the size exceeds the RPC frame limit.
In practice, we have seen the {{_metadata}} file bloat multiple times, causing
jobs to start slowly and/or fail to start. The community reported similar
problems in FLINK-32658. Tracking the metadata size can be helpful for
operations.
was:
Currently we expose multiple metrics for the checkpoint size. One particularly
interesting data point is the `_metadata` file size, which can also be exposed
as a metric. The `_metadata` file stores multiple types of data, including
operator states, coordinator states and properties. Its size should stay within
a reasonable range; otherwise a job may take too long to restore from
checkpoints and/or fail to start when the size exceeds the RPC frame limit.
In practice, we have seen the `_metadata` file bloat multiple times, causing
jobs to start slowly and/or fail to start. The community reported similar
problems in FLINK-32658. Tracking the metadata size can be helpful for
operations.
> Add a metric to track checkpoint _metadata size
> -----------------------------------------------
>
> Key: FLINK-36679
> URL: https://issues.apache.org/jira/browse/FLINK-36679
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.18.1, 1.20.0, 1.19.1
> Reporter: Mingliang Liu
> Priority: Major
>
> Currently we expose multiple metrics for the checkpoint size. One particularly
> interesting data point is the {{_metadata}} file size, which can also be
> exposed as a metric. The {{_metadata}} file stores multiple types of data,
> including operator states, coordinator states and properties. Its size should
> stay within a reasonable range; otherwise a job may take too long to restore
> from checkpoints and/or fail to start when the size exceeds the RPC frame limit.
> In practice, we have seen the {{_metadata}} file bloat multiple times, causing
> jobs to start slowly and/or fail to start. The community reported similar
> problems in FLINK-32658. Tracking the metadata size can be helpful for
> operations.
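
Until such a metric exists, the size check described above can be approximated
by hand. The following is only a minimal sketch, not Flink code: the class
MetadataSizeCheck, its method names and the 10 MiB limit are assumptions made
for illustration, and it assumes the checkpoint directory is reachable as a
local path (checkpoints on remote storage would need that storage's client
instead). The actual limit is whatever RPC frame size the cluster is
configured with.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration only; not part of Flink. */
public class MetadataSizeCheck {

    // Assumed limit for illustration (10 MiB). Replace it with the RPC
    // frame size actually configured for your cluster.
    public static final long ASSUMED_FRAME_LIMIT_BYTES = 10L * 1024 * 1024;

    /** Returns the size in bytes of the _metadata file in a checkpoint directory. */
    public static long metadataSizeBytes(Path checkpointDir) throws IOException {
        return Files.size(checkpointDir.resolve("_metadata"));
    }

    /** True when the size is at or above the given fraction of the limit. */
    public static boolean isNearLimit(long sizeBytes, long limitBytes, double warnFraction) {
        return sizeBytes >= (long) (limitBytes * warnFraction);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: MetadataSizeCheck <checkpoint-dir>");
            return;
        }
        long size = metadataSizeBytes(Paths.get(args[0]));
        System.out.printf("_metadata size: %d bytes%n", size);
        // Warn at 80% of the assumed limit so there is headroom to react.
        if (isNearLimit(size, ASSUMED_FRAME_LIMIT_BYTES, 0.8)) {
            System.out.println("WARNING: _metadata is close to the assumed RPC frame limit");
        }
    }
}
```

Run it against a completed checkpoint directory (one that contains a
_metadata file). A built-in metric would make the same information available
continuously instead of through an ad-hoc check like this one.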