Mingliang Liu created FLINK-36679:
-------------------------------------
Summary: Add a metric to track checkpoint _metadata size
Key: FLINK-36679
URL: https://issues.apache.org/jira/browse/FLINK-36679
Project: Flink
Issue Type: Improvement
Components: Runtime / Checkpointing
Affects Versions: 1.19.1, 1.20.0, 1.18.1
Reporter: Mingliang Liu
Currently we expose multiple metrics for the checkpoint size. One particularly
interesting data point is the size of the `_metadata` file, which could also be
exposed as a metric. The `_metadata` file stores multiple types of data,
including operator states, coordinator states and properties. Its size should
stay within a reasonable range; otherwise the job may take too long to restore
from checkpoints and/or fail to start when the size exceeds the RPC frame limit.
However, we have repeatedly seen the `_metadata` file bloat, causing jobs to
start slowly and/or fail to start. The community reported similar problems in
FLINK-32658. Tracking the metadata size would be helpful for operations.
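
As a rough illustration of the data point in question, the sketch below reads
the size of the `_metadata` file from a checkpoint directory and wraps the
lookup in a supplier, similar in spirit to a gauge. This is not Flink code:
the class `MetadataSizeProbe` and its methods are hypothetical, a plain
`LongSupplier` stands in for a real metric, and it assumes the checkpoint
directory is on a local filesystem.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.LongSupplier;

/** Hypothetical helper for illustration only; not part of Flink. */
public class MetadataSizeProbe {

    /** Name of the metadata file inside a completed checkpoint directory. */
    public static final String METADATA_FILE_NAME = "_metadata";

    /** Returns the size in bytes of the _metadata file, or -1 if it cannot be read. */
    public static long metadataSizeBytes(Path checkpointDir) {
        Path metadata = checkpointDir.resolve(METADATA_FILE_NAME);
        try {
            return Files.size(metadata);
        } catch (IOException e) {
            // Missing or unreadable file: report -1 instead of failing the caller.
            return -1L;
        }
    }

    /** Wraps the lookup in a supplier, standing in for a gauge-style metric (assumption). */
    public static LongSupplier metadataSizeGauge(Path checkpointDir) {
        return () -> metadataSizeBytes(checkpointDir);
    }

    public static void main(String[] args) throws IOException {
        // Simulate a checkpoint directory containing a 1 KiB _metadata file.
        Path dir = Files.createTempDirectory("chk-1");
        Files.write(dir.resolve(METADATA_FILE_NAME), new byte[1024]);
        System.out.println(metadataSizeGauge(dir).getAsLong()); // prints 1024
    }
}
```

A value like this could be compared against an alerting threshold chosen by
the operator, so that growth is noticed before the job fails to restore.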