Keith Lee created FLINK-35853:
---------------------------------
Summary: Regression in checkpoint size when performing full
checkpointing in RocksDB
Key: FLINK-35853
URL: https://issues.apache.org/jira/browse/FLINK-35853
Project: Flink
Issue Type: Bug
Components: Runtime / State Backends
Affects Versions: 1.18.1
Environment: amazon-linux-2023
Reporter: Keith Lee
Attachments: StaticStateSizeGenerator115.java,
StaticStateSizeGenerator118.java
We have a job with a small, static state size (state entries are updated
rather than added). The job is configured to use RocksDB with full
checkpointing (incremental disabled) because the diff between checkpoints is
larger than the full checkpoint size.
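For reference, a minimal sketch of the setup described above, expressed as
Flink configuration. The checkpoint directory is a hypothetical placeholder,
and the actual job may set these options in code instead. On 1.18 the key
state.backend.type is, to my understanding, the preferred spelling of
state.backend; the older key is used here because it is accepted by both 1.15
and 1.18.

```yaml
# RocksDB state backend with full (non-incremental) checkpoints.
state.backend: rocksdb
state.backend.incremental: false
# Hypothetical location; replace with the real checkpoint storage path.
state.checkpoints.dir: file:///tmp/flink-checkpoints
```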
After migrating to 1.18, we observed a significant and steady increase in full
checkpoint size with RocksDB + full checkpointing. The increase was not
observed with the HashMap state backend.
I managed to reproduce the issue with the following code:
[^StaticStateSizeGenerator115.java]
[^StaticStateSizeGenerator118.java]
Result:
On Flink 1.15, RocksDB + full checkpointing, checkpoint size is constant at
250KiB.
On Flink 1.18, RocksDB + full checkpointing, checkpoint size grew to a maximum
of 38MiB before dropping (presumably due to compaction?).
On Flink 1.18, HashMap state backend, checkpoint size is constant at 219KiB.
Notes:
One observation I have is that the issue is more pronounced with higher
parallelism; the reproduction code uses a parallelism of 8. In the production
application where we first saw the regression, checkpoint size grew into the
GiB range, whereas we expected, and observed in 1.15, at most a couple of MiB.