[
https://issues.apache.org/jira/browse/FLINK-35853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866287#comment-17866287
]
Keith Lee commented on FLINK-35853:
-----------------------------------
I dug around in the code, but I am not familiar with the state backend code
base. I did note a change introduced in FLINK-28699, where
IncrementalRemoteKeyedStateHandle is now also used for full checkpointing.
> Regression in checkpoint size when performing full checkpointing in RocksDB
> ---------------------------------------------------------------------------
>
> Key: FLINK-35853
> URL: https://issues.apache.org/jira/browse/FLINK-35853
> Project: Flink
> Issue Type: Bug
> Components: Runtime / State Backends
> Affects Versions: 1.18.1
> Environment: amazon-linux-2023
> Reporter: Keith Lee
> Priority: Major
> Attachments: StaticStateSizeGenerator115.java,
> StaticStateSizeGenerator118.java
>
>
> We have a job with a small, static state size (states are updated rather
> than added). The job is configured to use RocksDB + full checkpointing
> (incremental disabled) because the diff between checkpoints is larger than
> the full checkpoint size.
> After migrating to 1.18, we observed a significant and steady increase in
> full checkpoint size with RocksDB + full checkpointing. The increase was not
> observed with the hashmap state backend.
> I managed to reproduce the issue with the following code:
> [^StaticStateSizeGenerator115.java]
> [^StaticStateSizeGenerator118.java]
> Result:
> On Flink 1.15, RocksDB + full checkpointing, checkpoint size is constant at
> 250KiB.
> On Flink 1.18, RocksDB + full checkpointing, the maximum checkpoint size
> reached 38MiB before dropping (presumably due to compaction?)
> On Flink 1.18, hashmap state backend, checkpoint size is constant at 219KiB.
> Notes:
> One observation is that the issue is more pronounced with higher
> parallelism; the reproduction code uses a parallelism of 8. In the
> production application where we first saw the regression, checkpoint size
> grew into the GiB range, whereas we expected, and observed in 1.15, at most
> a couple of MiB.
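As a rough illustration of the workload pattern in the report above (not a
diagnosis of the regression), here is a toy model of a store where keys are
overwritten rather than added. It assumes an append-style store in which an
overwritten version stays on disk until compaction, which is how LSM-based
stores such as RocksDB generally behave. The class StaticStateToyModel and its
methods are hypothetical names for this sketch and use no Flink or RocksDB
APIs. Whether the 1.18 full checkpoints actually include stale versions is not
confirmed here.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: hypothetical names, not Flink's or RocksDB's implementation. */
public class StaticStateToyModel {

    // Latest value per key: the logical (live) state.
    private final Map<Integer, Long> live = new HashMap<>();

    // Every version ever written and not yet compacted away.
    private int storedEntries = 0;

    /** Overwrites a key; the previous version still counts as stored. */
    public void put(int key, long value) {
        live.put(key, value);
        storedEntries++;
    }

    /** Number of distinct keys, i.e. what a snapshot of live data would hold. */
    public int liveEntries() {
        return live.size();
    }

    /** Number of versions currently held, including overwritten ones. */
    public int storedEntries() {
        return storedEntries;
    }

    /** Drops overwritten versions so only the live ones remain. */
    public void compact() {
        storedEntries = live.size();
    }

    public static void main(String[] args) {
        StaticStateToyModel store = new StaticStateToyModel();
        int keys = 100; // assumed key count, chosen only for the demo
        for (int round = 1; round <= 5; round++) {
            for (int k = 0; k < keys; k++) {
                store.put(k, round); // update in place, never add a new key
            }
            System.out.printf("round %d: live=%d stored=%d%n",
                    round, store.liveEntries(), store.storedEntries());
        }
        store.compact();
        System.out.printf("after compaction: live=%d stored=%d%n",
                store.liveEntries(), store.storedEntries());
    }
}
```

In this model the live entry count stays at the number of keys, while the
stored count grows by that number every round and falls back only after
compact(). A snapshot sized by live data would stay flat; one sized by stored
data would grow and then drop.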
--
This message was sent by Atlassian Jira
(v8.20.10#820010)