[
https://issues.apache.org/jira/browse/FLINK-19293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arvid Heise updated FLINK-19293:
--------------------------------
Component/s: Runtime / State Backends
> RocksDB last_checkpoint.state_size grows endlessly until savepoint/restore
> --------------------------------------------------------------------------
>
> Key: FLINK-19293
> URL: https://issues.apache.org/jira/browse/FLINK-19293
> Project: Flink
> Issue Type: Bug
> Components: Library / CEP, Runtime / Checkpointing, Runtime / State
> Backends
> Affects Versions: 1.10.1
> Reporter: Thomas Wozniakowski
> Priority: Major
> Attachments: Screenshot 2020-09-18 at 13.58.30.png
>
>
> Hi Guys,
> I am seeing some strange behaviour that may be a bug, or may just be intended.
> We are running a Flink job on a 1.10.1 cluster running with 1 JobManager and
> 2 TaskManagers, parallelism 4. The job itself is simple:
> # Source: kinesis connector reading from a single shard stream
> # CEP: ~25 CEP Keyed Pattern operators watching the event stream for
> different kinds of behaviour. They all have ".withinSeconds(xxxx)" applied.
> Nothing is set up to grow endlessly.
> # Sink: Single operator writing messages to SQS (custom code)
> We are seeing the checkpoint size grow constantly until the job is restarted
> using a savepoint/restore. The size continues to grow past the point that the
> ".withinSeconds(xxxx)" limits should cause old data to be discarded. The
> growth is also out of proportion to the general platform growth (which is
> actually trending down at the moment due to COVID).
> I've attached a snapshot from our monitoring dashboard below. You can see the
> huge drops in state_size on a savepoint/restore.
> Our state configuration is as follows:
> Backend: RocksDB
> Mode: EXACTLY_ONCE
> Max Concurrent: 1
> Externalised Checkpoints: RETAIN_ON_CANCELLATION
> Async: TRUE
> Incremental: TRUE
> TTL Compaction Filter enabled: TRUE
> We are worried that the CEP library may be leaking state somewhere, leaving
> some objects not cleaned up. Unfortunately I can't share one of these
> checkpoints with the community due to the sensitive nature of the data
> contained within, but if anyone has any suggestions for how I could analyse
> the checkpoints to look for leaks, please let me know.
> Thanks in advance for the help
--
This message was sent by Atlassian Jira
(v8.3.4#803005)