[
https://issues.apache.org/jira/browse/FLINK-29095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597924#comment-17597924
]
Roman Khachatryan commented on FLINK-29095:
-------------------------------------------
There is some logging already but it could be improved.
For example, SharedStateRegistry [logs on key
duplication|https://github.com/apache/flink/blob/2220f24925ab5146d5771c3782ed8c0837bb0bc4/flink-runtime/src/main/java/org/apache/flink/runtime/state/SharedStateRegistryImpl.java#L134]:
{code:java}
Identified duplicate state registration under key
0d7c41ca-954d-49a2-97b9-4c42e9db1ad8-KeyGroupRange{startKeyGroup=64,
endKeyGroup=127}-000066.sst. New state
org.apache.flink.runtime.state.PlaceholderStreamStateHandle@7fcdb6ee was
determined to be an unnecessary copy of existing state File State: <some path>
<some size> and will be dropped.
{code}
Here,
1. The actual object to discard (scheduledStateDeletion) might not be logged
2. entry.confirmed is not logged
3. The message is the same for both branches
4. Placeholder.toString can be overridden
5. The existing state is less interesting (it must have been already logged
earlier)
Besides that, nothing is logged if a different object representing the same
state is registered twice.
Usually, this isn't a problem; however, in some cases it might indicate a bug
(e.g. SharedStateRegistry and CheckpointStore using different Java objects for
the same state on recovery).
So I'm going to change the issue to Improvement.
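
To illustrate the points above, here is a minimal sketch of what more informative logging could look like. It is a simplified, hypothetical registry, not the actual SharedStateRegistryImpl: Handle, Entry, register, confirm and the log list are all made-up names, and a real implementation would use the existing logger instead of collecting strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical, simplified registry used only to illustrate the logging; not Flink code. */
public class RegistryLoggingSketch {

    /** Hypothetical state handle; two handles are equal when they point to the same path. */
    static final class Handle {
        final String path;
        Handle(String path) { this.path = path; }
        @Override public boolean equals(Object o) {
            return o instanceof Handle && ((Handle) o).path.equals(path);
        }
        @Override public int hashCode() { return path.hashCode(); }
        // Overridden so the log shows the path and the object identity.
        @Override public String toString() {
            return "Handle{path=" + path + ", id=" + System.identityHashCode(this) + "}";
        }
    }

    /** Hypothetical entry: the registered handle and whether a completed checkpoint uses it. */
    static final class Entry {
        Handle handle;
        boolean confirmed;
        Entry(Handle handle) { this.handle = handle; }
    }

    final Map<String, Entry> entries = new HashMap<>();
    /** Stand-in for a logger, so the sketch has no dependencies. */
    final List<String> log = new ArrayList<>();

    /** Registers state under key; returns the handle scheduled for deletion, or null. */
    Handle register(String key, Handle state) {
        Entry entry = entries.get(key);
        if (entry == null) {
            entries.put(key, new Entry(state));
            log.add("Registered new state under key " + key + ": " + state);
            return null;
        }
        if (entry.handle == state) {
            return null; // the very same object, nothing to discard
        }
        if (Objects.equals(entry.handle, state)) {
            // Different object representing the same state: usually fine, possibly a bug.
            log.add("Different object for the same state under key " + key
                    + ": existing=" + entry.handle + ", new=" + state
                    + ", confirmed=" + entry.confirmed);
            return null;
        }
        Handle scheduledStateDeletion;
        if (entry.confirmed) {
            // Existing state belongs to a completed checkpoint: keep it, drop the new one.
            scheduledStateDeletion = state;
            log.add("Duplicate registration under key " + key
                    + ", confirmed=true: discarding NEW state " + scheduledStateDeletion);
        } else {
            // Existing state is not confirmed yet: replace it with the new one.
            scheduledStateDeletion = entry.handle;
            entry.handle = state;
            log.add("Duplicate registration under key " + key
                    + ", confirmed=false: discarding OLD state " + scheduledStateDeletion
                    + ", keeping " + state);
        }
        return scheduledStateDeletion;
    }

    /** Marks the entry under key as used by a completed checkpoint. */
    void confirm(String key) {
        Entry entry = entries.get(key);
        if (entry != null) {
            entry.confirmed = true;
        }
    }

    public static void main(String[] args) {
        RegistryLoggingSketch registry = new RegistryLoggingSketch();
        registry.register("000066.sst", new Handle("/checkpoints/a"));
        registry.register("000066.sst", new Handle("/checkpoints/a")); // equal, other object
        registry.register("000066.sst", new Handle("/checkpoints/b")); // replaces unconfirmed
        registry.confirm("000066.sst");
        registry.register("000066.sst", new Handle("/checkpoints/c")); // dropped, old confirmed
        registry.log.forEach(System.out::println);
    }
}
```

Each branch gets its own message, the object that is actually scheduled for deletion and the confirmed flag are always part of it, and the equal-but-not-identical case is no longer silent.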
> logging state file deletion
> ---------------------------
>
> Key: FLINK-29095
> URL: https://issues.apache.org/jira/browse/FLINK-29095
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Checkpointing, Runtime / State Backends
> Affects Versions: 1.16.0
> Reporter: Jing Ge
> Assignee: Roman Khachatryan
> Priority: Critical
>
> With incremental checkpoints, conceptually, state files that are not
> used by any checkpoint are deleted (garbage-collected). In practice, state
> files might be deleted while they are still required for failover, which
> causes the Flink job to fail.
> We should add logging to help with troubleshooting.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)