[ 
https://issues.apache.org/jira/browse/FLINK-29095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597924#comment-17597924
 ] 

Roman Khachatryan commented on FLINK-29095:
-------------------------------------------

There is some logging already but it could be improved.

For example, SharedStateRegistry [logs on key 
duplication|https://github.com/apache/flink/blob/2220f24925ab5146d5771c3782ed8c0837bb0bc4/flink-runtime/src/main/java/org/apache/flink/runtime/state/SharedStateRegistryImpl.java#L134]:
{code:java}
Identified duplicate state registration under key 
0d7c41ca-954d-49a2-97b9-4c42e9db1ad8-KeyGroupRange{startKeyGroup=64, 
endKeyGroup=127}-000066.sst. New state 
org.apache.flink.runtime.state.PlaceholderStreamStateHandle@7fcdb6ee was 
determined to be an unnecessary copy of existing state File State: <some path> 
<some size> and will be dropped.
{code}
Here,
1. The actual object to discard (scheduledStateDeletion) might not be logged 
2. entry.confirmed is not logged
3. The message is the same for two branches
4. Placeholder.toString can be overridden
5. The existing state is less interesting (it must have been already logged 
earlier)
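
As a rough illustration of points 1-3, the sketch below logs the object that is actually going to be discarded, includes the confirmed flag, and uses a different message for each branch. It is a simplified, hypothetical model: RegistryLoggingSketch, Entry, confirm and register are made-up names, the branching is an assumption, and none of this is the actual SharedStateRegistryImpl code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these types are Flink classes.
final class RegistryLoggingSketch {

    /** Simplified registry entry: a state object plus a "confirmed" flag. */
    static final class Entry {
        Object state;
        boolean confirmed;

        Entry(Object state) {
            this.state = state;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Marks the entry under the given key as confirmed, if it exists. */
    void confirm(String key) {
        Entry entry = entries.get(key);
        if (entry != null) {
            entry.confirmed = true;
        }
    }

    /**
     * Registers state under key. Returns the message that would be logged,
     * or null when nothing is scheduled for deletion.
     */
    String register(String key, Object newState) {
        Entry entry = entries.get(key);
        if (entry == null) {
            entries.put(key, new Entry(newState));
            return null;
        }
        if (entry.state == newState) {
            return null; // same object registered again, nothing to discard
        }
        final Object toDiscard;
        final String branch;
        if (entry.confirmed) {
            // Assumed branch 1: the existing entry wins, the new state is dropped.
            toDiscard = newState;
            branch = "keeping existing entry";
        } else {
            // Assumed branch 2: the unconfirmed entry is replaced by the new state.
            toDiscard = entry.state;
            entry.state = newState;
            branch = "replacing unconfirmed entry";
        }
        return String.format(
                "Duplicate registration under key %s (confirmed=%b): %s, scheduling deletion of %s",
                key, entry.confirmed, branch, toDiscard);
    }
}
```

In real code the returned string would go to the logger; it is returned here only to keep the sketch self-contained.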

Besides that, nothing is logged if a different object representing the same 
state is registered twice.
Usually, this isn't a problem; however, in some cases it might indicate a bug 
(e.g. SharedStateRegistry and CheckpointStore use different Java objects for 
the same state on recovery).
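
One possible way to surface this case is to compare by identity first and by equality second, and to log when the two disagree. The sketch below is only an illustration under assumptions: HandleSketch and DuplicateObjectCheck are hypothetical names, and treating two handles with the same path as "the same state" is an assumption, not how Flink's state handles are necessarily compared.

```java
import java.util.Objects;

// Hypothetical stand-in for a state handle; equality is based on the path only.
final class HandleSketch {
    final String path;

    HandleSketch(String path) {
        this.path = path;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof HandleSketch && ((HandleSketch) o).path.equals(path);
    }

    @Override
    public int hashCode() {
        return path.hashCode();
    }

    @Override
    public String toString() {
        // Include the identity hash so two equal handles can be told apart in logs.
        return "HandleSketch{path=" + path + ", identity="
                + Integer.toHexString(System.identityHashCode(this)) + "}";
    }
}

final class DuplicateObjectCheck {

    /**
     * Returns a message to log when a different object representing the same
     * state is registered under the key, or null otherwise.
     */
    static String describe(String key, Object existing, Object incoming) {
        if (existing == incoming) {
            return null; // literally the same object
        }
        if (Objects.equals(existing, incoming)) {
            return "Key " + key + ": different object for the same state registered; existing="
                    + existing + ", new=" + incoming;
        }
        return null; // genuinely different state, handled elsewhere
    }
}
```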

So I'm going to change the issue to Improvement.

> logging state file deletion
> ---------------------------
>
>                 Key: FLINK-29095
>                 URL: https://issues.apache.org/jira/browse/FLINK-29095
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Checkpointing, Runtime / State Backends
>    Affects Versions: 1.16.0
>            Reporter: Jing Ge
>            Assignee: Roman Khachatryan
>            Priority: Critical
>
> With incremental checkpoints, conceptually, state files that are never 
> used by any checkpoint will be deleted (GC). In practice, state files might 
> be deleted while they are still required by failover, which will 
> cause the Flink job to fail.
> We should add logging for troubleshooting.



