[ https://issues.apache.org/jira/browse/FLINK-20099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230469#comment-17230469 ]
Jiayi Liao commented on FLINK-20099:
------------------------------------
Is this a duplicate of https://issues.apache.org/jira/browse/FLINK-16753?
> HeapStateBackend checkpoint error hidden under cryptic message
> --------------------------------------------------------------
>
> Key: FLINK-20099
> URL: https://issues.apache.org/jira/browse/FLINK-20099
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / State Backends
> Affects Versions: 1.11.2
> Reporter: Nico Kruber
> Priority: Major
> Labels: usability
> Attachments: Screenshot_20201112_001331.png
>
>
> When the state held by the memory state backend exceeds its maximum
> permitted size, checkpoints are declined. Even though a very detailed
> exception is thrown at the source, it is neither logged nor shown in the UI:
> * Logs just contain:
> {code:java}
> 00:06:41.462 [jobmanager-future-thread-14] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Decline checkpoint 2 by task 8eb303cd3196310cb2671212f4ed013c of job c9b7a410bd3143864ca23ba89595d878 at 6a73bcf2-46b6-4735-a616-fdf09ff1471c @ localhost (dataPort=-1).
> {code}
> * UI: (also see the attached Screenshot_20201112_001331.png)
> {code:java}
> Failure Message: The job has failed.
> {code}
> -> this isn't even true: the job is still running fine!
>
> Debugging into {{PendingCheckpoint#abort()}} reveals that the underlying
> exception is still available there, but its detailed information is never
> used.
> For reference, this is what is available there and should be logged or shown:
> {code:java}
> java.lang.Exception: Could not materialize checkpoint 2 for operator aggregates -> (Sink: sink-agg-365, Sink: sink-agg-180, Sink: sink-agg-45, Sink: sink-agg-30) (4/4).
>     at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:191)
>     at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:138)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Size of the state is larger than the maximum permitted memory-backed state. Size=6122737 , maxSize=5242880 . Consider using a different state backend, like the File System State backend.
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>     at org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:479)
>     at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:50)
>     at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:102)
>     ... 3 more
> Caused by: java.io.IOException: Size of the state is larger than the maximum permitted memory-backed state. Size=6122737 , maxSize=5242880 . Consider using a different state backend, like the File System State backend.
>     at org.apache.flink.runtime.state.memory.MemCheckpointStreamFactory.checkSize(MemCheckpointStreamFactory.java:64)
>     at org.apache.flink.runtime.state.memory.MemCheckpointStreamFactory$MemoryCheckpointOutputStream.closeAndGetBytes(MemCheckpointStreamFactory.java:145)
>     at org.apache.flink.runtime.state.memory.MemCheckpointStreamFactory$MemoryCheckpointOutputStream.closeAndGetHandle(MemCheckpointStreamFactory.java:126)
>     at org.apache.flink.runtime.state.CheckpointStreamWithResultProvider$PrimaryStreamOnly.closeAndFinalizeCheckpointStreamResult(CheckpointStreamWithResultProvider.java:77)
>     at org.apache.flink.runtime.state.heap.HeapSnapshotStrategy$1.callInternal(HeapSnapshotStrategy.java:199)
>     at org.apache.flink.runtime.state.heap.HeapSnapshotStrategy$1.callInternal(HeapSnapshotStrategy.java:158)
>     at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:75)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:476)
>     ... 5 more
> {code}
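
The quoted trace nests the useful IOException two levels below the top-level exception, which is probably why a one-line log or UI message loses it. As a rough illustration only, and not Flink's actual code, the sketch below shows in plain Java how such a cause chain could be flattened into a single message. The class RootCauseDemo and its methods rootCause, describeCauseChain and sampleFailure are hypothetical names made up for this example; the exception messages are shortened copies of the ones in the report.

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;

// Hypothetical helper, not part of Flink: shows how the nested cause
// from the report could be surfaced in a single log or UI message.
public class RootCauseDemo {

    /** Follows getCause() links until the innermost exception is reached. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** Joins the class name and message of every exception in the chain. */
    public static String describeCauseChain(Throwable t) {
        StringBuilder sb = new StringBuilder();
        for (Throwable current = t; current != null; current = current.getCause()) {
            if (sb.length() > 0) {
                sb.append(" <- caused by: ");
            }
            sb.append(current.getClass().getName())
              .append(": ")
              .append(current.getMessage());
        }
        return sb.toString();
    }

    /** Rebuilds a chain shaped like the one in the report (messages shortened). */
    public static Exception sampleFailure() {
        IOException root = new IOException(
                "Size of the state is larger than the maximum permitted "
                        + "memory-backed state. Size=6122737 , maxSize=5242880 .");
        ExecutionException wrapped = new ExecutionException(root);
        return new Exception("Could not materialize checkpoint 2", wrapped);
    }

    public static void main(String[] args) {
        Exception failure = sampleFailure();
        // Innermost message only: the part a user actually needs to see.
        System.out.println(rootCause(failure).getMessage());
        // Full chain on one line, suitable for a log statement.
        System.out.println(describeCauseChain(failure));
    }
}
```

Logging either the root-cause message or the flattened chain next to the "Decline checkpoint" line would already point the user at the state size limit; where exactly this belongs in the coordinator is left open here.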