[ https://issues.apache.org/jira/browse/FLINK-31903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhipeng Zhang closed FLINK-31903.
---------------------------------
    Resolution: Not A Bug

> Caching records fails in BroadcastUtils#withBroadcastStream
> -----------------------------------------------------------
>
>                 Key: FLINK-31903
>                 URL: https://issues.apache.org/jira/browse/FLINK-31903
>             Project: Flink
>          Issue Type: Bug
>          Components: Library / Machine Learning
>    Affects Versions: ml-2.3.0
>            Reporter: Zhipeng Zhang
>            Priority: Major
>
> When caching more than 1,000,000 records using
> BroadcastUtils#withBroadcastStream, the job fails with the following exception:
> {code:java}
> Caused by: org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
>     at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.checkFailureAgainstCounter(CheckpointFailureManager.java:206)
>     at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleTaskLevelCheckpointException(CheckpointFailureManager.java:191)
>     at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:124)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2078)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:1038)
>     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$declineCheckpoint$2(ExecutionGraphHandler.java:103)
>     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> It seems that the bug comes from caching too many records when calling
> AbstractBroadcastWrapperOperator#snapshot.
>
> The failing case can be found here:
> [https://github.com/zhipeng93/flink-ml/tree/FLINK-31903-fail-case]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
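
For readers unfamiliar with the error message in the stack trace, the following is a minimal, self-contained sketch of how a "tolerable failure threshold" check can work: consecutive checkpoint failures are counted, and the job is failed once the count exceeds a configured limit. The class FailureThresholdSketch and its methods are hypothetical names used only for illustration. This is not Flink's CheckpointFailureManager, and the reset-on-success behaviour is an assumption of this simplified model.

```java
/**
 * Simplified, hypothetical model of a consecutive-failure threshold.
 * Illustration only; this is NOT Flink's CheckpointFailureManager.
 */
public class FailureThresholdSketch {
    // Maximum number of consecutive failures that is still tolerated.
    private final int tolerableFailures;
    // Number of failures seen since the last success.
    private int continuousFailures = 0;

    public FailureThresholdSketch(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /** Assumption of this model: a successful checkpoint resets the counter. */
    public void onCheckpointSuccess() {
        continuousFailures = 0;
    }

    /** Counts one failure and throws once the tolerated limit is exceeded. */
    public void onCheckpointFailure() {
        continuousFailures++;
        if (continuousFailures > tolerableFailures) {
            throw new IllegalStateException(
                    "Exceeded checkpoint tolerable failure threshold.");
        }
    }

    public int getContinuousFailures() {
        return continuousFailures;
    }

    public static void main(String[] args) {
        FailureThresholdSketch tracker = new FailureThresholdSketch(2);
        tracker.onCheckpointFailure(); // 1st failure, tolerated
        tracker.onCheckpointFailure(); // 2nd failure, tolerated
        try {
            tracker.onCheckpointFailure(); // 3rd failure exceeds the limit
        } catch (IllegalStateException e) {
            System.out.println("Job would fail: " + e.getMessage());
        }
    }
}
```

Under this model, a snapshot that is repeatedly declined (for example because it takes too long or holds too much cached data) eventually exhausts the tolerated failures, which is consistent with the reporter's description but is not confirmed by the issue itself.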