Hi ynagireddy4u, >From the exception info, I think your application has met a HDFS file issue during the commit phase of checkpoint. Can u check why 'Staging file does not exist' in the first place?
Regards, Xiangyu Y SREEKARA BHARGAVA REDDY <ynagiredd...@gmail.com> 于2023年8月4日周五 12:21写道: > Hi Xiangyu/Dev Team, > > Thanks for reply. > > In our flink job, we increase the *checkpoint timeout to 30 min.* > And the *checkpoint interval is 10 min.* > > But while running the job we got below exception. > > java.lang.RuntimeException: Error while confirming checkpoint > at org.apache.flink.streaming.runtime.tasks.StreamTask > .notifyCheckpointComplete(StreamTask.java:952) > at org.apache.flink.streaming.runtime.tasks.StreamTask > .lambda$notifyCheckpointCompleteAsync$7(StreamTask.java:924) > at org.apache.flink.util.function.FunctionUtils.lambda$asCallable$5( > FunctionUtils.java:125) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at org.apache.flink.streaming.runtime.tasks. > StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run( > StreamTaskActionExecutor.java:87) > at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail > .java:78) > at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor > .processMail(MailboxProcessor.java:261) > at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor > .runMailboxLoop(MailboxProcessor.java:186) > at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop( > StreamTask.java:487) > at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke( > StreamTask.java:470) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.IOException: Cannot clean commit: Staging file does not > exist. > at org.apache.flink.runtime.fs.hdfs. > HadoopRecoverableFsDataOutputStream$HadoopFsCommitter.commit( > HadoopRecoverableFsDataOutputStream.java:250) > at org.apache.flink.streaming.api.functions.sink.filesystem.Bucket > .onSuccessfulCompletionOfCheckpoint(Bucket.java:300) > at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets > .commitUpToCheckpoint(Buckets.java:216) > at org.apache.flink.streaming.api.functions.sink.filesystem. > StreamingFileSink.notifyCheckpointComplete(StreamingFileSink.java:415) > at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator > .notifyCheckpointComplete(AbstractUdfStreamOperator.java:130) > at org.apache.flink.streaming.runtime.tasks.StreamTask > .lambda$notifyCheckpointComplete$8(StreamTask.java:936) > at org.apache.flink.streaming.runtime.tasks. > StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.call( > StreamTaskActionExecutor.java:101) > at org.apache.flink.streaming.runtime.tasks.StreamTask > .notifyCheckpointComplete(StreamTask.java:930) > ... 12 more > > It would be great, if you have any workaround for that. > > Regards, > Nagireddy Y. > > > > > > > On Thu, Aug 3, 2023 at 7:24 AM xiangyu feng <xiangyu...@gmail.com> wrote: > >> Hi ynagireddy4u, >> >> We have met this exception before. Usually it is caused by following >> reasons: >> >> 1), TaskManager is too busy with other works to send the heartbeat to >> JobMaster or TaskManager process might already exited; >> 2), There might be a network issues between this TaskManager and >> JobMaster; >> 3), In certain cases, JobMaster actor might also being too busy to >> process the RPC requests from TaskManager; >> >> Pls check if your problem fits the above situations. >> >> Best, >> Xiangyu >> >> >> Y SREEKARA BHARGAVA REDDY <ynagiredd...@gmail.com> 于2023年7月31日周一 20:49写道: >> >>> Hi Team, >>> >>> Did any one face the below exception. >>> If yes, please share the resolution. >>> >>> >>> 2023-07-28 22:04:16 >>> j*ava.util.concurrent.TimeoutException: Heartbeat of TaskManager with id >>> container_e19_1690528962823_0382_01_000005 timed out.* >>> at org.apache.flink.runtime.jobmaster. >>> JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster >>> .java:1147) >>> at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run( >>> HeartbeatMonitorImpl.java:109) >>> at >>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java: >>> 511) >>> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >>> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync( >>> AkkaRpcActor.java:397) >>> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage( >>> AkkaRpcActor.java:190) >>> at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor >>> .handleRpcMessage(FencedAkkaRpcActor.java:74) >>> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage( >>> AkkaRpcActor.java:152) >>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) >>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) >>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) >>> at akka.japi.pf >>> .UnitCaseStatement.applyOrElse(CaseStatements.scala:21) >>> at >>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) >>> at >>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) >>> at >>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) >>> at akka.actor.Actor$class.aroundReceive(Actor.scala:517) >>> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) >>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) >>> at akka.actor.ActorCell.invoke(ActorCell.scala:561) >>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) >>> at akka.dispatch.Mailbox.run(Mailbox.scala:225) >>> at akka.dispatch.Mailbox.exec(Mailbox.scala:235) >>> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>> at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool >>> .java:1339) >>> at >>> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>> at >>> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread >>> .java:107) >>> >>> Any suggestions, please share with me. >>> >>> Regards, >>> Nagireddy Y >>> >>