[ https://issues.apache.org/jira/browse/FLINK-14316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950470#comment-16950470 ]

Steven Zhen Wu commented on FLINK-14316:
----------------------------------------

[~trohrmann] we would love to hear your thoughts on two specific questions:
 # Piyush's patch clearly fixed whatever bug caused this issue. Do you see any other implications or downsides of such a change? If it looks good, we can open an official PR to upstream it.
 # We still haven't been able to identify the root cause of this bug. The job had been running stably and there was no code change. Any idea? (A rough sketch of one possible mechanism follows these questions.)
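
On question 2: since the job code did not change, one plausible (unconfirmed) trigger is on the ZooKeeper side, e.g. a session expiry during a long JobManager GC pause or a network blip, which Flink's Curator-based leader election/retrieval surfaces as a leadership-change callback. The snippet below is only a minimal plain-Curator sketch of that mechanism, not Flink's actual ZooKeeperLeaderElectionService code; the connect string, latch path, and retry settings are illustrative assumptions.

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLossIllustration {

    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper connect string and retry policy, for illustration only.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Hypothetical latch path; Flink uses its own HA znode layout.
        LeaderLatch latch = new LeaderLatch(client, "/leader/job-ecb9ad9be934edf7b1a4f7b9dd6df365");
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                System.out.println("Gained leadership");
            }

            @Override
            public void notLeader() {
                // A ZooKeeper session expiry (e.g. after a long GC pause or a
                // network blip) revokes the latch and lands here even though no
                // job code changed. Flink reacts to the analogous event by
                // failing the affected tasks with "Job leader ... lost leadership".
                System.out.println("Lost leadership");
            }
        });
        latch.start();

        Thread.sleep(Long.MAX_VALUE);
    }
}
{code}

If that is the mechanism here, the JobManager and ZooKeeper logs around the failure timestamps should show session timeouts or leader znode changes rather than anything job-specific.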

Some background on this job:
 * 235 containers, each with 8 CPUs/slots; parallelism is 1,880. When running into this problem, we also tried 50 containers (with parallelism 400) and it still failed.
 * It is a large-state job (a few TBs of state), although we don't think that matters: we redeployed the job with empty state and it still hit the same failure loop.

> stuck in "Job leader ... lost leadership" error
> -----------------------------------------------
>
>                 Key: FLINK-14316
>                 URL: https://issues.apache.org/jira/browse/FLINK-14316
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.7.2
>            Reporter: Steven Zhen Wu
>            Priority: Major
>         Attachments: FLINK-14316.tgz, RpcConnection.patch
>
>
> This is the first exception that caused the restart loop. Later exceptions are the same. The job seems to be stuck in this permanent failure state.
> {code}
> 2019-10-03 21:42:46,159 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: clpevents -> device_filter -> processed_imps -> ios_processed_impression -> imps_ts_assigner (449/1360) (d237f5e99b6a4a580498821473763edb) switched from SCHEDULED to FAILED.
> java.lang.Exception: Job leader for job id ecb9ad9be934edf7b1a4f7b9dd6df365 lost leadership.
>         at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1526)
>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>         at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>         at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>         at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
