[ https://issues.apache.org/jira/browse/FLINK-24240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheren Yu updated FLINK-24240:
------------------------------
Description:
We are using HA with Flink on k8s, which creates a ConfigMap such as
`xxx-dispatcher-leader` and stores the JobGraph in it. When we upgrade from
1.12.4 to 1.13.2 without stopping the job, the JobGraph created by the old
version is deserialized without the jobType field, which causes the problem
below:
{code:java}
Caused by: java.lang.NullPointerException
    at org.apache.flink.runtime.deployment.TaskDeploymentDescriptorFactory$PartitionLocationConstraint.fromJobType(TaskDeploymentDescriptorFactory.java:282) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:347) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:190) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:122) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:132) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:340) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:317) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:107) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:95) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
    at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) ~[?:1.8.0_302]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_302]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_302]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_302]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_302]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_302]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_302]
    at java.lang.Thread.run(Thread.java:748)
{code}
I am just wondering whether there is any workaround for this?
(I know that manually stopping the job before the upgrade may work.)
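To make the failure mode concrete, here is a minimal sketch of what the top frame of the trace appears to be doing. Java serialization leaves a field that is missing from the stream at its default value (null for an enum), and a `switch` on a null enum throws a NullPointerException. The enums below are simplified stand-ins, not Flink's actual classes, and `fromJobTypeOrDefault` is a hypothetical defensive variant, not an existing Flink method; choosing STREAMING as the fallback is only an assumption for illustration.

```java
public class JobTypeNullSketch {
    // Simplified stand-ins for the Flink types named in the stack trace.
    enum JobType { BATCH, STREAMING }

    enum PartitionLocationConstraint { CAN_BE_UNKNOWN, MUST_BE_KNOWN }

    // Same shape as the failing call: a switch on a null enum value
    // throws NullPointerException before any case is evaluated.
    static PartitionLocationConstraint fromJobType(JobType jobType) {
        switch (jobType) {
            case BATCH:
                return PartitionLocationConstraint.CAN_BE_UNKNOWN;
            case STREAMING:
                return PartitionLocationConstraint.MUST_BE_KNOWN;
            default:
                throw new IllegalArgumentException("Unknown job type: " + jobType);
        }
    }

    // Hypothetical null-safe variant: substitute a fallback when the
    // field was absent in the serialized JobGraph.
    static PartitionLocationConstraint fromJobTypeOrDefault(JobType jobType, JobType fallback) {
        return fromJobType(jobType != null ? jobType : fallback);
    }

    public static void main(String[] args) {
        // What a JobGraph written by the old version would yield for the new field.
        JobType restored = null;
        try {
            fromJobType(restored);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException, as in the report");
        }
        System.out.println(fromJobTypeOrDefault(restored, JobType.STREAMING));
    }
}
```

Whether defaulting a missing jobType is actually safe depends on how the old job was submitted, so this only illustrates the mechanism and is not a proposed fix.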
> HA JobGraph deserialization problem when migrate 1.12.4 to 1.13.2
> -----------------------------------------------------------------
>
> Key: FLINK-24240
> URL: https://issues.apache.org/jira/browse/FLINK-24240
> Project: Flink
> Issue Type: Bug
> Components: Runtime / State Backends
> Affects Versions: 1.13.2
> Reporter: Zheren Yu
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)