[ https://issues.apache.org/jira/browse/FLINK-11997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Anderson updated FLINK-11997:
-----------------------------------
    Attachment: FAILURE

> ConcurrentModificationException: ZooKeeper unexpectedly modified
> ----------------------------------------------------------------
>
>                 Key: FLINK-11997
>                 URL: https://issues.apache.org/jira/browse/FLINK-11997
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.8.0
>         Environment: Flink 1.8.0-rc4, running in a k8s job cluster with 
> checkpoints and savepoints stored in MinIO. ZooKeeper HA is enabled, with 
> its storage directory also in MinIO.
> jobmanager.rpc.address: localhost
> jobmanager.rpc.port: 6123
> jobmanager.heap.size: 1024m
> taskmanager.heap.size: 1024m
> taskmanager.numberOfTaskSlots: 4
> parallelism.default: 4
> high-availability: zookeeper
> high-availability.jobmanager.port: 6123
> high-availability.storageDir: s3://highavailability/storage
> high-availability.zookeeper.quorum: zoo1:2181
> state.backend: filesystem
> state.checkpoints.dir: s3://state/checkpoints
> state.savepoints.dir: s3://state/savepoints
> rest.port: 8081
> zookeeper.sasl.disable: true
> s3.access-key: minio
> s3.secret-key: minio123
> s3.path-style-access: true
> s3.endpoint: http://minio-service:9000
>  
>            Reporter: David Anderson
>            Priority: Major
>         Attachments: FAILURE
>
>
> Trying to rescale a job running in a k8s job cluster with:
> flink modify 00000000000000000000000000000000 -p 2 -m localhost:30081
> Rescaling works fine when HA is off. Taking a savepoint and restarting from 
> it also works fine, even with HA turned on. But rescaling by modifying the 
> job with HA on always fails, as shown below:
> Caused by: org.apache.flink.util.FlinkException: Failed to rescale the job 00000000000000000000000000000000.
>         ... 21 more
> Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmaster.exceptions.JobModificationException: Could not restore from temporary rescaling savepoint. This might indicate that the savepoint s3://state/savepoints/savepoint-000000-2fa7fd5dabb2 got corrupted. Deleting this savepoint as a precaution.
>         at org.apache.flink.runtime.jobmaster.JobMaster.lambda$rescaleOperators$4(JobMaster.java:470)
>         at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
>         at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
>         ... 18 more
> Caused by: org.apache.flink.runtime.jobmaster.exceptions.JobModificationException: Could not restore from temporary rescaling savepoint. This might indicate that the savepoint s3://state/savepoints/savepoint-000000-2fa7fd5dabb2 got corrupted. Deleting this savepoint as a precaution.
>         at org.apache.flink.runtime.jobmaster.JobMaster.lambda$restoreExecutionGraphFromRescalingSavepoint$18(JobMaster.java:1433)
>         at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>         at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>         at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException: ZooKeeper unexpectedly modified
>         at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.addAndLock(ZooKeeperStateHandleStore.java:159)
>         at org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore.addCheckpoint(ZooKeeperCompletedCheckpointStore.java:216)
>         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1106)
>         at org.apache.flink.runtime.jobmaster.JobMaster.tryRestoreExecutionGraphFromSavepoint(JobMaster.java:1251)
>         at org.apache.flink.runtime.jobmaster.JobMaster.lambda$restoreExecutionGraphFromRescalingSavepoint$18(JobMaster.java:1413)
>         ... 10 more
> Caused by: org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
>         at org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
>         at org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1006)
>         at org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:910)
>         at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159)
>         at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl.access$200(CuratorTransactionImpl.java:44)
>         at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:129)
>         at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:125)
>         at org.apache.flink.shaded.curator.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
>         at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl.commit(CuratorTransactionImpl.java:122)
>         at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.addAndLock(ZooKeeperStateHandleStore.java:153)
>         ... 14 more
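
Reading the trace from the bottom up, the innermost cause is a NodeExists error raised while committing a ZooKeeper transaction, which then surfaces from addAndLock as a ConcurrentModificationException. The sketch below illustrates only that wrapping pattern. It is not Flink's implementation: it uses an in-memory map instead of ZooKeeper, and the class, method, and path names are hypothetical. It also does not explain why the node already exists in this report.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a ZooKeeper-backed handle store.
// None of these names are Flink or Curator APIs.
class InMemoryHandleStore {

    // Stand-in for ZooKeeper's KeeperException.NodeExistsException.
    static class NodeExistsException extends Exception {
        NodeExistsException(String path) {
            super("NodeExists: " + path);
        }
    }

    private final Map<String, String> nodes = new HashMap<>();

    // Behaves like a create operation: fails if the path is already present.
    private void create(String path, String data) throws NodeExistsException {
        if (nodes.containsKey(path)) {
            throw new NodeExistsException(path);
        }
        nodes.put(path, data);
    }

    // Mirrors the wrapping seen in the trace: a NodeExists failure is
    // rethrown as a ConcurrentModificationException with the original cause.
    void addAndLock(String path, String data) {
        try {
            create(path, data);
        } catch (NodeExistsException e) {
            throw new ConcurrentModificationException("ZooKeeper unexpectedly modified", e);
        }
    }

    public static void main(String[] args) {
        InMemoryHandleStore store = new InMemoryHandleStore();
        String path = "/checkpoints/job/0000000000000000001"; // hypothetical path

        // First add succeeds because the path is free.
        store.addAndLock(path, "handle-a");

        // Second add under the same path triggers the wrapped failure.
        try {
            store.addAndLock(path, "handle-b");
        } catch (ConcurrentModificationException e) {
            System.out.println(e.getMessage() + " <- " + e.getCause().getMessage());
        }
    }
}
```

Under these assumptions, any second add under a path that is still present fails in the same way, regardless of which caller created the first entry.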



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
