[
https://issues.apache.org/jira/browse/FLINK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119634#comment-17119634
]
Fritz Budiyanto commented on FLINK-17853:
-----------------------------------------
Thanks. We will migrate to 1.10. Feel free to close this ticket. I'll re-open
if it is still happening in 1.10.
> JobGraph is not getting deleted after Job cancelation
> -----------------------------------------------------
>
> Key: FLINK-17853
> URL: https://issues.apache.org/jira/browse/FLINK-17853
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.9.2
> Environment: Flink 1.9.2
> Zookeeper from AWS MSK
> Reporter: Fritz Budiyanto
> Priority: Major
> Attachments: flinkissue.txt
>
>
> I have seen this issue several times: JobGraphs are not cleaned up properly
> after job deletion. Jobs are deleted with the "flink stop" command. As a
> result, the JobGraph node lingers in ZK, and when the Flink cluster is
> restarted it attempts an HA restoration from a non-existent checkpoint,
> which prevents the Flink cluster from coming up.
> 2020-05-19 19:56:21,471 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor - Un-registering task and
> sending final execution state FINISHED to JobManager for task Source:
> kafkaConsumer[update_server] -> (DetectedUpdateMessageConverter -> Sink:
> update_server.detected_updates, DrivenCoordinatesMessageConverter -> Sink:
> update_server.driven_coordinates) 588902a8096f49845b09fa1f595d6065.
> 2020-05-19 19:56:21,622 INFO
> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable - Free slot
> TaskSlot(index:0, state:ACTIVE, resource profile:
> ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647,
> directMemoryInMB=2147483647, nativeMemoryInMB=2147483647,
> networkMemoryInMB=2147483647, managedMemoryInMB=642}, allocationId:
> 29f6a5f83c832486f2d7ebe5c779fa32, jobId: 86a028b3f7aada8ffe59859ca71d6385).
> 2020-05-19 19:56:21,622 INFO
> org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job
> 86a028b3f7aada8ffe59859ca71d6385 from job leader monitoring.
> 2020-05-19 19:56:21,622 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Stopping ZooKeeperLeaderRetrievalService
> /leader/86a028b3f7aada8ffe59859ca71d6385/job_manager_lock.
> 2020-05-19 19:56:21,623 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager
> connection for job 86a028b3f7aada8ffe59859ca71d6385.
> 2020-05-19 19:56:21,624 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager
> connection for job 86a028b3f7aada8ffe59859ca71d6385.
> 2020-05-19 19:56:21,624 INFO
> org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to
> job 86a028b3f7aada8ffe59859ca71d6385 because it is not registered.
> ...
> Zookeeper CLI:
> ls /flink/cluster_update/jobgraphs
> [86a028b3f7aada8ffe59859ca71d6385]
>
> Attached are the Flink logs, in reverse order.