[ 
https://issues.apache.org/jira/browse/FLINK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186036#comment-17186036
 ] 

Roman Khachatryan commented on FLINK-19012:
-------------------------------------------

I think the problem is that AsyncCheckpointRunnable throws an exception when it 
sees that SubtaskCheckpointCoordinator is closed. 

Upon close, SubtaskCheckpointCoordinator closes its runnables but doesn't stop 
their threads. This also changes their statuses.

So AsyncCheckpointRunnable should check its status before throwing an exception.

 

The tests started to fail after increasing the log level in FLINK-18962.

> E2E test fails with "Cannot register Closeable, this 
> subtaskCheckpointCoordinator is already closed. Closing argument."
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-19012
>                 URL: https://issues.apache.org/jira/browse/FLINK-19012
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Task, Tests
>    Affects Versions: 1.12.0
>            Reporter: Robert Metzger
>            Assignee: Roman Khachatryan
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.12.0
>
>
> Note: This error occurred in a custom branch with unreviewed changes. I don't 
> believe my changes affect this error, but I would keep this in mind when 
> investigating the error: 
> https://dev.azure.com/rmetzger/Flink/_build/results?buildId=8307&view=logs&j=1f3ed471-1849-5d3c-a34c-19792af4ad16&t=0d2e35fc-a330-5cf2-a012-7267e2667b1d
>  
> {code}
> 2020-08-20T20:55:30.2400645Z 2020-08-20 20:55:22,373 INFO  
> org.apache.flink.runtime.taskmanager.Task                    [] - Registering 
> task at network: Source: Sequence Source -> Flat Map -> Sink: Unnamed (1/1) 
> (cbc357ccb763df2852fee8c4fc7d55f2_0_0) [DEPLOYING].
> 2020-08-20T20:55:30.2402392Z 2020-08-20 20:55:22,401 INFO  
> org.apache.flink.streaming.runtime.tasks.StreamTask          [] - No state 
> backend has been configured, using default (Memory / JobManager) 
> MemoryStateBackend (data in heap memory / checkpoints to JobManager) 
> (checkpoints: 'null', savepoints: 'null', asynchronous: TRUE, maxStateSize: 
> 5242880)
> 2020-08-20T20:55:30.2404297Z 2020-08-20 20:55:22,413 INFO  
> org.apache.flink.runtime.taskmanager.Task                    [] - Source: 
> Sequence Source -> Flat Map -> Sink: Unnamed (1/1) 
> (cbc357ccb763df2852fee8c4fc7d55f2_0_0) switched from DEPLOYING to RUNNING.
> 2020-08-20T20:55:30.2405805Z 2020-08-20 20:55:22,786 INFO  
> org.apache.flink.streaming.connectors.elasticsearch6.Elasticsearch6ApiCallBridge
>  [] - Pinging Elasticsearch cluster via hosts [http://127.0.0.1:9200] ...
> 2020-08-20T20:55:30.2407027Z 2020-08-20 20:55:22,848 INFO  
> org.apache.flink.streaming.connectors.elasticsearch6.Elasticsearch6ApiCallBridge
>  [] - Elasticsearch RestHighLevelClient is connected to 
> [http://127.0.0.1:9200]
> 2020-08-20T20:55:30.2409277Z 2020-08-20 20:55:29,205 INFO  
> org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl
>  [] - Source: Sequence Source -> Flat Map -> Sink: Unnamed (1/1) discarding 0 
> drained requests
> 2020-08-20T20:55:30.2410690Z 2020-08-20 20:55:29,218 INFO  
> org.apache.flink.runtime.taskmanager.Task                    [] - Source: 
> Sequence Source -> Flat Map -> Sink: Unnamed (1/1) 
> (cbc357ccb763df2852fee8c4fc7d55f2_0_0) switched from RUNNING to FINISHED.
> 2020-08-20T20:55:30.2412187Z 2020-08-20 20:55:29,218 INFO  
> org.apache.flink.runtime.taskmanager.Task                    [] - Freeing 
> task resources for Source: Sequence Source -> Flat Map -> Sink: Unnamed (1/1) 
> (cbc357ccb763df2852fee8c4fc7d55f2_0_0).
> 2020-08-20T20:55:30.2414203Z 2020-08-20 20:55:29,224 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - 
> Un-registering task and sending final execution state FINISHED to JobManager 
> for task Source: Sequence Source -> Flat Map -> Sink: Unnamed (1/1) 
> cbc357ccb763df2852fee8c4fc7d55f2_0_0.
> 2020-08-20T20:55:30.2415602Z 2020-08-20 20:55:29,219 INFO  
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - Source: 
> Sequence Source -> Flat Map -> Sink: Unnamed (1/1) - asynchronous part of 
> checkpoint 1 could not be completed.
> 2020-08-20T20:55:30.2416411Z java.io.UncheckedIOException: 
> java.io.IOException: Cannot register Closeable, this 
> subtaskCheckpointCoordinator is already closed. Closing argument.
> 2020-08-20T20:55:30.2418956Z  at 
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.lambda$registerConsumer$2(SubtaskCheckpointCoordinatorImpl.java:468)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> 2020-08-20T20:55:30.2420100Z  at 
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:91)
>  [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> 2020-08-20T20:55:30.2420927Z  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_265]
> 2020-08-20T20:55:30.2421455Z  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_265]
> 2020-08-20T20:55:30.2421879Z  at java.lang.Thread.run(Thread.java:748) 
> [?:1.8.0_265]
> 2020-08-20T20:55:30.2422348Z Caused by: java.io.IOException: Cannot register 
> Closeable, this subtaskCheckpointCoordinator is already closed. Closing 
> argument.
> 2020-08-20T20:55:30.2423416Z  at 
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.registerAsyncCheckpointRunnable(SubtaskCheckpointCoordinatorImpl.java:378)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> 2020-08-20T20:55:30.2424635Z  at 
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.lambda$registerConsumer$2(SubtaskCheckpointCoordinatorImpl.java:466)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> 2020-08-20T20:55:30.2425174Z  ... 4 more
> 2020-08-20T20:55:30.2426945Z 2020-08-20 20:55:29,339 INFO  
> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot 
> TaskSlot(index:0, state:ACTIVE, resource profile: 
> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=384.000mb 
> (402653174 bytes), taskOffHeapMemory=0 bytes, managedMemory=512.000mb 
> (536870920 bytes), networkMemory=128.000mb (134217730 bytes)}, allocationId: 
> f5938e3baed0f9564aa17169f68947bd, jobId: 333b1654bd93574471bd44ad45847379).
> 2020-08-20T20:55:30.2428701Z 2020-08-20 20:55:29,354 INFO  
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove job 
> 333b1654bd93574471bd44ad45847379 from job leader monitoring.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to