[jira] [Comment Edited] (FLINK-19688) Flink job gets into restart loop caused by InterruptedExceptions

Robert Metzger (Jira) Mon, 19 Oct 2020 07:07:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216742#comment-17216742
 ]


Robert Metzger edited comment on FLINK-19688 at 10/19/20, 2:05 PM:
-------------------------------------------------------------------

I'm trying to investigate what's going on.
Here are my notes so far, without any exciting results (yet):

{code}
We have a test setup with 4 TaskManagers.

JM: 2020-10-16 16:01:31,015: TaskManager 2 is the first where an artificial 
failure is induced through a RuntimeException.

JM: The JobManager then cancels the execution on all TMs:
JM: 2020-10-16 16:01:31,049 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job TPCH 
Query 3 Example (06d656f696bf4ed98831938a7ac2359d) switched from state RUNNING 
to RESTARTING.

(side note)
TM3: 2020-10-16 16:01:31,190: TaskManager 3 is logging an InterruptedException 
(as a WARN from ExternalSorter) on "CHAIN Join (Join at 
main(TPCHQuery3.java:183)) -> Map (Map at 
appendMapper(KillerClientMapper.java:38)) (7/8)#0"
(This WARN will be reduced to a DEBUG in 
https://github.com/apache/flink/pull/13668).

JM: cancellation complete, we are RUNNING again:
JM: 2020-10-16 16:01:31,448 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job TPCH 
Query 3 Example (06d656f696bf4ed98831938a7ac2359d) switched from state 
RESTARTING to RUNNING.

TM3: Now TM3 fails a task, with the root cause being an InterruptedException, 
while connecting to TaskManager 0 in PartitionRequestClientFactory.connect().
TM3: 2020-10-16 16:02:15,653 WARN  org.apache.flink.runtime.taskmanager.Task    
                [] - CHAIN GroupReduce (GroupReduce at 
main(TPCHQuery3.java:199)) -> Map (Map at 
appendMapper(KillerClientMapper.java:38)) (8/8)#1 
(06d656f696bf4ed98831938a7ac2359d_c1c4a56fea0536703d37867c057f0cc8_7_1) 
switched from RUNNING to FAILED.
java.lang.Exception: The data preparation for task 'CHAIN GroupReduce 
(GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at 
appendMapper(KillerClientMapper.java:38))' , caused an error: 
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error 
obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due 
to an exception: Connection for partition 
060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
 not reachable.

JM: receives the report from TM3:
JM: 2020-10-16 16:02:15,860 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - CHAIN 
GroupReduce (GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at 
appendMapper(KillerClientMapper.java:38)) (8/8) 
(06d656f696bf4ed98831938a7ac2359d_c1c4a56fea0536703d37867c057f0cc8_7_1) 
switched from RUNNING to FAILED on

JM: 2020-10-16 16:02:15,888 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job TPCH 
Query 3 Example (06d656f696bf4ed98831938a7ac2359d) switched from state RUNNING 
to RESTARTING.

What is TM0 doing when TM3 reports the InterruptedException?
Nothing special:
TM0: 2020-10-16 16:02:15,633 INFO  org.apache.flink.runtime.taskmanager.Task    
                [] - CHAIN GroupReduce (GroupReduce at 
main(TPCHQuery3.java:199)) -> Map (Map at 
appendMapper(KillerClientMapper.java:38)) (2/8)#1 
(06d656f696bf4ed98831938a7ac2359d_c1c4a56fea0536703d37867c057f0cc8_1_1) 
switched from DEPLOYING to RUNNING.
TM0: 2020-10-16 16:02:15,894 INFO  org.apache.flink.runtime.taskmanager.Task    
                [] - Attempting to cancel task CHAIN DataSource (at 
getLineitemDataSet(TPCHQuery3.java:347) 
(org.apache.flink.api.java.io.TupleCsvInputFormat)) -> Map (Map at 
appendMapper(KillerClientMapper.java:38)) (2/8)#1 
(06d656f696bf4ed98831938a7ac2359d_7d1f567aae11518e19b6b807800d3af7_1_1).

What is happening afterwards?

2020-10-16 16:02:16,150 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job TPCH 
Query 3 Example (06d656f696bf4ed98831938a7ac2359d) switched from state 
RESTARTING to RUNNING.
2020-10-16 16:02:23,241 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - CHAIN 
DataSource (at getOrdersDataSet(TPCHQuery3.java:363) 
(org.apache.flink.api.java.io.TupleCsvInputFormat)) -> Map (Map at 
appendMapper(KillerClientMapper.java:38)) -> Filter (Filter at 
main(TPCHQuery3.java:145)) (4/8) 
(06d656f696bf4ed98831938a7ac2359d_dbb16db93f572f3995dadfb433c8c1ff_3_2) 
switched from RUNNING to FAILED on 
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@2e3effa.
java.lang.RuntimeException: Kill requested
2020-10-16 16:02:23,243 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job TPCH 
Query 3 Example (06d656f696bf4ed98831938a7ac2359d) switched from state RUNNING 
to RESTARTING.
2020-10-16 16:02:23,562 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job TPCH 
Query 3 Example (06d656f696bf4ed98831938a7ac2359d) switched from state 
RESTARTING to RUNNING.
2020-10-16 16:03:06,051 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - CHAIN 
DataSource (at getLineitemDataSet(TPCHQuery3.java:347) 
(org.apache.flink.api.java.io.TupleCsvInputFormat)) -> Map (Map at 
appendMapper(KillerClientMapper.java:38)) (7/8) 
(06d656f696bf4ed98831938a7ac2359d_12527f4a17e3fa06633ddc6f343514ed_6_3) 
switched from RUNNING to FAILED on 
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@1de3a10.
java.lang.RuntimeException: Kill requested
2020-10-16 16:03:06,266 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job TPCH 
Query 3 Example (06d656f696bf4ed98831938a7ac2359d) switched from state 
RESTARTING to RUNNING.
2020-10-16 16:03:48,543 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - CHAIN 
GroupReduce (GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at 
appendMapper(KillerClientMapper.java:38)) (7/8) 
(06d656f696bf4ed98831938a7ac2359d_c1c4a56fea0536703d37867c057f0cc8_6_4) 
switched from RUNNING to FAILED on 
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@56d6a66.
java.lang.Exception: The data preparation for task 'CHAIN GroupReduce 
(GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at 
appendMapper(KillerClientMapper.java:38))' , caused an error: 
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error 
obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due 
to an exception: Connection for partition 
060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_4
 not reachable.
2020-10-16 16:03:48,548 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job TPCH 
Query 3 Example (06d656f696bf4ed98831938a7ac2359d) switched from state RUNNING 
to RESTARTING.
2020-10-16 16:03:48,704 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job TPCH 
Query 3 Example (06d656f696bf4ed98831938a7ac2359d) switched from state 
RESTARTING to RUNNING.
2020-10-16 16:04:31,385 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - CHAIN 
DataSource (at getLineitemDataSet(TPCHQuery3.java:347) 
(org.apache.flink.api.java.io.TupleCsvInputFormat)) -> Map (Map at 
appendMapper(KillerClientMapper.java:38)) (2/8) 
(06d656f696bf4ed98831938a7ac2359d_12527f4a17e3fa06633ddc6f343514ed_1_5) 
switched from RUNNING to FAILED on 
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@124567d2.
java.lang.RuntimeException: Kill requested
2020-10-16 16:04:31,387 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job TPCH 
Query 3 Example (06d656f696bf4ed98831938a7ac2359d) switched from state RUNNING 
to RESTARTING.
... and so on ...

It seems that we are not in a restart loop. Rather, the execution sometimes gets 
interrupted.

Could it be an interrupt from the operating system?
Unlikely; we are logging signals:
2020-10-16 16:00:04,092 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner      [] - Registered 
UNIX signal handlers for [TERM, HUP, INT]
There is only one signal logged:
2020-10-16 16:10:31,680 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner      [] - RECEIVED 
SIGNAL 15: SIGTERM. Shutting down as requested.
{code}
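
When going through failures like the ones in the notes above, it can help to check mechanically whether a reported exception bottoms out in an InterruptedException. The following is only a sketch: the class and method names are made up for illustration, and the exception chain in main() is hand-built to loosely mirror the nesting in the reported trace; it is not taken from Flink.

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class RootCause {

    /** Follows getCause() links until the innermost throwable is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** True if the innermost cause of the chain is an InterruptedException. */
    static boolean causedByInterrupt(Throwable t) {
        return rootCause(t) instanceof InterruptedException;
    }

    public static void main(String[] args) {
        // Hypothetical chain for illustration only; messages are shortened.
        Throwable chain = new Exception(
                "The data preparation for task caused an error",
                new ExecutionException(
                        new IOException(
                                "Connection for partition not reachable.",
                                new InterruptedException())));

        System.out.println(rootCause(chain).getClass().getName()); // java.lang.InterruptedException
        System.out.println(causedByInterrupt(chain));              // true
    }
}
```

A check like this only classifies the failure; it says nothing about which thread issued the interrupt.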

I will now try to find out who's interrupting the thread.
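
One generic JVM technique for finding the interrupter, shown here as a sketch and not as the approach actually taken for this issue: Thread.interrupt() executes on the calling thread, so overriding it in a Thread subclass allows capturing the caller's stack trace. TracingThread and lastInterrupter are hypothetical names; in Flink the affected threads are created by the runtime, so applying this would presumably require patching the place where the thread is constructed.

```java
import java.util.concurrent.atomic.AtomicReference;

public class InterruptTracer {

    /** Hypothetical helper (not a Flink class): records who called interrupt() on it. */
    static class TracingThread extends Thread {
        final AtomicReference<Throwable> lastInterrupter = new AtomicReference<>();

        TracingThread(Runnable target, String name) {
            super(target, name);
        }

        @Override
        public void interrupt() {
            // This method runs on the *calling* thread, so the Throwable created
            // here carries the stack trace of whoever issued the interrupt.
            lastInterrupter.set(new Throwable(
                    "interrupt() on '" + getName() + "' called from '"
                            + Thread.currentThread().getName() + "'"));
            super.interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TracingThread worker = new TracingThread(() -> {
            try {
                Thread.sleep(60_000); // stands in for a blocking wait
            } catch (InterruptedException e) {
                // Expected in this demo: the main thread interrupts us.
            }
        }, "worker");

        worker.start();
        worker.interrupt();
        worker.join();

        // Prints the stack trace of the code path that interrupted the worker.
        worker.lastInterrupter.get().printStackTrace();
    }
}
```

This only sees interrupts delivered through Thread.interrupt() on instances of the subclass; it would not explain an InterruptedException thrown by other means.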



> Flink job gets into restart loop caused by InterruptedExceptions
> ----------------------------------------------------------------
>
>                 Key: FLINK-19688
>                 URL: https://issues.apache.org/jira/browse/FLINK-19688
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network, Runtime / Task
>    Affects Versions: 1.12.0
>            Reporter: Robert Metzger
>            Priority: Blocker
>             Fix For: 1.12.0
>
>         Attachments: logs.tgz
>
>
> I have a benchmarking test job that throws RuntimeExceptions at any operator 
> at a configured, random interval. When using low intervals, such as a mean 
> failure interval of 60 s, the job will get into a state where it frequently 
> fails with InterruptedExceptions.
> The same job does not have this problem on Flink 1.11.2 (at least not after 
> running the job for 15 hours); on 1.12-SNAPSHOT, it happens within a few minutes.
> This is the job: 
> https://github.com/rmetzger/flip1-bench/blob/master/flip1-bench-jobs/src/main/java/com/ververica/TPCHQuery3.java
> This is the exception:
> {code}
> 2020-10-16 16:02:15,653 WARN  org.apache.flink.runtime.taskmanager.Task       
>              [] - CHAIN GroupReduce (GroupReduce at 
> main(TPCHQuery3.java:199)) -> Map (Map at 
> appendMapper(KillerClientMapper.java:38)) (8/8)#1 
> (06d656f696bf4ed98831938a7ac2359d_c1c4a56fea0536703d37867c057f0cc8_7_1) 
> switched from RUNNING to FAILED.
> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce 
> (GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at 
> appendMapper(KillerClientMapper.java:38))' , caused an error: 
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error 
> obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due 
> to an exception: Connection for partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>       at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:481) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:370) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722) 
> [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547) 
> [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
> Caused by: org.apache.flink.util.WrappingRuntimeException: 
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error 
> obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due 
> to an exception: Connection for partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>       at 
> org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:253)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       ... 4 more
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: Error obtaining the sorted input: Thread 
> 'SortMerger Reading Thread' terminated due to an exception: Connection for 
> partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>       at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 
> ~[?:1.8.0_222]
>       at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) 
> ~[?:1.8.0_222]
>       at 
> org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:250)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       ... 4 more
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: 
> Thread 'SortMerger Reading Thread' terminated due to an exception: Connection 
> for partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>       at 
> org.apache.flink.runtime.operators.sort.ExternalSorter.lambda$getIterator$1(ExternalSorter.java:247)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>  ~[?:1.8.0_222]
>       at 
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>  ~[?:1.8.0_222]
>       at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  ~[?:1.8.0_222]
>       at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>  ~[?:1.8.0_222]
>       at 
> org.apache.flink.runtime.operators.sort.ExternalSorterBuilder.lambda$doBuild$1(ExternalSorterBuilder.java:363)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.sort.ThreadBase.internalHandleException(ThreadBase.java:123)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.sort.ThreadBase.run(ThreadBase.java:82) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated 
> due to an exception: Connection for partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>       at 
> org.apache.flink.runtime.operators.sort.ThreadBase.run(ThreadBase.java:83) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> Caused by: 
> org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException:
>  Connection for partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:160)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.internalRequestPartitions(SingleInputGate.java:305)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:277)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.taskmanager.InputGateWithMetrics.requestPartitions(InputGateWithMetrics.java:93)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:91)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.sort.ReadingThread.go(ReadingThread.java:68)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.sort.ThreadBase.run(ThreadBase.java:79) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> Caused by: java.io.IOException: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
> Connecting to remote task manager '/192.168.2.172:52366' has failed. This 
> might indicate that the remote task manager has been lost.
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:85)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:67)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:157)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.internalRequestPartitions(SingleInputGate.java:305)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:277)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.taskmanager.InputGateWithMetrics.requestPartitions(InputGateWithMetrics.java:93)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:91)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.sort.ReadingThread.go(ReadingThread.java:68)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.sort.ThreadBase.run(ThreadBase.java:79) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
> Connecting to remote task manager '/192.168.2.172:52366' has failed. This 
> might indicate that the remote task manager has been lost.
>       at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 
> ~[?:1.8.0_222]
>       at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) 
> ~[?:1.8.0_222]
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:83)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:67)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:157)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.internalRequestPartitions(SingleInputGate.java:305)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:277)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.taskmanager.InputGateWithMetrics.requestPartitions(InputGateWithMetrics.java:93)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:91)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.sort.ReadingThread.go(ReadingThread.java:68)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.operators.sort.ThreadBase.run(ThreadBase.java:79) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> Caused by: 
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
> Connecting to remote task manager '/192.168.2.172:52366' has failed. This 
> might indicate that the remote task manager has been lost.
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connect(PartitionRequestClientFactory.java:122)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connectWithRetries(PartitionRequestClientFactory.java:101)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.lambda$createPartitionRequestClient$1(PartitionRequestClientFactory.java:78) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.concurrent.FutureUtils.completeFromCallable(FutureUtils.java:88) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:78) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:67) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:157) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.internalRequestPartitions(SingleInputGate.java:305) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:277) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.requestPartitions(InputGateWithMetrics.java:93) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:91) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.operators.sort.ReadingThread.go(ReadingThread.java:68) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.operators.sort.ThreadBase.run(ThreadBase.java:79) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> Caused by: java.lang.InterruptedException
>       at java.lang.Object.wait(Native Method) ~[?:1.8.0_222]
>       at java.lang.Object.wait(Object.java:502) ~[?:1.8.0_222]
>       at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:252) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:131) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:30) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connect(PartitionRequestClientFactory.java:114) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connectWithRetries(PartitionRequestClientFactory.java:101) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.lambda$createPartitionRequestClient$1(PartitionRequestClientFactory.java:78) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.concurrent.FutureUtils.completeFromCallable(FutureUtils.java:88) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:78) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:67) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:157) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.internalRequestPartitions(SingleInputGate.java:305) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:277) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.requestPartitions(InputGateWithMetrics.java:93) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:91) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.operators.sort.ReadingThread.go(ReadingThread.java:68) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.runtime.operators.sort.ThreadBase.run(ThreadBase.java:79) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> {code}
> I will attach all logs to this ticket.
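
For readers following the trace: its innermost frames show a thread blocked in a wait (DefaultPromise.await, called from PartitionRequestClientFactory.connect) being interrupted, with the InterruptedException then reported as the cause of a higher-level failure. The sketch below reproduces only that general pattern in plain Java. It is not Flink's code: InterruptedConnectSketch, connect and runAndInterrupt are hypothetical names, a CountDownLatch stands in for the Netty connect promise, and wrapping the interrupt in an IOException is an assumption made for illustration.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class InterruptedConnectSketch {

    // Hypothetical stand-in for a blocking connect: waits on a latch
    // that plays the role of the connect promise.
    static void connect(CountDownLatch connected) throws IOException {
        try {
            connected.await();
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it,
            // then surface the interruption as the cause of an I/O failure
            // (the wrapper type is an assumption of this sketch).
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting for connection", e);
        }
    }

    // Starts a reader thread that blocks in connect(), interrupts it,
    // and returns whatever exception the thread failed with.
    static Throwable runAndInterrupt() throws InterruptedException {
        CountDownLatch neverConnected = new CountDownLatch(1);
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread reader = new Thread(() -> {
            try {
                connect(neverConnected);
            } catch (IOException e) {
                failure.set(e);
            }
        }, "reader-sketch");
        reader.start();
        reader.interrupt(); // comparable to a task cancellation interrupting the thread
        reader.join();
        return failure.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Throwable t = runAndInterrupt();
        System.out.println(t + " caused by " + t.getCause());
    }
}
```

Running main prints the wrapping exception followed by its InterruptedException cause, which is the same "Caused by" shape as in the trace above.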



