[jira] [Assigned] (FLINK-35552) Move CheckpointStatsTracker out of ExecutionGraph into Scheduler
[ https://issues.apache.org/jira/browse/FLINK-35552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Pohl reassigned FLINK-35552:
-------------------------------------

    Assignee: Matthias Pohl

> Move CheckpointStatsTracker out of ExecutionGraph into Scheduler
> ----------------------------------------------------------------
>
>                 Key: FLINK-35552
>                 URL: https://issues.apache.org/jira/browse/FLINK-35552
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing, Runtime / Coordination
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: pull-request-available
>
> The scheduler needs to know about the CheckpointStatsTracker to allow
> listening to checkpoint failures and completion.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Assigned] (FLINK-35551) Introduces RescaleManager#onTrigger endpoint
[ https://issues.apache.org/jira/browse/FLINK-35551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Pohl reassigned FLINK-35551:
-------------------------------------

    Assignee: Matthias Pohl

> Introduces RescaleManager#onTrigger endpoint
> --------------------------------------------
>
>                 Key: FLINK-35551
>                 URL: https://issues.apache.org/jira/browse/FLINK-35551
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: pull-request-available
>
> The new endpoint would allow us to separate observing change events from
> actually triggering the rescale operation.
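The separation that FLINK-35551 describes, observing change events in one place and firing the actual rescale in another, can be sketched as follows. This is a hypothetical illustration only; the interface and class names below are invented for the sketch and are not Flink's actual `RescaleManager` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RescaleSketch {
    interface RescaleTrigger {
        void onTrigger(); // performs the actual rescale operation
    }

    static class ChangeObserver {
        private final List<Integer> pending = new ArrayList<>();
        private final RescaleTrigger trigger;

        ChangeObserver(RescaleTrigger trigger) {
            this.trigger = trigger;
        }

        // Observing a change only records it; nothing is rescaled here.
        void onChange(int desiredParallelism) {
            pending.add(desiredParallelism);
        }

        // A separate decision point (e.g. a cooldown elapsing) fires the trigger.
        void evaluate() {
            if (!pending.isEmpty()) {
                trigger.onTrigger();
                pending.clear();
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger rescales = new AtomicInteger();
        ChangeObserver observer = new ChangeObserver(rescales::incrementAndGet);
        observer.onChange(4);
        observer.onChange(8); // two observed events...
        observer.evaluate();  // ...but only one rescale is triggered
        System.out.println("rescales=" + rescales.get()); // prints rescales=1
    }
}
```

The point of the split is that several observed events can be coalesced into a single rescale decision.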
[jira] [Assigned] (FLINK-35553) Integrate newly added trigger interface with checkpointing
[ https://issues.apache.org/jira/browse/FLINK-35553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Pohl reassigned FLINK-35553:
-------------------------------------

    Assignee: Matthias Pohl

> Integrate newly added trigger interface with checkpointing
> ----------------------------------------------------------
>
>                 Key: FLINK-35553
>                 URL: https://issues.apache.org/jira/browse/FLINK-35553
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing, Runtime / Coordination
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: pull-request-available
>
> This connects the newly introduced trigger logic (FLINK-35551) with the
> {{CheckpointStatsTracker}}.
[jira] [Comment Edited] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager
[ https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856230#comment-17856230 ]

Matthias Pohl edited comment on FLINK-35639 at 6/19/24 9:53 AM:
----------------------------------------------------------------

[~chesnay] pointed me to the actual issue. I was initially wondering why the change in FLINK-32570 was "overlooked" by our {{japicmp}} checks. The problem is that you're not following the supported process (as documented in the [Flink docs|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#restarting-streaming-applications]). That results in incompatibilities of internal APIs (the constructor in question is package-private). Please use savepoints to migrate jobs. There are other internal APIs (the JobGraph itself isn't a stable API, either) that might cause problems in your upgrade process.
# Create a savepoint of the job in the old version.
# Start the Flink cluster with the upgraded Flink version.
# Submit the job using the created savepoint to restart the job, using the job client of the new Flink binaries (to allow for proper JobGraph creation).

was (Author: mapohl):
[~chesnay] pointed me to the actual issue because I was wondering why the change in FLINK-32570 was "overlooked" by our {{japicmp}} checks. The problem is that you're not following the supported process (as documented in the [Flink docs|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#restarting-streaming-applications]). That results in incompatibilities of internal APIs (the constructor in question is package-private). Please use savepoints to migrate jobs. There are other internal APIs (the JobGraph itself isn't a stable API, either) that might cause problems in your upgrade process.
# Create a savepoint of the job in the old version.
# Start the Flink cluster with the upgraded Flink version.
# Submit the job using the created savepoint to restart the job using the job client of the new Flink binaries (to allow for proper JobGraph creation).

> upgrade to 1.19 with job in HA state with restart strategy crashes job manager
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-35639
>                 URL: https://issues.apache.org/jira/browse/FLINK-35639
>             Project: Flink
>          Issue Type: Bug
>          Components: API / Core
>    Affects Versions: 1.20.0, 1.19.1
>         Environment: Download the 1.18 and 1.19 binary releases. Add the following to
> flink-1.19.0/conf/config.yaml and flink-1.18.1/conf/flink-conf.yaml:
> ```yaml
> high-availability: zookeeper
> high-availability.zookeeper.quorum: localhost
> high-availability.storageDir: file:///tmp/flink/recovery
> ```
> Launch zookeeper: docker run --network host zookeeper:latest
> Launch the 1.18 task manager: ./flink-1.18.1/bin/taskmanager.sh start-foreground
> Launch the 1.18 job manager: ./flink-1.18.1/bin/jobmanager.sh start-foreground
> Launch the following job:
> ```java
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.api.common.functions.FlatMapFunction;
> import org.apache.flink.util.Collector;
> import org.apache.flink.api.common.restartstrategy.RestartStrategies;
> import org.apache.flink.api.common.time.Time;
> import java.util.concurrent.TimeUnit;
>
> public class FlinkJob {
>     public static void main(String[] args) throws Exception {
>         final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>         env.setRestartStrategy(
>             RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.of(20, TimeUnit.SECONDS))
>         );
>         env.fromElements("Hello World", "Hello Flink")
>             .flatMap(new LineSplitter())
>             .groupBy(0)
>             .sum(1)
>             .print();
>     }
>
>     public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
>         @Override
>         public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
>             for (String word : value.split(" ")) {
>                 try {
>                     Thread.sleep(12);
>                 } catch (InterruptedException e) {
>                     e.printStackTrace();
>                 }
>                 out.collect(new Tuple2<>(word, 1));
>             }
>         }
>     }
> }
> ```
> ```xml
> <project>
>     <modelVersion>4.0.0</modelVersion>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>myflinkjob</artifactId>
>     <version>1.0-SNAPSHOT</version>
>     <properties>
>         <flink.version>1.18.1</flink.version>
>         <java.version>1.8</java.version>
>     </properties>
>     <dependencies>
>         <dependency>
>             <groupId>org.apache.flink</groupId>
>             <artifactId>flink-java</artifactId>
>             <version>${flink.version}</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.flink</groupId>
>             <artifactId>flink-streaming-java</artifactId>
>             <version>${flink.version}</version>
>         </dependency>
>     </dependencies>
>     <build>
>         <plugins>
>             <plugin>
>                 <groupId>org.apache.maven.plugins</groupId>
>                 <artifactId>maven-compiler-plugin</artifactId>
>                 <version>3.8.1</version>
>                 <configuration>
>                     <source>${java.version}</source>
>                     <target>${java.version}</target>
>                 </configuration>
>             </plugin>
>             <plugin>
>                 <groupId>org.apache.maven.plugins</groupId>
>                 <artifactId>maven-jar-plugin</artifactId>
>                 <version>3.1.0</version>
>                 <configuration>
>                     <archive>
>                         <manifest>
>                             <addClasspath>true</addClasspath>
>                             <classpathPrefix>lib/</classpathPrefix>
>                             <mainClass>FlinkJob</mainClass>
>                         </manifest>
>                     </archive>
>                 </configuration>
>             </plugin>
>         </plugins>
>     </build>
> </project>
> ```
> Launch the job: ./flink-1.18.1/bin/flink run ../flink-job/target/myflinkjob-1.0-SNAPSHOT.jar
> Job has been submitted with JobID 5f0898c964a93a47aa480427f3e2c6c0
> Kill the job manager and task manager, then launch the 1.19.0 job manager:
> ./flink-1.19.0/bin/jobmanager.sh start-foreground
>
> Root cause
> ==========
> It looks like the type of delayBetweenAttemptsInterval was changed in 1.19
> (https://github.com/apache/flink/pull/22984/files#diff-d174f32ffdea69de610c4f37c545bd22a253b9846434f83397f1bbc2aaa399faR239),
> introducing an incompatibility which is not handled by Flink 1.19. In my
> opinion, the job manager should not crash when starting in that case.
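The three savepoint-migration steps recommended in the comment above correspond roughly to the following Flink CLI invocations. This is an illustrative sketch only: the job ID is taken from the reproduction above, while the savepoint directory and resulting savepoint path are placeholders that depend on your setup.

```shell
# 1. Take a savepoint of the running job with the *old* (1.18) client and stop it.
./flink-1.18.1/bin/flink stop --savepointPath /tmp/flink/savepoints 5f0898c964a93a47aa480427f3e2c6c0

# 2. Shut down the old cluster and start the upgraded (1.19) one.
./flink-1.18.1/bin/jobmanager.sh stop
./flink-1.18.1/bin/taskmanager.sh stop
./flink-1.19.0/bin/jobmanager.sh start
./flink-1.19.0/bin/taskmanager.sh start

# 3. Resubmit from the savepoint with the *new* client, so the JobGraph is
#    created by the new Flink version (use the actual savepoint path printed in step 1).
./flink-1.19.0/bin/flink run -s /tmp/flink/savepoints/savepoint-5f0898-XXXXXXXXXXXX \
    ../flink-job/target/myflinkjob-1.0-SNAPSHOT.jar
```

With this flow the HA store never has to deserialize a JobGraph written by the old version, which is what crashes in this report.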
[jira] [Updated] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager
[ https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Pohl updated FLINK-35639:
----------------------------------
    Priority: Major  (was: Blocker)

> upgrade to 1.19 with job in HA state with restart strategy crashes job manager
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-35639
>                 URL: https://issues.apache.org/jira/browse/FLINK-35639
>             Project: Flink
>          Issue Type: Bug
>          Components: API / Core
>    Affects Versions: 1.20.0, 1.19.1
>            Reporter: yazgoo
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: pull-request-available
>
> When trying to upgrade a flink cluster from 1.18 to 1.19, with a 1.18 job in
> zookeeper HA state, I have a jobmanager crash with a ClassCastException, see
> log below
>
> {code:java}
> 2024-06-18 16:58:14,401 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> org.apache.flink.util.FlinkException: JobMaster for job 5f0898c964a93a47aa480427f3e2c6c0 failed.
>     at org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:1484) ~[flink-dist-1.19.0.jar:1.19.0]
>     at org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:775) ~[flink-dist-1.19.0.jar:1.19.0]
>     at org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:738) ~[flink-dist-1.19.0.jar:1.19.0]
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$7(Dispatcher.java:693) ~[flink-dist-1.19.0.jar:1.19.0]
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934) ~[?:?]
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911) ~[?:?]
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482) ~[?:?]
>     at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:451) ~[flink-rpc-akka84eb9e64-a1ce-450c-ad53-d9fa579b67e1.jar:1.19.0]
>     at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-dist-1.19.0.jar:1.19.0]
> {code}
[jira] [Closed] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager
[ https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Pohl closed FLINK-35639.
---------------------------------
    Resolution: Not A Problem

I'm closing the issue and the related PRs because we don't support this kind of version upgrade in general. Fixing the {{RestartStrategy}} issue wouldn't necessarily solve the underlying problem.
[jira] [Comment Edited] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager
[ https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856230#comment-17856230 ]

Matthias Pohl edited comment on FLINK-35639 at 6/19/24 9:47 AM:
----------------------------------------------------------------

[~chesnay] pointed me to the actual issue because I was wondering why the change in FLINK-32570 was "overlooked" by our {{japicmp}} checks. The problem is that you're not following the supported process (as documented in the [Flink docs|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#restarting-streaming-applications]). That results in incompatibilities of internal APIs (the constructor in question is package-private). Please use savepoints to migrate jobs. There are other internal APIs (the JobGraph itself isn't a stable API, either) that might cause problems in your upgrade process.
# Create a savepoint of the job in the old version.
# Start the Flink cluster with the upgraded Flink version.
# Submit the job using the created savepoint to restart the job using the job client of the new Flink binaries (to allow for proper JobGraph creation).

was (Author: mapohl):
[~chesnay] pointed me to the actual issue because I was wondering why the change in FLINK-32570 was "overlooked" by our {{japicmp}} checks. The problem is that you're not following the supported process (as documented in the [Flink docs|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#restarting-streaming-applications]). That results in incompatibilities of internal APIs (the constructor in question is package-private). Please use savepoints to migrate jobs. There are other internal APIs (the JobGraph itself isn't a stable API, either) that might cause problems in your upgrade process.
# Create a savepoint of the job in the old version.
# Start the Flink cluster with the upgraded Flink version.
# Submit the job using the created savepoint to restart the job.
[jira] [Commented] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager
[ https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856230#comment-17856230 ]

Matthias Pohl commented on FLINK-35639:
---------------------------------------

[~chesnay] pointed me to the actual issue because I was wondering why the change in FLINK-32570 was "overlooked" by our {{japicmp}} checks. The problem is that you're not following the supported process (as documented in the [Flink docs|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#restarting-streaming-applications]). That results in incompatibilities of internal APIs (the constructor in question is package-private). Please use savepoints to migrate jobs. There are other internal APIs (the JobGraph itself isn't a stable API, either) that might cause problems in your upgrade process.
# Create a savepoint of the job in the old version.
# Start the Flink cluster with the upgraded Flink version.
# Submit the job using the created savepoint to restart the job.
[jira] [Assigned] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager
[ https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Pohl reassigned FLINK-35639:
-------------------------------------

    Assignee: Matthias Pohl
[jira] [Comment Edited] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager
[ https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855979#comment-17855979 ] Matthias Pohl edited comment on FLINK-35639 at 6/18/24 3:42 PM:

I guess you're right. Thanks for reporting this. It looks like an error was made when deprecating the {{@PublicEvolving}} API of {{RestartStrategies#FixedDelayRestartStrategyConfiguration}}. I will raise the priority of this one to Blocker because it also affects 1.20.

was (Author: mapohl): I guess you're right. It looks like an error was made when deprecating the {{@PublicEvolving}} API of {{RestartStrategies#FixedDelayRestartStrategyConfiguration}}. I will raise the priority of this one to Blocker because it also affects 1.20.

> upgrade to 1.19 with job in HA state with restart strategy crashes job manager
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-35639
>                 URL: https://issues.apache.org/jira/browse/FLINK-35639
>             Project: Flink
>          Issue Type: Bug
>          Components: API / Core
>    Affects Versions: 1.20.0, 1.19.1
>         Environment: Download the 1.18 and 1.19 binary releases. Add the following to flink-1.19.0/conf/config.yaml and flink-1.18.1/conf/flink-conf.yaml:

```yaml
high-availability: zookeeper
high-availability.zookeeper.quorum: localhost
high-availability.storageDir: file:///tmp/flink/recovery
```

Launch zookeeper: docker run --network host zookeeper:latest
Launch the 1.18 task manager: ./flink-1.18.1/bin/taskmanager.sh start-foreground
Launch the 1.18 job manager: ./flink-1.18.1/bin/jobmanager.sh start-foreground
Launch the following job:

```java
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import java.util.concurrent.TimeUnit;

public class FlinkJob {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setRestartStrategy(
            RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.of(20, TimeUnit.SECONDS))
        );
        env.fromElements("Hello World", "Hello Flink")
            .flatMap(new LineSplitter())
            .groupBy(0)
            .sum(1)
            .print();
    }

    public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            for (String word : value.split(" ")) {
                try {
                    Thread.sleep(12);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                out.collect(new Tuple2<>(word, 1));
            }
        }
    }
}
```

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.flink</groupId>
  <artifactId>myflinkjob</artifactId>
  <version>1.0-SNAPSHOT</version>
  <properties>
    <flink.version>1.18.1</flink.version>
    <java.version>1.8</java.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-java</artifactId>
      <version>${flink.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-java</artifactId>
      <version>${flink.version}</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.8.1</version>
        <configuration>
          <source>${java.version}</source>
          <target>${java.version}</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>3.1.0</version>
        <configuration>
          <archive>
            <manifest>
              <addClasspath>true</addClasspath>
              <classpathPrefix>lib/</classpathPrefix>
              <mainClass>FlinkJob</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
```

Launch the job: ./flink-1.18.1/bin/flink run ../flink-job/target/myflinkjob-1.0-SNAPSHOT.jar
Job has been submitted with JobID 5f0898c964a93a47aa480427f3e2c6c0
Kill the job manager and task manager, then launch the 1.19.0 job manager: ./flink-1.19.0/bin/jobmanager.sh start-foreground

Root cause
==========
It looks like the type of delayBetweenAttemptsInterval was changed in 1.19 (https://github.com/apache/flink/pull/22984/files#diff-d174f32ffdea69de610c4f37c545bd22a253b9846434f83397f1bbc2aaa399faR239), introducing an incompatibility which is not handled by Flink 1.19. In my opinion, the job manager should not crash when starting in that case.

>            Reporter: yazgoo
>            Assignee: Matthias Pohl
>            Priority: Blocker
>
> When trying to upgrade a Flink cluster from 1.18 to 1.19, with a 1.18 job in ZooKeeper HA state, I get a jobmanager crash with a ClassCastException; see the log below.
>
> {code:java}
> 2024-06-18 16:58:14,401 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> org.apache.flink.util.FlinkException: JobMaster for job 5f0898c964a93a47aa480427f3e2c6c0 failed.
>     at org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:1484) ~[flink-dist-1.19.0.jar:1.19.0]
>     at org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:775) ~[flink-dist-1.19.0.jar:1.19.0]
>     at org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:738) ~[flink-dist-1.19.0.jar:1.19.0]
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$7(Dispatcher.java:693)
> {code}
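For background, the failure mode reported here is the generic Java one: a value recovered from HA state written by an older version is still an instance of the old field type, and the first cast to the changed type fails at runtime with a ClassCastException. A minimal sketch of that failure mode (this is not the actual Flink code path; the types and values below are illustrative only, assuming the old representation was a plain millisecond value and the new one a `java.time.Duration`):

```java
public class StaleStateCastDemo {
    /** Returns true if the recovered value can be cast to the new Duration type. */
    static boolean castsToDuration(Object recovered) {
        try {
            java.time.Duration ignored = (java.time.Duration) recovered;
            return ignored != null;
        } catch (ClassCastException e) {
            // This is the kind of error surfaced in the JobManager log above:
            // the persisted object predates the field-type change.
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in for a value deserialized from 1.18-era HA state.
        Object recovered = Long.valueOf(20_000L);
        System.out.println("casts to Duration: " + castsToDuration(recovered));
    }
}
```

The point of the reporter's "Root cause" note is exactly this: once the persisted representation and the expecting code disagree on the type, only explicit migration or defensive handling avoids the crash.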
[jira] [Updated] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager
[ https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-35639:
----------------------------------
    Priority: Blocker  (was: Major)
[jira] [Updated] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager
[ https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-35639:
----------------------------------
    Affects Version/s: 1.20.0
[jira] [Commented] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager
[ https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855979#comment-17855979 ] Matthias Pohl commented on FLINK-35639:
---------------------------------------

I guess you're right. It looks like an error was made when deprecating the {{@PublicEvolving}} API of {{RestartStrategies#FixedDelayRestartStrategyConfiguration}}. I will raise the priority of this one to Blocker because it also affects 1.20.
[jira] [Comment Edited] (FLINK-35601) InitOutputPathTest.testErrorOccursUnSynchronized failed due to NoSuchFieldException
[ https://issues.apache.org/jira/browse/FLINK-35601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855405#comment-17855405 ] Matthias Pohl edited comment on FLINK-35601 at 6/16/24 1:29 PM:

This seems to be caused by the recently merged [PR #24881|https://github.com/apache/flink/pull/24881], which is connected to FLINK-25537. I haven't been able to reproduce the error of this Jira issue (I haven't tried it with JDK 17), but running the test repeatedly eventually causes it to time out (tried 3x; the test ran into a deadlock (?) after 512, 618 and 3724 repetitions). It might be worth reverting the changes of PR #24881.

was (Author: mapohl): This seems to be caused by the recently merged [PR #24881|https://github.com/apache/flink/pull/24881], which is connected to FLINK-25537. I haven't been able to reproduce the error of this Jira issue, but running the test repeatedly eventually causes it to time out (tried 3x; the test ran into a deadlock (?) after 512, 618 and 3724 repetitions). It might be worth reverting the changes of PR #24881.

> InitOutputPathTest.testErrorOccursUnSynchronized failed due to NoSuchFieldException
> -----------------------------------------------------------------------------------
>
>                 Key: FLINK-35601
>                 URL: https://issues.apache.org/jira/browse/FLINK-35601
>             Project: Flink
>          Issue Type: Bug
>          Components: Build System / CI
>    Affects Versions: 1.20.0
>            Reporter: Weijie Guo
>            Priority: Major
>
> {code:java}
> Jun 14 02:17:56 02:17:56.037 [ERROR] org.apache.flink.core.fs.InitOutputPathTest.testErrorOccursUnSynchronized -- Time elapsed: 0.021 s <<< ERROR!
> Jun 14 02:17:56 java.lang.NoSuchFieldException: modifiers
> Jun 14 02:17:56     at java.base/java.lang.Class.getDeclaredField(Class.java:2610)
> Jun 14 02:17:56     at org.apache.flink.core.fs.InitOutputPathTest.testErrorOccursUnSynchronized(InitOutputPathTest.java:59)
> Jun 14 02:17:56     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Jun 14 02:17:56     at java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
> Jun 14 02:17:56     at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
> Jun 14 02:17:56     at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
> Jun 14 02:17:56     at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
> Jun 14 02:17:56     at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
> Jun 14 02:17:56     at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60259=logs=675bf62c-8558-587e-2555-dcad13acefb5=5878eed3-cc1e-5b12-1ed0-9e7139ce0992=6491

-- This message was sent by Atlassian Jira (v8.20.10#820010)
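For context on the {{NoSuchFieldException: modifiers}} itself: the classic test trick of rewriting a {{final}} field via {{Field.class.getDeclaredField("modifiers")}} stopped working when the JDK began filtering reflective access to {{java.lang.reflect}} internals (JDK 12 and later), so the lookup itself now throws exactly the exception seen above. A minimal standalone sketch of that check (not Flink code):

```java
import java.lang.reflect.Field;

public class ModifiersLookupDemo {
    /** Returns true if this JDK still exposes Field.modifiers reflectively. */
    static boolean modifiersFieldVisible() {
        try {
            Field f = Field.class.getDeclaredField("modifiers");
            return f != null;
        } catch (NoSuchFieldException e) {
            // JDK 12+ filters this field; the test failure above hits this path.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Field.modifiers visible: " + modifiersFieldVisible());
    }
}
```

This is why such tests pass on JDK 8 but fail once CI moves to a newer JDK; the usual fix is to restructure the test so it no longer needs to mutate a {{final}} field.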
[jira] [Commented] (FLINK-35601) InitOutputPathTest.testErrorOccursUnSynchronized failed due to NoSuchFieldException
[ https://issues.apache.org/jira/browse/FLINK-35601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855406#comment-17855406 ] Matthias Pohl commented on FLINK-35601:
---------------------------------------

[~gongzhongqiang] can you have a look?

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-35601) InitOutputPathTest.testErrorOccursUnSynchronized failed due to NoSuchFieldException
[ https://issues.apache.org/jira/browse/FLINK-35601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855405#comment-17855405 ] Matthias Pohl commented on FLINK-35601:
---------------------------------------

This seems to be caused by the recently merged [PR #24881|https://github.com/apache/flink/pull/24881], which is connected to FLINK-25537. I haven't been able to reproduce the error of this Jira issue, but running the test repeatedly eventually causes it to time out (tried 3x; the test ran into a deadlock (?) after 512, 618 and 3724 repetitions). It might be worth reverting the changes of PR #24881.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost
[ https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854944#comment-17854944 ] Matthias Pohl commented on FLINK-35042:
---------------------------------------

I noticed that the build failure in the description is unrelated to FLINK-34150, because it appeared on April 8, 2024, whereas FLINK-24150 was only merged on May 10, 2024. But the build failure I shared might be related. So these could actually be two different issues.

> Streaming File Sink s3 end-to-end test failed as TM lost
> --------------------------------------------------------
>
>                 Key: FLINK-35042
>                 URL: https://issues.apache.org/jira/browse/FLINK-35042
>             Project: Flink
>          Issue Type: Bug
>          Components: Build System / CI
>    Affects Versions: 1.20.0
>            Reporter: Weijie Guo
>            Priority: Major
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58782=logs=fb37c667-81b7-5c22-dd91-846535e99a97=011e961e-597c-5c96-04fe-7941c8b83f23=14344
> FAIL 'Streaming File Sink s3 end-to-end test' failed after 15 minutes and 20 seconds! Test exited with exit code 1
> I have checked the JM log; it seems that a taskmanager is no longer reachable:
> {code:java}
> 2024-04-08T01:12:04.3922210Z Apr 08 01:12:04 2024-04-08 00:58:15,517 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: Unnamed (4/4) (14b44f534745ffb2f1ef03fca34f7f0d_0a448493b4782967b150582570326227_3_0) switched from RUNNING to FAILED on localhost:44987-47f5af @ localhost (dataPort=34489).
> 2024-04-08T01:12:04.3924522Z Apr 08 01:12:04 org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id localhost:44987-47f5af is no longer reachable.
> 2024-04-08T01:12:04.3925421Z Apr 08 01:12:04     at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1511) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926185Z Apr 08 01:12:04     at org.apache.flink.runtime.heartbeat.DefaultHeartbeatMonitor.reportHeartbeatRpcFailure(DefaultHeartbeatMonitor.java:126) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926925Z Apr 08 01:12:04     at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.runIfHeartbeatMonitorExists(HeartbeatManagerImpl.java:275) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3929898Z Apr 08 01:12:04     at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.reportHeartbeatTargetUnreachable(HeartbeatManagerImpl.java:267) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3930692Z Apr 08 01:12:04     at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.handleHeartbeatRpcFailure(HeartbeatManagerImpl.java:262) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3931442Z Apr 08 01:12:04     at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.lambda$handleHeartbeatRpc$0(HeartbeatManagerImpl.java:248) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3931917Z Apr 08 01:12:04     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3934759Z Apr 08 01:12:04     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3935252Z Apr 08 01:12:04     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3935989Z Apr 08 01:12:04     at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:460) ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3936731Z Apr 08 01:12:04     at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3938103Z Apr 08 01:12:04     at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:460) ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3942549Z Apr 08 01:12:04     at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:225) ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3945371Z Apr 08 01:12:04     at org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor.handleRpcMessage(FencedPekkoRpcActor.java:88) ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3946244Z Apr 08 01:12:04     at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:174) ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3946960Z Apr 08 01:12:04
> {code}
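Unrelated to the root cause, but for reference when reading the "no longer reachable" message above: it is produced by the JobMaster's heartbeat monitoring, which is governed by two config options. These are the long-standing defaults (raising {{heartbeat.timeout}} only masks a genuinely dead TM; here the TM was killed deliberately by the e2e script):

```yaml
# Flink heartbeat settings (defaults shown, in milliseconds)
heartbeat.interval: 10000   # time between heartbeat requests
heartbeat.timeout: 50000    # time until a target is marked unreachable
```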
[jira] [Comment Edited] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost
[ https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854944#comment-17854944 ] Matthias Pohl edited comment on FLINK-35042 at 6/14/24 6:37 AM:

I noticed that the build failure in the description is unrelated to FLINK-34150, because it appeared on April 8, 2024, whereas FLINK-34150 was only merged on May 10, 2024. But the build failure I shared might be related. So these could actually be two different issues.

was (Author: mapohl): I noticed that the build failure in the description is unrelated to FLINK-34150, because it appeared on April 8, 2024, whereas FLINK-24150 was only merged on May 10, 2024. But the build failure I shared might be related. So these could actually be two different issues.
[jira] [Commented] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost
[ https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854800#comment-17854800 ] Matthias Pohl commented on FLINK-35042: --- This is different from what [~Weijie Guo] observed in his build, where the job never observes the expected 2 TM restarts: {code:java} Apr 08 00:57:27 Submitting job. Apr 08 00:57:39 Job (d0bec02a7136e671f764bba2938933db) is not yet running. Apr 08 00:57:49 Job (d0bec02a7136e671f764bba2938933db) is running. Apr 08 00:57:49 Waiting for job (d0bec02a7136e671f764bba2938933db) to have at least 3 completed checkpoints ... Apr 08 00:58:03 Killing TM Apr 08 00:58:04 TaskManager 138601 killed. Apr 08 00:58:04 Starting TM Apr 08 00:58:06 [INFO] 3 instance(s) of taskexecutor are already running on fv-az68-869. Apr 08 00:58:06 Starting taskexecutor daemon on host fv-az68-869. Apr 08 00:58:06 Waiting for restart to happen Apr 08 00:58:06 Still waiting for restarts. Expected: 1 Current: 0 Apr 08 00:58:11 Still waiting for restarts. Expected: 1 Current: 0 Apr 08 00:58:16 Still waiting for restarts. Expected: 1 Current: 0 Apr 08 00:58:21 Killing 2 TMs Apr 08 00:58:21 TaskManager 141400 killed. Apr 08 00:58:21 TaskManager 139144 killed. Apr 08 00:58:21 Starting 2 TMs Apr 08 00:58:24 [INFO] 2 instance(s) of taskexecutor are already running on fv-az68-869. Apr 08 00:58:24 Starting taskexecutor daemon on host fv-az68-869. Apr 08 00:58:29 [INFO] 3 instance(s) of taskexecutor are already running on fv-az68-869. Apr 08 00:58:29 Starting taskexecutor daemon on host fv-az68-869. Apr 08 00:58:29 Waiting for restart to happen Apr 08 00:58:29 Still waiting for restarts. Expected: 2 Current: 1 Apr 08 00:58:34 Still waiting for restarts. Expected: 2 Current: 1 Apr 08 00:58:39 Still waiting for restarts. Expected: 2 Current: 1 [...] Apr 08 01:11:56 Still waiting for restarts. Expected: 2 Current: 1 Apr 08 01:12:01 Still waiting for restarts. 
Expected: 2 Current: 1 Apr 08 01:12:04 Test (pid: 136749) did not finish after 900 seconds.{code} > Streaming File Sink s3 end-to-end test failed as TM lost > > > Key: FLINK-35042 > URL: https://issues.apache.org/jira/browse/FLINK-35042 > Project: Flink > Issue Type: Bug > Components: Build System / CI >Affects Versions: 1.20.0 >Reporter: Weijie Guo >Priority: Major > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58782=logs=fb37c667-81b7-5c22-dd91-846535e99a97=011e961e-597c-5c96-04fe-7941c8b83f23=14344 > FAIL 'Streaming File Sink s3 end-to-end test' failed after 15 minutes and 20 > seconds! Test exited with exit code 1 > I have checked the JM log, it seems that a taskmanager is no longer reachable: > {code:java} > 2024-04-08T01:12:04.3922210Z Apr 08 01:12:04 2024-04-08 00:58:15,517 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: > Unnamed (4/4) > (14b44f534745ffb2f1ef03fca34f7f0d_0a448493b4782967b150582570326227_3_0) > switched from RUNNING to FAILED on localhost:44987-47f5af @ localhost > (dataPort=34489). > 2024-04-08T01:12:04.3924522Z Apr 08 01:12:04 > org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id > localhost:44987-47f5af is no longer reachable. 
> 2024-04-08T01:12:04.3925421Z Apr 08 01:12:04 at > org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1511) > ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > 2024-04-08T01:12:04.3926185Z Apr 08 01:12:04 at > org.apache.flink.runtime.heartbeat.DefaultHeartbeatMonitor.reportHeartbeatRpcFailure(DefaultHeartbeatMonitor.java:126) > ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > 2024-04-08T01:12:04.3926925Z Apr 08 01:12:04 at > org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.runIfHeartbeatMonitorExists(HeartbeatManagerImpl.java:275) > ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > 2024-04-08T01:12:04.3929898Z Apr 08 01:12:04 at > org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.reportHeartbeatTargetUnreachable(HeartbeatManagerImpl.java:267) > ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > 2024-04-08T01:12:04.3930692Z Apr 08 01:12:04 at > org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.handleHeartbeatRpcFailure(HeartbeatManagerImpl.java:262) > ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > 2024-04-08T01:12:04.3931442Z Apr 08 01:12:04 at > org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.lambda$handleHeartbeatRpc$0(HeartbeatManagerImpl.java:248) > ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT] > 2024-04-08T01:12:04.3931917Z Apr 08 01:12:04 at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > ~[?:1.8.0_402] > 2024-04-08T01:12:04.3934759Z Apr 08 01:12:04 at >
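The "Still waiting for restarts. Expected: … Current: …" lines above come from a polling loop in the bash e2e test script. A minimal, illustrative Python sketch of that loop is shown below; the restart-count lookup is stubbed out via a `fetch_restart_count` callable (a hypothetical stand-in — the real script queries the restart count through Flink's REST API):

```python
import time

def wait_for_restarts(expected, fetch_restart_count, timeout_s=900, poll_s=5):
    """Poll until the job reports at least `expected` restarts, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        current = fetch_restart_count()
        if current >= expected:
            return current
        # Mirrors the log line emitted by the e2e script while waiting.
        print(f"Still waiting for restarts. Expected: {expected} Current: {current}")
        time.sleep(poll_s)
    raise TimeoutError(f"did not observe {expected} restarts within {timeout_s}s")

# Usage with a stubbed counter that reaches the target after two polls:
counts = iter([0, 0, 1])
result = wait_for_restarts(1, lambda: next(counts), timeout_s=10, poll_s=0)
```

The failure mode in this Jira issue corresponds to `fetch_restart_count` never reaching `expected`, so the loop runs until the overall 900-second test timeout fires.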
[jira] [Commented] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost
[ https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854801#comment-17854801 ] Matthias Pohl commented on FLINK-35042: --- I'm linking FLINK-34150 because we refactored the test to rely on Minio rather than the AWS S3 backend.
[jira] [Comment Edited] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost
[ https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854789#comment-17854789 ] Matthias Pohl edited comment on FLINK-35042 at 6/13/24 4:16 PM: [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60237=logs=ef799394-2d67-5ff4-b2e5-410b80c9c0af=9e5768bc-daae-5f5f-1861-e58617922c7a=9817] For that one, it looks like the test never produced the expected number of values: {code:java} Jun 13 13:04:25 Waiting for Dispatcher REST endpoint to come up... Jun 13 13:04:26 Dispatcher REST endpoint is up. Jun 13 13:04:28 [INFO] 1 instance(s) of taskexecutor are already running on fv-az209-180. Jun 13 13:04:28 Starting taskexecutor daemon on host fv-az209-180. Jun 13 13:04:32 [INFO] 2 instance(s) of taskexecutor are already running on fv-az209-180. Jun 13 13:04:32 Starting taskexecutor daemon on host fv-az209-180. Jun 13 13:04:37 [INFO] 3 instance(s) of taskexecutor are already running on fv-az209-180. Jun 13 13:04:37 Starting taskexecutor daemon on host fv-az209-180. Jun 13 13:04:37 Submitting job. Jun 13 13:04:57 Job (be9bc06a08a4c0fc3bf2c9e1c92219d4) is running. Jun 13 13:04:57 Waiting for job (be9bc06a08a4c0fc3bf2c9e1c92219d4) to have at least 3 completed checkpoints ... Jun 13 13:05:06 Killing TM Jun 13 13:05:06 TaskManager 122377 killed. Jun 13 13:05:06 Starting TM Jun 13 13:05:08 [INFO] 3 instance(s) of taskexecutor are already running on fv-az209-180. Jun 13 13:05:08 Starting taskexecutor daemon on host fv-az209-180. Jun 13 13:05:08 Waiting for restart to happen Jun 13 13:05:08 Still waiting for restarts. Expected: 1 Current: 0 Jun 13 13:05:13 Still waiting for restarts. Expected: 1 Current: 0 Jun 13 13:05:18 Still waiting for restarts. Expected: 1 Current: 0 Jun 13 13:05:23 Killing 2 TMs Jun 13 13:05:24 TaskManager 121771 killed. Jun 13 13:05:24 TaskManager 122908 killed. 
Jun 13 13:05:24 Starting 2 TMs Jun 13 13:05:26 [INFO] 2 instance(s) of taskexecutor are already running on fv-az209-180. Jun 13 13:05:26 Starting taskexecutor daemon on host fv-az209-180. Jun 13 13:05:31 [INFO] 3 instance(s) of taskexecutor are already running on fv-az209-180. Jun 13 13:05:31 Starting taskexecutor daemon on host fv-az209-180. Jun 13 13:05:31 Waiting for restart to happen Jun 13 13:05:31 Still waiting for restarts. Expected: 2 Current: 1 Jun 13 13:05:36 Still waiting for restarts. Expected: 2 Current: 1 Jun 13 13:05:41 Waiting until all values have been produced Jun 13 13:05:43 Number of produced values 0/6 [...] {code} was (Author: mapohl): https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60237=logs=ef799394-2d67-5ff4-b2e5-410b80c9c0af=9e5768bc-daae-5f5f-1861-e58617922c7a=9817
[jira] [Commented] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost
[ https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854789#comment-17854789 ] Matthias Pohl commented on FLINK-35042: --- https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60237=logs=ef799394-2d67-5ff4-b2e5-410b80c9c0af=9e5768bc-daae-5f5f-1861-e58617922c7a=9817
[jira] [Commented] (FLINK-35549) FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing for the AdaptiveScheduler
[ https://issues.apache.org/jira/browse/FLINK-35549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853218#comment-17853218 ] Matthias Pohl commented on FLINK-35549: --- cc [~fanrui] I added the code change for FLIP-461. ...if you want to have a look. > FLIP-461: Synchronize rescaling with checkpoint creation to minimize > reprocessing for the AdaptiveScheduler > --- > > Key: FLINK-35549 > URL: https://issues.apache.org/jira/browse/FLINK-35549 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing, Runtime / Coordination >Affects Versions: 1.20.0 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Major > > This is the umbrella issue for implementing > [FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-35550) Introduce new component RescaleManager
[ https://issues.apache.org/jira/browse/FLINK-35550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl reassigned FLINK-35550: - Assignee: Matthias Pohl > Introduce new component RescaleManager > -- > > Key: FLINK-35550 > URL: https://issues.apache.org/jira/browse/FLINK-35550 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Major > > The goal here is to collect the rescaling logic in a single component to > improve testability. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-35549) FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing for the AdaptiveScheduler
[ https://issues.apache.org/jira/browse/FLINK-35549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl reassigned FLINK-35549: - Assignee: Matthias Pohl > FLIP-461: Synchronize rescaling with checkpoint creation to minimize > reprocessing for the AdaptiveScheduler > --- > > Key: FLINK-35549 > URL: https://issues.apache.org/jira/browse/FLINK-35549 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing, Runtime / Coordination >Affects Versions: 1.20.0 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Major > > This is the umbrella issue for implementing > [FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-35553) Integrate newly added trigger interface with checkpointing
[ https://issues.apache.org/jira/browse/FLINK-35553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-35553: -- Description: This connects the newly introduced trigger logic (FLINK-35551) with the {{CheckpointStatsTracker}} (was: This connects the newly introduced trigger logic (FLINK-35551) with the newly added checkpoint lifecycle listening feature (FLINK-35552).) > Integrate newly added trigger interface with checkpointing > -- > > Key: FLINK-35553 > URL: https://issues.apache.org/jira/browse/FLINK-35553 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Checkpointing, Runtime / Coordination >Reporter: Matthias Pohl >Priority: Major > > This connects the newly introduced trigger logic (FLINK-35551) with the > {{CheckpointStatsTracker}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35553) Integrate newly added trigger interface with checkpointing
Matthias Pohl created FLINK-35553: - Summary: Integrate newly added trigger interface with checkpointing Key: FLINK-35553 URL: https://issues.apache.org/jira/browse/FLINK-35553 Project: Flink Issue Type: Sub-task Components: Runtime / Checkpointing, Runtime / Coordination Reporter: Matthias Pohl This connects the newly introduced trigger logic (FLINK-35551) with the newly added checkpoint lifecycle listening feature (FLINK-35552). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35552) Move CheckpointStatsTracker out of ExecutionGraph into Scheduler
Matthias Pohl created FLINK-35552: - Summary: Move CheckpointStatsTracker out of ExecutionGraph into Scheduler Key: FLINK-35552 URL: https://issues.apache.org/jira/browse/FLINK-35552 Project: Flink Issue Type: Sub-task Components: Runtime / Checkpointing, Runtime / Coordination Reporter: Matthias Pohl The scheduler needs to know about the CheckpointStatsTracker to allow listening to checkpoint failures and completion. -- This message was sent by Atlassian Jira (v8.20.10#820010)
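The idea in FLINK-35552 — letting the scheduler listen to checkpoint failures and completion via the CheckpointStatsTracker — can be illustrated with a small listener sketch. The following Python snippet is purely illustrative; the class name mirrors the ticket, but the callback shapes and method names are assumptions, not Flink's actual API:

```python
from typing import Callable, List

class CheckpointStatsTracker:
    """Illustrative sketch: the tracker exposes registration hooks so a
    scheduler can react to checkpoint completion and failure events."""
    def __init__(self) -> None:
        self._on_completed: List[Callable[[int], None]] = []
        self._on_failed: List[Callable[[int], None]] = []

    def register_completed_listener(self, cb: Callable[[int], None]) -> None:
        self._on_completed.append(cb)

    def register_failed_listener(self, cb: Callable[[int], None]) -> None:
        self._on_failed.append(cb)

    def report_completed(self, checkpoint_id: int) -> None:
        for cb in self._on_completed:
            cb(checkpoint_id)

    def report_failed(self, checkpoint_id: int) -> None:
        for cb in self._on_failed:
            cb(checkpoint_id)

# A scheduler-like component subscribing to checkpoint events:
events = []
tracker = CheckpointStatsTracker()
tracker.register_completed_listener(lambda cid: events.append(("completed", cid)))
tracker.register_failed_listener(lambda cid: events.append(("failed", cid)))
tracker.report_completed(7)
tracker.report_failed(8)
```

Moving the tracker out of the ExecutionGraph makes this kind of subscription possible from the scheduler, which is what the sub-task is about.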
[jira] [Created] (FLINK-35551) Introduces RescaleManager#onTrigger endpoint
Matthias Pohl created FLINK-35551: - Summary: Introduces RescaleManager#onTrigger endpoint Key: FLINK-35551 URL: https://issues.apache.org/jira/browse/FLINK-35551 Project: Flink Issue Type: Sub-task Reporter: Matthias Pohl The new endpoint would allow us to separate observing change events from actually triggering the rescale operation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35550) Introduce new component RescaleManager
Matthias Pohl created FLINK-35550: - Summary: Introduce new component RescaleManager Key: FLINK-35550 URL: https://issues.apache.org/jira/browse/FLINK-35550 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Matthias Pohl The goal here is to collect the rescaling logic in a single component to improve testability. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-35551) Introduces RescaleManager#onTrigger endpoint
[ https://issues.apache.org/jira/browse/FLINK-35551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-35551: -- Component/s: Runtime / Coordination > Introduces RescaleManager#onTrigger endpoint > > > Key: FLINK-35551 > URL: https://issues.apache.org/jira/browse/FLINK-35551 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Reporter: Matthias Pohl >Priority: Major > > The new endpoint would allow us to separate observing change events from > actually triggering the rescale operation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
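The observe/trigger separation described in FLINK-35551 can be sketched as follows. This is an illustrative Python model of the split, not Flink's actual RescaleManager implementation; the method names follow the ticket, everything else is an assumption:

```python
class RescaleManager:
    """Illustrative sketch: change events are only *observed*; the actual
    rescale happens when on_trigger() is called (e.g. after a checkpoint
    completes), decoupling detection from execution."""
    def __init__(self, rescale_action):
        self._change_observed = False
        self._rescale_action = rescale_action

    def on_change(self):
        # Record that available resources changed; do not act yet.
        self._change_observed = True

    def on_trigger(self):
        # Act only if a change was observed since the last trigger.
        if self._change_observed:
            self._change_observed = False
            self._rescale_action()

calls = []
mgr = RescaleManager(lambda: calls.append("rescale"))
mgr.on_trigger()   # no observed change -> no rescale
mgr.on_change()
mgr.on_trigger()   # observed change -> rescale once
mgr.on_trigger()   # change already handled -> no second rescale
```

The benefit of the split is that the trigger side can be wired to checkpoint completion (FLINK-35553) without the observation side knowing about checkpoints at all.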
[jira] [Created] (FLINK-35549) FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing for the AdaptiveScheduler
Matthias Pohl created FLINK-35549: - Summary: FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing for the AdaptiveScheduler Key: FLINK-35549 URL: https://issues.apache.org/jira/browse/FLINK-35549 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing, Runtime / Coordination Affects Versions: 1.20.0 Reporter: Matthias Pohl This is the umbrella issue for implementing [FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-35035) Reduce job pause time when cluster resources are expanded in adaptive mode
[ https://issues.apache.org/jira/browse/FLINK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852748#comment-17852748 ] Matthias Pohl edited comment on FLINK-35035 at 6/6/24 11:47 AM: Thanks for the pointer, [~dmvk]. We looked into this issue while working on [FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler] (which is kind of related) and plan to do a follow-up FLIP that will align the resource controlling mechanism of the {{AdaptiveScheduler}}'s {{WaitingForResources}} and {{Executing}} states. Currently, we have parameters intervening in the rescaling in different places ([j.a.scaling-interval.min|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-min], [j.a.scaling-interval.max|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-max] being utilized in {{Executing}} and [j.a.resource-stabilization-timeout|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout] being utilized in {{WaitingForResources}}). Having a {{resource-stabilization}} phase in {{Executing}} should resolve the problem described in this Jira issue. was (Author: mapohl): Thanks for the pointer, [~dmvk]. We looked into this issue while working on [FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler] (which is kind of related) and plan to do a follow-up FLIP that will align the resource controlling mechanism of the {{AdaptiveScheduler}}'s {{WaitingForResources}} and {{Executing}} states. 
Currently, we have parameters intervening in the rescaling in different places ([j.a.scaling-interval.min|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-min], [j.a.scaling-interval.max|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-max] being utilized in {{Executing}} and [j.a.resource-stabilization-timeout|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout] being utilized in {{WaitingForResources}}). Having a {{resource-stabilization}} phase in {{Executing}} should resolve the problem described in this Jira issue here. > Reduce job pause time when cluster resources are expanded in adaptive mode > -- > > Key: FLINK-35035 > URL: https://issues.apache.org/jira/browse/FLINK-35035 > Project: Flink > Issue Type: Improvement > Components: Runtime / Task >Affects Versions: 1.19.0 >Reporter: yuanfenghu >Priority: Minor > > When 'jobmanager.scheduler = adaptive' , job graph changes triggered by > cluster expansion will cause long-term task stagnation. We should reduce this > impact. > As an example: > I have jobgraph for : [v1 (maxp=10 minp = 1)] -> [v2 (maxp=10, minp=1)] > When my cluster has 5 slots, the job will be executed as [v1 p5]->[v2 p5] > When I add slots the task will trigger jobgraph changes,by > org.apache.flink.runtime.scheduler.adaptive.ResourceListener#onNewResourcesAvailable, > However, the five new slots I added were not discovered at the same time (for > convenience, I assume that a taskmanager has one slot), because no matter > what environment we add, we cannot guarantee that the new slots will be added > at once, so this will cause onNewResourcesAvailable triggers repeatedly > ,If each new slot action has a certain interval, then the jobgraph will > continue to change during this period. 
What I hope is that there will be a > stable time to configure the cluster resources and then go to it after the > number of cluster slots has been stable for a certain period of time. Trigger > jobgraph changes to avoid this situation -- This message was sent by Atlassian Jira (v8.20.10#820010)
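The three options named in the comment above are real AdaptiveScheduler settings (their key names follow from the documentation anchors in the links). A flink-conf.yaml fragment combining them might look like the following; the values are illustrative, not recommendations:

```
jobmanager.scheduler: adaptive
jobmanager.adaptive-scheduler.scaling-interval.min: 30s
jobmanager.adaptive-scheduler.scaling-interval.max: 15min
jobmanager.adaptive-scheduler.resource-stabilization-timeout: 10s
```

The follow-up FLIP mentioned in the comment would align these so that resource stabilization also applies while the job is in the {{Executing}} state, not only in {{WaitingForResources}}.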
[jira] [Commented] (FLINK-35035) Reduce job pause time when cluster resources are expanded in adaptive mode
[ https://issues.apache.org/jira/browse/FLINK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852748#comment-17852748 ] Matthias Pohl commented on FLINK-35035: --- Thanks for the pointer, [~dmvk]. We looked into this issue while working on [FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler] (which is kind of related) and plan to do a follow-up FLIP that will align the resource controlling mechanism of the {{AdaptiveScheduler}}'s {{WaitingForResources}} and {{Executing}} states. Currently, we have parameters intervening in the rescaling in different places ([j.a.scaling-interval.min|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-min], [j.a.scaling-interval.max|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-max] being utilized in {{Executing}} and [j.a.resource-stabilization-timeout|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout) being utilized in {{WaitingForResources}}). Having a {{resource-stabilization}} phase in {{Executing}} should resolve the problem described in this Jira issue here. > Reduce job pause time when cluster resources are expanded in adaptive mode > -- > > Key: FLINK-35035 > URL: https://issues.apache.org/jira/browse/FLINK-35035 > Project: Flink > Issue Type: Improvement > Components: Runtime / Task >Affects Versions: 1.19.0 >Reporter: yuanfenghu >Priority: Minor > > When 'jobmanager.scheduler = adaptive' , job graph changes triggered by > cluster expansion will cause long-term task stagnation. We should reduce this > impact. 
> As an example: > I have a jobgraph: [v1 (maxp=10, minp=1)] -> [v2 (maxp=10, minp=1)] > When my cluster has 5 slots, the job will be executed as [v1 p5]->[v2 p5]. > When I add slots, the job will trigger jobgraph changes via > org.apache.flink.runtime.scheduler.adaptive.ResourceListener#onNewResourcesAvailable. > However, the five new slots I added were not discovered at the same time (for > convenience, I assume that a taskmanager has one slot): no matter > what environment we add them in, we cannot guarantee that the new slots will all be added > at once, so onNewResourcesAvailable will be triggered repeatedly. > If each slot addition happens at a certain interval, then the jobgraph will > keep changing during this period. What I hope is that there will be a > configurable stabilization time for the cluster resources, so that > jobgraph changes are only triggered after the > number of cluster slots has been stable for a certain period of time, > avoiding this situation -- This message was sent by Atlassian Jira (v8.20.10#820010)
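The stabilization the reporter asks for amounts to debouncing resource-arrival events: only rescale once the observed slot count has stopped changing for a configured window. A minimal sketch of that idea, assuming hypothetical names (this is not Flink's actual RescaleManager/ResourceListener API):

```java
import java.time.Duration;
import java.time.Instant;

public class SlotStabilizationDemo {
    /** Hypothetical debouncer: fires a rescale only after the slot count
     *  has stayed unchanged for the configured stabilization window. */
    static final class Debouncer {
        private final Duration window;
        private int lastCount = -1;
        private Instant lastChange = Instant.MIN;

        Debouncer(Duration window) { this.window = window; }

        /** Record an observation; returns true if a rescale should fire now. */
        boolean onSlotCount(int count, Instant now) {
            if (count != lastCount) {   // resources are still arriving
                lastCount = count;
                lastChange = now;
                return false;
            }
            // count unchanged: stable once the window has elapsed
            return Duration.between(lastChange, now).compareTo(window) >= 0;
        }
    }

    public static void main(String[] args) {
        Debouncer d = new Debouncer(Duration.ofSeconds(10));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(d.onSlotCount(6, t0));                 // false: count changed
        System.out.println(d.onSlotCount(7, t0.plusSeconds(2)));  // false: still arriving
        System.out.println(d.onSlotCount(7, t0.plusSeconds(13))); // true: stable for 11s
    }
}
```

With slots trickling in one TaskManager at a time, each arrival resets the window, so only the final stable count triggers a single restart instead of one restart per slot.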
[jira] [Commented] (FLINK-33278) RemotePekkoRpcActorTest.failsRpcResultImmediatelyIfRemoteRpcServiceIsNotAvailable fails on AZP
[ https://issues.apache.org/jira/browse/FLINK-33278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852700#comment-17852700 ] Matthias Pohl commented on FLINK-33278: --- What are you seeing on Flink 1.18.x? Are you referring to the test instability or some error? I guess, we haven't continued working on the issue for now because that test instability was only observed once so far. > RemotePekkoRpcActorTest.failsRpcResultImmediatelyIfRemoteRpcServiceIsNotAvailable > fails on AZP > -- > > Key: FLINK-33278 > URL: https://issues.apache.org/jira/browse/FLINK-33278 > Project: Flink > Issue Type: Bug > Components: Runtime / RPC >Affects Versions: 1.19.0 >Reporter: Sergey Nuyanzin >Priority: Critical > Labels: test-stability > Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, > screenshot-4.png > > > This build > [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=53740=logs=0e7be18f-84f2-53f0-a32d-4a5e4a174679=7c1d86e3-35bd-5fd5-3b7c-30c126a78702=6563] > fails as > {noformat} > Oct 15 01:02:20 Multiple Failures (1 failure) > Oct 15 01:02:20 -- failure 1 -- > Oct 15 01:02:20 [Any cause is instance of class 'class > org.apache.flink.runtime.rpc.exceptions.RecipientUnreachableException'] > Oct 15 01:02:20 Expecting any element of: > Oct 15 01:02:20 [java.util.concurrent.CompletionException: > java.util.concurrent.TimeoutException: Invocation of > [RemoteRpcInvocation(SerializedValueRespondingGateway.getSerializedValue())] > at recipient > [pekko.tcp://flink@localhost:38231/user/rpc/8c211f34-41e5-4efe-93bd-8eca6c590a7f] > timed out. This is usually caused by: 1) Pekko failed sending the message > silently, due to problems like oversized payload or serialization failures. > In that case, you should find detailed error information in the logs. 2) The > recipient needs more time for responding, due to problems like slow machines > or network jitters. In that case, you can try to increase pekko.ask.timeout. 
> Oct 15 01:02:20 at > java.util.concurrent.CompletableFuture.reportJoin(CompletableFuture.java:375) > Oct 15 01:02:20 at > java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947) > Oct 15 01:02:20 at > org.apache.flink.runtime.rpc.pekko.RemotePekkoRpcActorTest.lambda$failsRpcResultImmediatelyIfRemoteRpcServiceIsNotAvailable$1(RemotePekkoRpcActorTest.java:168) > Oct 15 01:02:20 ...(63 remaining lines not displayed - this can be > changed with Assertions.setMaxStackTraceElementsDisplayed), > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project
[ https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852519#comment-17852519 ] Matthias Pohl commented on FLINK-34989: --- Quote from today's Infra roundtable: * The job concurrency policies are not enforced for now * The FT runner policy items are monitored and enforced by Infra > Apache Infra requests to reduce the runner usage for a project > -- > > Key: FLINK-34989 > URL: https://issues.apache.org/jira/browse/FLINK-34989 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Major > Labels: pull-request-available > > The GitHub Actions CI utilizes runners that are hosted by Apache Infra right > now. These runners are limited. The runner usage can be monitored via the > following links: > * [Flink-specific > report|https://infra-reports.apache.org/#ghactions=flink=168] > (needs ASF committer rights) This project-specific report can only be > modified through the HTTP GET parameters of the URL. > * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF > membership) > There was a policy change announced recently: > {quote} > Policy change on use of GitHub Actions > Due to misconfigurations in their builds, some projects have been using > unsupportable numbers of GitHub Actions. As part of fixing this situation, > Infra has added a 'resource use' section to the policy on GitHub Actions. > This section of the policy will come into effect on April 20, 2024: > All workflows MUST have a job concurrency level less than or equal to 20. > This means a workflow cannot have more than 20 jobs running at the same time > across all matrices. > All workflows SHOULD have a job concurrency level less than or equal to 15. > Just because 20 is the max, doesn't mean you should strive for 20. 
> The average number of minutes a project uses per calendar week MUST NOT > exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 > hours). > The average number of minutes a project uses in any consecutive five-day > period MUST NOT exceed the equivalent of 30 full-time runners (216,000 > minutes, or 3,600 hours). > Projects whose builds consistently cross the maximum use limits will lose > their access to GitHub Actions until they fix their build configurations. > The full policy is at > https://infra.apache.org/github-actions-policy.html. > {quote} > Currently (last week of March 2024) Flink was ranked at #19 of projects that > used the Apache Infra runner resources the most which doesn't seem too bad. > This contained not only Apache Flink but also the Kubernetes operator, > connectors and other resources. According to [this > source|https://infra.apache.org/github-actions-secrets.html] Apache Infra > manages 180 runners right now. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-35485) JobMaster failed with "the job xx has not been finished"
[ https://issues.apache.org/jira/browse/FLINK-35485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850600#comment-17850600 ] Matthias Pohl commented on FLINK-35485: --- Hi [~xccui], thanks for reporting the issue. Can you provide the JobManager logs for this case? That would help getting a better understanding of what was going on. > JobMaster failed with "the job xx has not been finished" > > > Key: FLINK-35485 > URL: https://issues.apache.org/jira/browse/FLINK-35485 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.18.1 >Reporter: Xingcan Cui >Priority: Major > > We ran a session cluster on K8s and used Flink SQL gateway to submit queries. > Hit the following rare exception once which caused the job manager to restart. > {code:java} > org.apache.flink.util.FlinkException: JobMaster for job > 50d681ae1e8170f77b4341dda6aba9bc failed. > at > org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:1454) > at > org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:776) > at > org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$6(Dispatcher.java:698) > at java.base/java.util.concurrent.CompletableFuture.uniHandle(Unknown > Source) > at > java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(Unknown > Source) > at java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown > Source) > at > org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:451) > at > org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) > at > org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:451) > at > org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:218) > at > org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor.handleRpcMessage(FencedPekkoRpcActor.java:85) > at > 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:168) > at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) > at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29) > at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) > at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) > at > org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) > at org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547) > at org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545) > at > org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229) > at org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590) > at org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557) > at org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280) > at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241) > at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253) > at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source) > at > java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown > Source) > at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source) > at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source) > at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) > Caused by: org.apache.flink.runtime.jobmaster.JobNotFinishedException: The > job (50d681ae1e8170f77b4341dda6aba9bc) has not been finished. 
> at > org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.closeAsync(DefaultJobMasterServiceProcess.java:157) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.stopJobMasterServiceProcess(JobMasterServiceLeadershipRunner.java:431) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.callIfRunning(JobMasterServiceLeadershipRunner.java:476) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$stopJobMasterServiceProcessAsync$12(JobMasterServiceLeadershipRunner.java:407) > at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(Unknown > Source) > at java.base/java.util.concurrent.CompletableFuture.thenCompose(Unknown > Source) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.stopJobMasterServiceProcessAsync(JobMasterServiceLeadershipRunner.java:405) > at >
[jira] [Commented] (FLINK-34513) GroupAggregateRestoreTest.testRestore fails
[ https://issues.apache.org/jira/browse/FLINK-34513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850387#comment-17850387 ] Matthias Pohl commented on FLINK-34513: --- * 1.20 (Java 8): https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=59935=logs=0c940707-2659-5648-cbe6-a1ad63045f0a=075c2716-8010-5565-fe08-3c4bb45824a4=11686 > GroupAggregateRestoreTest.testRestore fails > --- > > Key: FLINK-34513 > URL: https://issues.apache.org/jira/browse/FLINK-34513 > Project: Flink > Issue Type: Bug > Components: Table SQL / Planner >Affects Versions: 1.20.0 >Reporter: Matthias Pohl >Assignee: Bonnie Varghese >Priority: Critical > Labels: pull-request-available, test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57828=logs=26b84117-e436-5720-913e-3e280ce55cae=77cc7e77-39a0-5007-6d65-4137ac13a471=10881 > {code} > Feb 24 01:12:01 01:12:01.384 [ERROR] Tests run: 10, Failures: 1, Errors: 0, > Skipped: 1, Time elapsed: 2.957 s <<< FAILURE! -- in > org.apache.flink.table.planner.plan.nodes.exec.stream.GroupAggregateRestoreTest > Feb 24 01:12:01 01:12:01.384 [ERROR] > org.apache.flink.table.planner.plan.nodes.exec.stream.GroupAggregateRestoreTest.testRestore(TableTestProgram, > ExecNodeMetadata)[4] -- Time elapsed: 0.653 s <<< FAILURE! 
> Feb 24 01:12:01 java.lang.AssertionError: > Feb 24 01:12:01 > Feb 24 01:12:01 Expecting actual: > Feb 24 01:12:01 ["+I[3, 1, 2, 8, 31, 10.0, 3]", > Feb 24 01:12:01 "+I[2, 1, 4, 14, 42, 7.0, 6]", > Feb 24 01:12:01 "+I[1, 1, 4, 12, 24, 6.0, 4]", > Feb 24 01:12:01 "+U[2, 1, 4, 14, 57, 8.0, 7]", > Feb 24 01:12:01 "+U[1, 1, 4, 12, 32, 6.0, 5]", > Feb 24 01:12:01 "+I[7, 0, 1, 7, 7, 7.0, 1]", > Feb 24 01:12:01 "+U[2, 1, 4, 14, 57, 7.0, 7]", > Feb 24 01:12:01 "+U[1, 1, 4, 12, 32, 5.0, 5]", > Feb 24 01:12:01 "+U[3, 1, 2, 8, 31, 9.0, 3]", > Feb 24 01:12:01 "+U[7, 0, 1, 7, 7, 7.0, 2]"] > Feb 24 01:12:01 to contain exactly in any order: > Feb 24 01:12:01 ["+I[3, 1, 2, 8, 31, 10.0, 3]", > Feb 24 01:12:01 "+I[2, 1, 4, 14, 42, 7.0, 6]", > Feb 24 01:12:01 "+I[1, 1, 4, 12, 24, 6.0, 4]", > Feb 24 01:12:01 "+U[2, 1, 4, 14, 57, 8.0, 7]", > Feb 24 01:12:01 "+U[1, 1, 4, 12, 32, 6.0, 5]", > Feb 24 01:12:01 "+U[3, 1, 2, 8, 31, 9.0, 3]", > Feb 24 01:12:01 "+U[2, 1, 4, 14, 57, 7.0, 7]", > Feb 24 01:12:01 "+I[7, 0, 1, 7, 7, 7.0, 2]", > Feb 24 01:12:01 "+U[1, 1, 4, 12, 32, 5.0, 5]"] > Feb 24 01:12:01 elements not found: > Feb 24 01:12:01 ["+I[7, 0, 1, 7, 7, 7.0, 2]"] > Feb 24 01:12:01 and elements not expected: > Feb 24 01:12:01 ["+I[7, 0, 1, 7, 7, 7.0, 1]", "+U[7, 0, 1, 7, 7, 7.0, 2]"] > Feb 24 01:12:01 > Feb 24 01:12:01 at > org.apache.flink.table.planner.plan.nodes.exec.testutils.RestoreTestBase.testRestore(RestoreTestBase.java:313) > Feb 24 01:12:01 at > java.base/java.lang.reflect.Method.invoke(Method.java:580) > [...] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34582) release build tools lost the newly added py3.11 packages for mac
[ https://issues.apache.org/jira/browse/FLINK-34582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849185#comment-17849185 ] Matthias Pohl commented on FLINK-34582: --- You're checking [~hxb]'s fork where the {{master}} branch doesn't seem to be up-to-date. [apache/flink:flink-python/dev/build-wheels.sh|https://github.com/apache/flink/blob/master/flink-python/dev/build-wheels.sh#L19-L26] does, indeed, have 3.11 added to the python version list. > release build tools lost the newly added py3.11 packages for mac > > > Key: FLINK-34582 > URL: https://issues.apache.org/jira/browse/FLINK-34582 > Project: Flink > Issue Type: Bug >Affects Versions: 1.19.0, 1.20.0 >Reporter: lincoln lee >Assignee: Xingbo Huang >Priority: Blocker > Labels: pull-request-available > Fix For: 1.19.0, 1.20.0 > > Attachments: image-2024-03-07-10-39-49-341.png > > > during 1.19.0-rc1 building binaries via > tools/releasing/create_binary_release.sh > lost the newly added py3.11 2 packages for mac -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34672) HA deadlock between JobMasterServiceLeadershipRunner and DefaultLeaderElectionService
[ https://issues.apache.org/jira/browse/FLINK-34672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848648#comment-17848648 ] Matthias Pohl commented on FLINK-34672: --- I'm still trying to find a reviewer. It's on my plate. But it's not a blocker because the issue already existed in older versions of Flink: {quote} I also verified that this is not something that was introduced in Flink 1.18 with the FLIP-285 changes. AFAIS, it can also happen in 1.17- (I didn't check the pre-FLINK-24038 code but only looked into release-1.17). {quote} > HA deadlock between JobMasterServiceLeadershipRunner and > DefaultLeaderElectionService > - > > Key: FLINK-34672 > URL: https://issues.apache.org/jira/browse/FLINK-34672 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.17.2, 1.19.0, 1.18.1, 1.20.0 >Reporter: Chesnay Schepler >Assignee: Matthias Pohl >Priority: Major > Labels: pull-request-available > Fix For: 1.18.2, 1.20.0, 1.19.1 > > > We recently observed a deadlock in the JM within the HA system. > (see below for the thread dump) > [~mapohl] and I looked a bit into it and there appears to be a race condition > when leadership is revoked while a JobMaster is being started. > It appears to be caused by > {{JobMasterServiceLeadershipRunner#createNewJobMasterServiceProcess}} > forwarding futures while holding a lock; depending on whether the forwarded > future is already complete the next stage may or may not run while holding > that same lock. > We haven't determined yet whether we should be holding that lock or not. 
> {code} > "DefaultLeaderElectionService-leadershipOperationExecutor-thread-1" #131 > daemon prio=5 os_prio=0 cpu=157.44ms elapsed=78749.65s tid=0x7f531f43d000 > nid=0x19d waiting for monitor entry [0x7f53084fd000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfStateRunning(JobMasterServiceLeadershipRunner.java:462) > - waiting to lock <0xf1c0e088> (a java.lang.Object) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.revokeLeadership(JobMasterServiceLeadershipRunner.java:397) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.notifyLeaderContenderOfLeadershipLoss(DefaultLeaderElectionService.java:484) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1252/0x000840ddec40.accept(Unknown > Source) > at java.util.HashMap.forEach(java.base@11.0.22/HashMap.java:1337) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.onRevokeLeadershipInternal(DefaultLeaderElectionService.java:452) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1251/0x000840dcf840.run(Unknown > Source) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.lambda$runInLeaderEventThread$3(DefaultLeaderElectionService.java:549) > - locked <0xf0e3f4d8> (a java.lang.Object) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1075/0x000840c23040.run(Unknown > Source) > at > java.util.concurrent.CompletableFuture$AsyncRun.run(java.base@11.0.22/CompletableFuture.java:1736) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.22/ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.22/ThreadPoolExecutor.java:628) > at java.lang.Thread.run(java.base@11.0.22/Thread.java:829) > {code} > {code} > "jobmanager-io-thread-1" #636 daemon prio=5 os_prio=0 cpu=125.56ms > 
elapsed=78699.01s tid=0x7f5321c6e800 nid=0x396 waiting for monitor entry > [0x7f530567d000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.hasLeadership(DefaultLeaderElectionService.java:366) > - waiting to lock <0xf0e3f4d8> (a java.lang.Object) > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElection.hasLeadership(DefaultLeaderElection.java:52) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.isValidLeader(JobMasterServiceLeadershipRunner.java:509) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$forwardIfValidLeader$15(JobMasterServiceLeadershipRunner.java:520) > - locked <0xf1c0e088> (a java.lang.Object) > at >
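The race described above, where a forwarded future's next stage may or may not run while holding the lock, hinges on a CompletableFuture property that this standalone sketch (plain JDK, no Flink classes) illustrates: a non-async stage chained onto an already-completed future runs immediately in the chaining thread, i.e. under any monitor that thread currently holds, whereas a stage on an incomplete future is merely registered and runs later in the completing thread.

```java
import java.util.concurrent.CompletableFuture;

public class SyncStageUnderLockDemo {
    public static void main(String[] args) {
        Object lock = new Object();
        CompletableFuture<Void> alreadyDone = CompletableFuture.completedFuture(null);
        CompletableFuture<Void> pending = new CompletableFuture<>();

        final Thread[] ranIn = new Thread[2];
        synchronized (lock) {
            // Case 1: future already complete -> the callback executes right
            // here, while we still hold `lock` (this is where a deadlock can
            // start if the callback then tries to acquire another lock).
            alreadyDone.thenRun(() -> ranIn[0] = Thread.currentThread());

            // Case 2: future not complete -> the callback is only registered;
            // it will run in whichever thread eventually completes `pending`.
            pending.thenRun(() -> ranIn[1] = Thread.currentThread());
        }
        System.out.println(ranIn[0] == Thread.currentThread()); // ran inline
        System.out.println(ranIn[1] == null);                   // not run yet
    }
}
```

So whether the "next stage" of a forwarded future runs under `JobMasterServiceLeadershipRunner`'s lock depends entirely on the completion timing, which is exactly the kind of non-determinism behind the lock ordering in the two thread dumps above.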
[jira] [Assigned] (FLINK-20402) Migrate test_tpch.sh
[ https://issues.apache.org/jira/browse/FLINK-20402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl reassigned FLINK-20402: - Assignee: Muhammet Orazov > Migrate test_tpch.sh > > > Key: FLINK-20402 > URL: https://issues.apache.org/jira/browse/FLINK-20402 > Project: Flink > Issue Type: Sub-task > Components: Table SQL / Ecosystem, Tests >Reporter: Jark Wu >Assignee: Muhammet Orazov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-20392) Migrating bash e2e tests to Java/Docker
[ https://issues.apache.org/jira/browse/FLINK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846924#comment-17846924 ] Matthias Pohl commented on FLINK-20392: --- Sure, sounds reasonable. Feel free to update it. > Migrating bash e2e tests to Java/Docker > --- > > Key: FLINK-20392 > URL: https://issues.apache.org/jira/browse/FLINK-20392 > Project: Flink > Issue Type: Technical Debt > Components: Test Infrastructure, Tests >Reporter: Matthias Pohl >Priority: Minor > Labels: auto-deprioritized-major, auto-deprioritized-minor, > starter > > This Jira issue serves as an umbrella ticket for single e2e test migration > tasks. This should enable us to migrate all bash-based e2e tests step-by-step. > The goal is to utilize the e2e test framework (see > [flink-end-to-end-tests-common|https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-end-to-end-tests-common]). > Ideally, the test should use Docker containers as much as possible > disconnect the execution from the environment. A good source to achieve that > is [testcontainers.org|https://www.testcontainers.org/]. > The related ML discussion is [Stop adding new bash-based e2e tests to > Flink|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stop-adding-new-bash-based-e2e-tests-to-Flink-td46607.html]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-20392) Migrating bash e2e tests to Java/Docker
[ https://issues.apache.org/jira/browse/FLINK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846838#comment-17846838 ] Matthias Pohl commented on FLINK-20392: --- This discussion feels similar to our efforts around migrating to JUnit5 and assertj as the standard for our JUnit tests. It cost (and is still costing) quite a bit of resources, with the risk of missing things when reviewing the tests. That is why I still see value in just keeping both options around. That requires fewer resources and we're not losing much. The pros and cons are still a good guideline for developers to decide on which technology to use if they are planning to create a new e2e test in Java. WDYT? > Migrating bash e2e tests to Java/Docker > --- > > Key: FLINK-20392 > URL: https://issues.apache.org/jira/browse/FLINK-20392 > Project: Flink > Issue Type: Technical Debt > Components: Test Infrastructure, Tests >Reporter: Matthias Pohl >Priority: Minor > Labels: auto-deprioritized-major, auto-deprioritized-minor, > starter > > This Jira issue serves as an umbrella ticket for single e2e test migration > tasks. This should enable us to migrate all bash-based e2e tests step-by-step. > The goal is to utilize the e2e test framework (see > [flink-end-to-end-tests-common|https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-end-to-end-tests-common]). > Ideally, the test should use Docker containers as much as possible to > disconnect the execution from the environment. A good source to achieve that > is [testcontainers.org|https://www.testcontainers.org/]. > The related ML discussion is [Stop adding new bash-based e2e tests to > Flink|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stop-adding-new-bash-based-e2e-tests-to-Flink-td46607.html]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-20392) Migrating bash e2e tests to Java/Docker
[ https://issues.apache.org/jira/browse/FLINK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846540#comment-17846540 ] Matthias Pohl commented on FLINK-20392: --- Thanks for the write-up. I'm just wondering whether we gain anything from only allowing one of the two approaches. What about allowing both options? > Migrating bash e2e tests to Java/Docker > --- > > Key: FLINK-20392 > URL: https://issues.apache.org/jira/browse/FLINK-20392 > Project: Flink > Issue Type: Technical Debt > Components: Test Infrastructure, Tests >Reporter: Matthias Pohl >Priority: Minor > Labels: auto-deprioritized-major, auto-deprioritized-minor, > starter > > This Jira issue serves as an umbrella ticket for single e2e test migration > tasks. This should enable us to migrate all bash-based e2e tests step-by-step. > The goal is to utilize the e2e test framework (see > [flink-end-to-end-tests-common|https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-end-to-end-tests-common]). > Ideally, the test should use Docker containers as much as possible > disconnect the execution from the environment. A good source to achieve that > is [testcontainers.org|https://www.testcontainers.org/]. > The related ML discussion is [Stop adding new bash-based e2e tests to > Flink|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stop-adding-new-bash-based-e2e-tests-to-Flink-td46607.html]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-34324) s3_setup is called in test_file_sink.sh even if the common_s3.sh is not sourced
[ https://issues.apache.org/jira/browse/FLINK-34324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845238#comment-17845238 ] Matthias Pohl edited comment on FLINK-34324 at 5/10/24 8:07 AM: * master: [93526c2f3247598ce80854cf65dd4440eb5aaa43|https://github.com/apache/flink/commit/93526c2f3247598ce80854cf65dd4440eb5aaa43] * 1.19: [8707c63ee147085671a9ae1b294854bac03fc914|https://github.com/apache/flink/commit/8707c63ee147085671a9ae1b294854bac03fc914] * 1.18: [7d98ab060be82fe3684d15501b9eb83373303d18|https://github.com/apache/flink/commit/7d98ab060be82fe3684d15501b9eb83373303d18] was (Author: mapohl): * master ** [93526c2f3247598ce80854cf65dd4440eb5aaa43|https://github.com/apache/flink/commit/93526c2f3247598ce80854cf65dd4440eb5aaa43] * 1.19 ** [8707c63ee147085671a9ae1b294854bac03fc914|https://github.com/apache/flink/commit/8707c63ee147085671a9ae1b294854bac03fc914] * 1.18 ** [7d98ab060be82fe3684d15501b9eb83373303d18|https://github.com/apache/flink/commit/7d98ab060be82fe3684d15501b9eb83373303d18] > s3_setup is called in test_file_sink.sh even if the common_s3.sh is not > sourced > --- > > Key: FLINK-34324 > URL: https://issues.apache.org/jira/browse/FLINK-34324 > Project: Flink > Issue Type: Bug > Components: Connectors / Hadoop Compatibility, Tests >Affects Versions: 1.17.2, 1.19.0, 1.18.1 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Major > Labels: pull-request-available, test-stability > Fix For: 1.18.2, 1.20.0, 1.19.1 > > > See example CI run from the FLINK-34150 PR: > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56570=logs=af184cdd-c6d8-5084-0b69-7e9c67b35f7a=0f3adb59-eefa-51c6-2858-3654d9e0749d=3191 > {code} > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_file_sink.sh: > line 38: s3_setup: command not found > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (FLINK-34324) s3_setup is called in test_file_sink.sh even if the common_s3.sh is not sourced
[ https://issues.apache.org/jira/browse/FLINK-34324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl resolved FLINK-34324. --- Fix Version/s: 1.18.2 1.20.0 1.19.1 Resolution: Fixed * master ** [93526c2f3247598ce80854cf65dd4440eb5aaa43|https://github.com/apache/flink/commit/93526c2f3247598ce80854cf65dd4440eb5aaa43] * 1.19 ** [8707c63ee147085671a9ae1b294854bac03fc914|https://github.com/apache/flink/commit/8707c63ee147085671a9ae1b294854bac03fc914] * 1.18 ** [7d98ab060be82fe3684d15501b9eb83373303d18|https://github.com/apache/flink/commit/7d98ab060be82fe3684d15501b9eb83373303d18] > s3_setup is called in test_file_sink.sh even if the common_s3.sh is not > sourced > --- > > Key: FLINK-34324 > URL: https://issues.apache.org/jira/browse/FLINK-34324 > Project: Flink > Issue Type: Bug > Components: Connectors / Hadoop Compatibility, Tests >Affects Versions: 1.17.2, 1.19.0, 1.18.1 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Major > Labels: pull-request-available, test-stability > Fix For: 1.18.2, 1.20.0, 1.19.1 > > > See example CI run from the FLINK-34150 PR: > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56570=logs=af184cdd-c6d8-5084-0b69-7e9c67b35f7a=0f3adb59-eefa-51c6-2858-3654d9e0749d=3191 > {code} > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_file_sink.sh: > line 38: s3_setup: command not found > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-34937) Apache Infra GHA policy update
[ https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl reassigned FLINK-34937: - Assignee: Matthias Pohl > Apache Infra GHA policy update > -- > > Key: FLINK-34937 > URL: https://issues.apache.org/jira/browse/FLINK-34937 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Major > Labels: pull-request-available > > There is a policy update [announced in the infra > ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which > asked Apache projects to limit the number of runners per job. Additionally, > the [GHA policy|https://infra.apache.org/github-actions-policy.html] is > referenced which I wasn't aware of when working on the action workflow. > This issue is about applying the policy to the Flink GHA workflows. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project
[ https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833910#comment-17833910 ] Matthias Pohl commented on FLINK-34989: --- [~martijnvisser] pointed out that we might need to fix this in the connector repos as well. > Apache Infra requests to reduce the runner usage for a project > -- > > Key: FLINK-34989 > URL: https://issues.apache.org/jira/browse/FLINK-34989 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Major > Labels: pull-request-available > > The GitHub Actions CI utilizes runners that are hosted by Apache Infra right > now. These runners are limited. The runner usage can be monitored via the > following links: > * [Flink-specific > report|https://infra-reports.apache.org/#ghactions=flink=168] > (needs ASF committer rights) This project-specific report can only be > modified through the HTTP GET parameters of the URL. > * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF > membership) > There was a policy change announced recently: > {quote} > Policy change on use of GitHub Actions > Due to misconfigurations in their builds, some projects have been using > unsupportable numbers of GitHub Actions. As part of fixing this situation, > Infra has added a 'resource use' section to the policy on GitHub Actions. > This section of the policy will come into effect on April 20, 2024: > All workflows MUST have a job concurrency level less than or equal to 20. > This means a workflow cannot have more than 20 jobs running at the same time > across all matrices. > All workflows SHOULD have a job concurrency level less than or equal to 15. > Just because 20 is the max, doesn't mean you should strive for 20. > The average number of minutes a project uses per calendar week MUST NOT > exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 > hours). 
> The average number of minutes a project uses in any consecutive five-day > period MUST NOT exceed the equivalent of 30 full-time runners (216,000 > minutes, or 3,600 hours). > Projects whose builds consistently cross the maximum use limits will lose > their access to GitHub Actions until they fix their build configurations. > The full policy is at > https://infra.apache.org/github-actions-policy.html. > {quote} > Currently (last week of March 2024) Flink was ranked at #19 of projects that > used the Apache Infra runner resources the most which doesn't seem too bad. > This contained not only Apache Flink but also the Kubernetes operator, > connectors and other resources. According to [this > source|https://infra.apache.org/github-actions-secrets.html] Apache Infra > manages 180 runners right now. -- This message was sent by Atlassian Jira (v8.20.10#820010)
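For reference, the runner-minute equivalences quoted in the policy can be sanity-checked with a little arithmetic. A minimal sketch (the `runner_minutes` helper is illustrative, not part of any Flink or Infra tooling); note that 25 full-time runners over a 7-day week comes out to 252,000 minutes, so the policy's "250,000 minutes" figure appears to be rounded:

```python
# Sanity-check of the runner-minute equivalences quoted in the Infra policy.
MINUTES_PER_DAY = 24 * 60

def runner_minutes(runners: int, days: int) -> int:
    """Total runner-minutes for `runners` full-time runners over `days` days."""
    return runners * days * MINUTES_PER_DAY

weekly_cap = runner_minutes(25, 7)       # 252,000 minutes (policy quotes ~250,000)
weekly_cap_hours = weekly_cap // 60      # 4,200 hours, matching the quote
five_day_cap = runner_minutes(30, 5)     # 216,000 minutes, matching the quote
five_day_cap_hours = five_day_cap // 60  # 3,600 hours, matching the quote
```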
[jira] [Resolved] (FLINK-34999) PR CI stopped operating
[ https://issues.apache.org/jira/browse/FLINK-34999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl resolved FLINK-34999. --- Resolution: Fixed Thanks for working on it. I verified that [PR CI|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] is picked up again. (y) > PR CI stopped operating > --- > > Key: FLINK-34999 > URL: https://issues.apache.org/jira/browse/FLINK-34999 > Project: Flink > Issue Type: Bug > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Blocker > > There are no [new PR CI > runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] > being picked up anymore. [Recently updated > PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not > picked up by the @flinkbot. > In the meantime there was a notification sent from GitHub that the password > of the [@flinkbot|https://github.com/flinkbot] was reset for security > reasons. It's quite likely that these two events are related. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-35005) SqlClientITCase Failed to build JobManager image
[ https://issues.apache.org/jira/browse/FLINK-35005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-35005: -- Component/s: Test Infrastructure > SqlClientITCase Failed to build JobManager image > > > Key: FLINK-35005 > URL: https://issues.apache.org/jira/browse/FLINK-35005 > Project: Flink > Issue Type: Bug > Components: Test Infrastructure >Affects Versions: 1.20.0 >Reporter: Ryan Skraba >Priority: Critical > Labels: test-stability > > jdk21 > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58708=logs=dc1bf4ed-4646-531a-f094-e103042be549=fb3d654d-52f8-5b98-fe9d-b18dd2e2b790=15140 > {code} > Apr 03 02:59:16 02:59:16.247 [INFO] > --- > Apr 03 02:59:16 02:59:16.248 [INFO] T E S T S > Apr 03 02:59:16 02:59:16.248 [INFO] > --- > Apr 03 02:59:17 02:59:17.841 [INFO] Running SqlClientITCase > Apr 03 03:03:15 at > java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312) > Apr 03 03:03:15 at > java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843) > Apr 03 03:03:15 at > java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808) > Apr 03 03:03:15 at > java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188) > Apr 03 03:03:15 Caused by: > org.apache.flink.connector.testframe.container.ImageBuildException: Failed to > build image "flink-configured-jobmanager" > Apr 03 03:03:15 at > org.apache.flink.connector.testframe.container.FlinkImageBuilder.build(FlinkImageBuilder.java:234) > Apr 03 03:03:15 at > org.apache.flink.connector.testframe.container.FlinkTestcontainersConfigurator.configureJobManagerContainer(FlinkTestcontainersConfigurator.java:65) > Apr 03 03:03:15 ... 
12 more > Apr 03 03:03:15 Caused by: java.lang.RuntimeException: > com.github.dockerjava.api.exception.DockerClientException: Could not build > image: Head > "https://registry-1.docker.io/v2/library/eclipse-temurin/manifests/21-jre-jammy": > received unexpected HTTP status: 500 Internal Server Error > Apr 03 03:03:15 at > org.rnorth.ducttape.timeouts.Timeouts.callFuture(Timeouts.java:68) > Apr 03 03:03:15 at > org.rnorth.ducttape.timeouts.Timeouts.getWithTimeout(Timeouts.java:43) > Apr 03 03:03:15 at > org.testcontainers.utility.LazyFuture.get(LazyFuture.java:47) > Apr 03 03:03:15 at > org.apache.flink.connector.testframe.container.FlinkImageBuilder.buildBaseImage(FlinkImageBuilder.java:255) > Apr 03 03:03:15 at > org.apache.flink.connector.testframe.container.FlinkImageBuilder.build(FlinkImageBuilder.java:206) > Apr 03 03:03:15 ... 13 more > Apr 03 03:03:15 Caused by: > com.github.dockerjava.api.exception.DockerClientException: Could not build > image: Head > "https://registry-1.docker.io/v2/library/eclipse-temurin/manifests/21-jre-jammy": > received unexpected HTTP status: 500 Internal Server Error > Apr 03 03:03:15 at > com.github.dockerjava.api.command.BuildImageResultCallback.getImageId(BuildImageResultCallback.java:78) > Apr 03 03:03:15 at > com.github.dockerjava.api.command.BuildImageResultCallback.awaitImageId(BuildImageResultCallback.java:50) > Apr 03 03:03:15 at > org.testcontainers.images.builder.ImageFromDockerfile.resolve(ImageFromDockerfile.java:159) > Apr 03 03:03:15 at > org.testcontainers.images.builder.ImageFromDockerfile.resolve(ImageFromDockerfile.java:40) > Apr 03 03:03:15 at > org.testcontainers.utility.LazyFuture.getResolvedValue(LazyFuture.java:19) > Apr 03 03:03:15 at > org.testcontainers.utility.LazyFuture.get(LazyFuture.java:41) > Apr 03 03:03:15 at > java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317) > Apr 03 03:03:15 at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) > Apr 03 
03:03:15 at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) > Apr 03 03:03:15 at java.base/java.lang.Thread.run(Thread.java:1583) > Apr 03 03:03:15 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-35004) SqlGatewayE2ECase could not start container
[ https://issues.apache.org/jira/browse/FLINK-35004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-35004: -- Component/s: Test Infrastructure > SqlGatewayE2ECase could not start container > --- > > Key: FLINK-35004 > URL: https://issues.apache.org/jira/browse/FLINK-35004 > Project: Flink > Issue Type: Bug > Components: Test Infrastructure >Affects Versions: 1.20.0 >Reporter: Ryan Skraba >Priority: Critical > Labels: github-actions, test-stability > > 1.20, jdk17: > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58708=logs=e8e46ef5-75cc-564f-c2bd-1797c35cbebe=60c49903-2505-5c25-7e46-de91b1737bea=15078 > There is an error: "Process failed due to timeout" in > {{SqlGatewayE2ECase.testSqlClientExecuteStatement}}. In the maven logs, we > can see: > {code:java} > 02:57:26,979 [main] INFO tc.prestodb/hdp2.6-hive:10 > [] - Image prestodb/hdp2.6-hive:10 pull took > PT43.59218S > 02:57:26,991 [main] INFO tc.prestodb/hdp2.6-hive:10 > [] - Creating container for image: > prestodb/hdp2.6-hive:10 > 02:57:27,032 [main] INFO tc.prestodb/hdp2.6-hive:10 > [] - Container prestodb/hdp2.6-hive:10 is starting: > 162069678c7d03252a42ed81ca43e1911ca7357c476a4a5de294ffe55bd83145 > 02:57:42,846 [main] INFO tc.prestodb/hdp2.6-hive:10 > [] - Container prestodb/hdp2.6-hive:10 started in > PT15.855339866S > 02:57:53,447 [main] ERROR tc.prestodb/hdp2.6-hive:10 > [] - Could not start container > java.lang.RuntimeException: java.net.SocketTimeoutException: timeout > at > org.apache.flink.table.gateway.containers.HiveContainer.containerIsStarted(HiveContainer.java:94) > ~[test-classes/:?] 
> at > org.testcontainers.containers.GenericContainer.containerIsStarted(GenericContainer.java:723) > ~[testcontainers-1.19.1.jar:1.19.1] > at > org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:543) > ~[testcontainers-1.19.1.jar:1.19.1] > at > org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:354) > ~[testcontainers-1.19.1.jar:1.19.1] > at > org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81) > ~[duct-tape-1.0.8.jar:?] > at > org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:344) > ~[testcontainers-1.19.1.jar:1.19.1] > at > org.apache.flink.table.gateway.containers.HiveContainer.doStart(HiveContainer.java:69) > ~[test-classes/:?] > at > org.testcontainers.containers.GenericContainer.start(GenericContainer.java:334) > ~[testcontainers-1.19.1.jar:1.19.1] > at > org.testcontainers.containers.GenericContainer.starting(GenericContainer.java:1144) > ~[testcontainers-1.19.1.jar:1.19.1] > at > org.testcontainers.containers.FailureDetectingExternalResource$1.evaluate(FailureDetectingExternalResource.java:28) > ~[testcontainers-1.19.1.jar:1.19.1] > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > ~[junit-4.13.2.jar:4.13.2] > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > ~[junit-4.13.2.jar:4.13.2] > at org.junit.runners.ParentRunner.run(ParentRunner.java:413) > ~[junit-4.13.2.jar:4.13.2] > at org.junit.runner.JUnitCore.run(JUnitCore.java:137) > ~[junit-4.13.2.jar:4.13.2] > at org.junit.runner.JUnitCore.run(JUnitCore.java:115) > ~[junit-4.13.2.jar:4.13.2] > at > org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:42) > ~[junit-vintage-engine-5.10.1.jar:5.10.1] > at > org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:80) > ~[junit-vintage-engine-5.10.1.jar:5.10.1] > at > org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:72) > 
~[junit-vintage-engine-5.10.1.jar:5.10.1] > at > org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:198) > ~[junit-platform-launcher-1.10.1.jar:1.10.1] > at > org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:169) > ~[junit-platform-launcher-1.10.1.jar:1.10.1] > at > org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:93) > ~[junit-platform-launcher-1.10.1.jar:1.10.1] > at > org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:58) > ~[junit-platform-launcher-1.10.1.jar:1.10.1] > at >
[jira] [Resolved] (FLINK-35000) PullRequest template doesn't use the correct format to refer to the testing code convention
[ https://issues.apache.org/jira/browse/FLINK-35000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl resolved FLINK-35000. --- Fix Version/s: 1.18.2 1.20.0 1.19.1 Resolution: Fixed master: [d301839dfe2ed9b1313d23f8307bda76868a0c0a|https://github.com/apache/flink/commit/d301839dfe2ed9b1313d23f8307bda76868a0c0a] 1.19: [eb58599b434b6c5fe86f6e487ce88315c98b4ec3|https://github.com/apache/flink/commit/eb58599b434b6c5fe86f6e487ce88315c98b4ec3] 1.18: [9150f93b18b8694646092a6ed24a14e3653f613f|https://github.com/apache/flink/commit/9150f93b18b8694646092a6ed24a14e3653f613f] > PullRequest template doesn't use the correct format to refer to the testing > code convention > --- > > Key: FLINK-35000 > URL: https://issues.apache.org/jira/browse/FLINK-35000 > Project: Flink > Issue Type: Bug > Components: Build System / CI, Project Website >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Minor > Labels: pull-request-available > Fix For: 1.18.2, 1.20.0, 1.19.1 > > > The PR template refers to > https://flink.apache.org/contributing/code-style-and-quality-common.html#testing > rather than > https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-35002) GitHub action/upload-artifact@v4 can timeout
[ https://issues.apache.org/jira/browse/FLINK-35002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-35002: -- Labels: github-actions test-stability (was: test-stability) > GitHub action/upload-artifact@v4 can timeout > > > Key: FLINK-35002 > URL: https://issues.apache.org/jira/browse/FLINK-35002 > Project: Flink > Issue Type: Bug > Components: Build System >Reporter: Ryan Skraba >Priority: Major > Labels: github-actions, test-stability > > A timeout can occur when uploading a successfully built artifact: > * [https://github.com/apache/flink/actions/runs/8516411871/job/23325392650] > {code:java} > 2024-04-02T02:20:15.6355368Z With the provided path, there will be 1 file > uploaded > 2024-04-02T02:20:15.6360133Z Artifact name is valid! > 2024-04-02T02:20:15.6362872Z Root directory input is valid! > 2024-04-02T02:20:20.6975036Z Attempt 1 of 5 failed with error: Request > timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. > Retrying request in 3000 ms... > 2024-04-02T02:20:28.7084937Z Attempt 2 of 5 failed with error: Request > timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. > Retrying request in 4785 ms... > 2024-04-02T02:20:38.5015936Z Attempt 3 of 5 failed with error: Request > timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. > Retrying request in 7375 ms... > 2024-04-02T02:20:50.8901508Z Attempt 4 of 5 failed with error: Request > timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. > Retrying request in 14988 ms... > 2024-04-02T02:21:10.9028438Z ##[error]Failed to CreateArtifact: Failed to > make request after 5 attempts: Request timeout: > /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact > 2024-04-02T02:22:59.9893296Z Post job cleanup. > 2024-04-02T02:22:59.9958844Z Post job cleanup. {code} > (This is unlikely to be something we can fix, but we can track it.) 
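The retry pattern visible in the log above (delays growing roughly geometrically, with some jitter) is a standard exponential-backoff loop. A minimal sketch for illustration; `retry_with_backoff` and its parameters are hypothetical and not the actual `upload-artifact` implementation:

```python
import random
import time

def retry_with_backoff(request, attempts=5, base_delay=3.0, factor=1.5, jitter=0.5):
    """Call `request`, retrying on TimeoutError with exponential backoff.

    The delay grows by `factor` per attempt, plus up to `jitter * delay`
    of random jitter, similar in spirit to the retry loop in the log above.
    """
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except TimeoutError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the caller
            delay = base_delay * (factor ** (attempt - 1))
            delay += random.uniform(0, jitter * delay)
            time.sleep(delay)
```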
[jira] [Updated] (FLINK-34999) PR CI stopped operating
[ https://issues.apache.org/jira/browse/FLINK-34999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34999: -- Description: There are no [new PR CI runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] being picked up anymore. [Recently updated PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not picked up by the @flinkbot. In the meantime there was a notification sent from GitHub that the password of the [@flinkbot|https://github.com/flinkbot] was reset for security reasons. It's quite likely that these two events are related. was: There are no [new PR CI runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] being picked up anymore. [Recently updated PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not picked up by the @flinkbot. In the meantime there was a notification sent from GitHub that the password of the @flinkbot was reset for security reasons. It's quite likely that these two events are related. > PR CI stopped operating > --- > > Key: FLINK-34999 > URL: https://issues.apache.org/jira/browse/FLINK-34999 > Project: Flink > Issue Type: Bug > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Blocker > > There are no [new PR CI > runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] > being picked up anymore. [Recently updated > PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not > picked up by the @flinkbot. > In the meantime there was a notification sent from GitHub that the password > of the [@flinkbot|https://github.com/flinkbot] was reset for security > reasons. It's quite likely that these two events are related. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35000) PullRequest template doesn't use the correct format to refer to the testing code convention
Matthias Pohl created FLINK-35000: - Summary: PullRequest template doesn't use the correct format to refer to the testing code convention Key: FLINK-35000 URL: https://issues.apache.org/jira/browse/FLINK-35000 Project: Flink Issue Type: Bug Components: Build System / CI, Project Website Affects Versions: 1.18.1, 1.19.0, 1.20.0 Reporter: Matthias Pohl The PR template refers to https://flink.apache.org/contributing/code-style-and-quality-common.html#testing rather than https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-35000) PullRequest template doesn't use the correct format to refer to the testing code convention
[ https://issues.apache.org/jira/browse/FLINK-35000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl reassigned FLINK-35000: - Assignee: Matthias Pohl > PullRequest template doesn't use the correct format to refer to the testing > code convention > --- > > Key: FLINK-35000 > URL: https://issues.apache.org/jira/browse/FLINK-35000 > Project: Flink > Issue Type: Bug > Components: Build System / CI, Project Website >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Minor > > The PR template refers to > https://flink.apache.org/contributing/code-style-and-quality-common.html#testing > rather than > https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34999) PR CI stopped operating
[ https://issues.apache.org/jira/browse/FLINK-34999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833523#comment-17833523 ] Matthias Pohl commented on FLINK-34999: --- CC [~uce] [~Weijie Guo] [~fanrui] [~rmetzger] CC [~jingge] since it might be Ververica infrastructure-related > PR CI stopped operating > --- > > Key: FLINK-34999 > URL: https://issues.apache.org/jira/browse/FLINK-34999 > Project: Flink > Issue Type: Bug > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Blocker > > There are no [new PR CI > runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] > being picked up anymore. [Recently updated > PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not > picked up by the @flinkbot. > In the meantime there was a notification sent from GitHub that the password > of the @flinkbot was reset for security reasons. It's quite likely that these > two events are related. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-34999) PR CI stopped operating
Matthias Pohl created FLINK-34999: - Summary: PR CI stopped operating Key: FLINK-34999 URL: https://issues.apache.org/jira/browse/FLINK-34999 Project: Flink Issue Type: Bug Components: Build System / CI Affects Versions: 1.18.1, 1.19.0, 1.20.0 Reporter: Matthias Pohl There are no [new PR CI runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] being picked up anymore. [Recently updated PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not picked up by the @flinkbot. In the meantime there was a notification sent from GitHub that the password of the @flinkbot was reset for security reasons. It's quite likely that these two events are related. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34997) PyFlink YARN per-job on Docker test failed on azure
[ https://issues.apache.org/jira/browse/FLINK-34997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833505#comment-17833505 ] Matthias Pohl commented on FLINK-34997: --- The issue seems to be that {{docker-compose}} binaries are missing in the Azure VMs. > PyFlink YARN per-job on Docker test failed on azure > --- > > Key: FLINK-34997 > URL: https://issues.apache.org/jira/browse/FLINK-34997 > Project: Flink > Issue Type: Bug > Components: Build System / CI >Affects Versions: 1.20.0 >Reporter: Weijie Guo >Priority: Blocker > Labels: test-stability > > {code} > Apr 03 03:12:37 > == > Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test' > Apr 03 03:12:37 > == > Apr 03 03:12:37 TEST_DATA_DIR: > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202 > Apr 03 03:12:37 Flink dist directory: > /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT > Apr 03 03:12:38 Flink dist directory: > /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT > Apr 03 03:12:38 Docker version 24.0.9, build 2936816 > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: > line 24: docker-compose: command not found > Apr 03 03:12:38 [FAIL] Test script contains errors. > Apr 03 03:12:38 Checking of logs skipped. > Apr 03 03:12:38 > Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 > minutes and 1 seconds! Test exited with exit code 1 > {code} > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34997) PyFlink YARN per-job on Docker test failed on azure
[ https://issues.apache.org/jira/browse/FLINK-34997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34997: -- Labels: test-stability (was: ) > PyFlink YARN per-job on Docker test failed on azure > --- > > Key: FLINK-34997 > URL: https://issues.apache.org/jira/browse/FLINK-34997 > Project: Flink > Issue Type: Bug > Components: Build System / CI >Affects Versions: 1.20.0 >Reporter: Weijie Guo >Priority: Major > Labels: test-stability > > {code} > Apr 03 03:12:37 > == > Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test' > Apr 03 03:12:37 > == > Apr 03 03:12:37 TEST_DATA_DIR: > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202 > Apr 03 03:12:37 Flink dist directory: > /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT > Apr 03 03:12:38 Flink dist directory: > /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT > Apr 03 03:12:38 Docker version 24.0.9, build 2936816 > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: > line 24: docker-compose: command not found > Apr 03 03:12:38 [FAIL] Test script contains errors. > Apr 03 03:12:38 Checking of logs skipped. > Apr 03 03:12:38 > Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 > minutes and 1 seconds! Test exited with exit code 1 > {code} > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34998) Wordcount on Docker test failed on azure
[ https://issues.apache.org/jira/browse/FLINK-34998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833504#comment-17833504 ] Matthias Pohl commented on FLINK-34998: --- I guess, this one is a duplicate of FLINK-34997. In the end, the error happens due to the missing {{docker-compose}} binaries in the Azure VMs. WDYT? > Wordcount on Docker test failed on azure > > > Key: FLINK-34998 > URL: https://issues.apache.org/jira/browse/FLINK-34998 > Project: Flink > Issue Type: Bug > Components: Build System / CI >Affects Versions: 1.20.0 >Reporter: Weijie Guo >Priority: Major > > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_docker_embedded_job.sh: > line 65: docker-compose: command not found > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_docker_embedded_job.sh: > line 66: docker-compose: command not found > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_docker_embedded_job.sh: > line 67: docker-compose: command not found > sort: cannot read: > '/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-24250435151/out/docker_wc_out*': > No such file or directory > Apr 03 02:08:14 FAIL WordCount: Output hash mismatch. Got > d41d8cd98f00b204e9800998ecf8427e, expected 0e5bd0a3dd7d5a7110aa85ff70adb54b. > Apr 03 02:08:14 head hexdump of actual: > head: cannot open > '/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-24250435151/out/docker_wc_out*' > for reading: No such file or directory > Apr 03 02:08:14 Stopping job timeout watchdog (with pid=244913) > Apr 03 02:08:14 [FAIL] Test script contains errors. > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=e9d3d34f-3d15-59f4-0e3e-35067d100dfe=5d91035e-8022-55f2-2d4f-ab121508bf7e=6043 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34997) PyFlink YARN per-job on Docker test failed on azure
[ https://issues.apache.org/jira/browse/FLINK-34997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34997: -- Description: {code} Apr 03 03:12:37 == Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test' Apr 03 03:12:37 == Apr 03 03:12:37 TEST_DATA_DIR: /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202 Apr 03 03:12:37 Flink dist directory: /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT Apr 03 03:12:38 Flink dist directory: /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT Apr 03 03:12:38 Docker version 24.0.9, build 2936816 /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: line 24: docker-compose: command not found Apr 03 03:12:38 [FAIL] Test script contains errors. Apr 03 03:12:38 Checking of logs skipped. Apr 03 03:12:38 Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 minutes and 1 seconds! Test exited with exit code 1 {code} https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226 was: Apr 03 03:12:37 == Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test' Apr 03 03:12:37 == Apr 03 03:12:37 TEST_DATA_DIR: /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202 Apr 03 03:12:37 Flink dist directory: /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT Apr 03 03:12:38 Flink dist directory: /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT Apr 03 03:12:38 Docker version 24.0.9, build 2936816 /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: line 24: docker-compose: command not found Apr 03 03:12:38 [FAIL] Test script contains errors. Apr 03 03:12:38 Checking of logs skipped. 
Apr 03 03:12:38 Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 minutes and 1 seconds! Test exited with exit code 1 https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226 > PyFlink YARN per-job on Docker test failed on azure > --- > > Key: FLINK-34997 > URL: https://issues.apache.org/jira/browse/FLINK-34997 > Project: Flink > Issue Type: Bug > Components: Build System / CI >Affects Versions: 1.20.0 >Reporter: Weijie Guo >Priority: Major > > {code} > Apr 03 03:12:37 > == > Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test' > Apr 03 03:12:37 > == > Apr 03 03:12:37 TEST_DATA_DIR: > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202 > Apr 03 03:12:37 Flink dist directory: > /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT > Apr 03 03:12:38 Flink dist directory: > /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT > Apr 03 03:12:38 Docker version 24.0.9, build 2936816 > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: > line 24: docker-compose: command not found > Apr 03 03:12:38 [FAIL] Test script contains errors. > Apr 03 03:12:38 Checking of logs skipped. > Apr 03 03:12:38 > Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 > minutes and 1 seconds! Test exited with exit code 1 > {code} > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34997) PyFlink YARN per-job on Docker test failed on azure
[ https://issues.apache.org/jira/browse/FLINK-34997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34997: -- Priority: Blocker (was: Major) > PyFlink YARN per-job on Docker test failed on azure > --- > > Key: FLINK-34997 > URL: https://issues.apache.org/jira/browse/FLINK-34997 > Project: Flink > Issue Type: Bug > Components: Build System / CI >Affects Versions: 1.20.0 >Reporter: Weijie Guo >Priority: Blocker > Labels: test-stability > > {code} > Apr 03 03:12:37 > == > Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test' > Apr 03 03:12:37 > == > Apr 03 03:12:37 TEST_DATA_DIR: > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202 > Apr 03 03:12:37 Flink dist directory: > /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT > Apr 03 03:12:38 Flink dist directory: > /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT > Apr 03 03:12:38 Docker version 24.0.9, build 2936816 > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: > line 24: docker-compose: command not found > Apr 03 03:12:38 [FAIL] Test script contains errors. > Apr 03 03:12:38 Checking of logs skipped. > Apr 03 03:12:38 > Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 > minutes and 1 seconds! Test exited with exit code 1 > {code} > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34643) JobIDLoggingITCase failed
[ https://issues.apache.org/jira/browse/FLINK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833414#comment-17833414 ] Matthias Pohl commented on FLINK-34643: --- I guess reopening the issue would be fine. But for the sake of not putting too much into a single ticket, it wouldn't be wrong to create a new ticket and link FLINK-34643 as the cause, either. I personally would go for the latter option. > JobIDLoggingITCase failed > - > > Key: FLINK-34643 > URL: https://issues.apache.org/jira/browse/FLINK-34643 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.20.0 >Reporter: Matthias Pohl >Assignee: Roman Khachatryan >Priority: Major > Labels: pull-request-available, test-stability > Fix For: 1.20.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=8fd9202e-fd17-5b26-353c-ac1ff76c8f28=ea7cf968-e585-52cb-e0fc-f48de023a7ca=7897 > {code} > Mar 09 01:24:23 01:24:23.498 [ERROR] Tests run: 1, Failures: 0, Errors: 1, > Skipped: 0, Time elapsed: 4.209 s <<< FAILURE! -- in > org.apache.flink.test.misc.JobIDLoggingITCase > Mar 09 01:24:23 01:24:23.498 [ERROR] > org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(ClusterClient) > -- Time elapsed: 1.459 s <<< ERROR! 
> Mar 09 01:24:23 java.lang.IllegalStateException: Too few log events recorded > for org.apache.flink.runtime.jobmaster.JobMaster (12) - this must be a bug in > the test code > Mar 09 01:24:23 at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:215) > Mar 09 01:24:23 at > org.apache.flink.test.misc.JobIDLoggingITCase.assertJobIDPresent(JobIDLoggingITCase.java:148) > Mar 09 01:24:23 at > org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(JobIDLoggingITCase.java:132) > Mar 09 01:24:23 at java.lang.reflect.Method.invoke(Method.java:498) > Mar 09 01:24:23 at > java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189) > Mar 09 01:24:23 at > java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) > Mar 09 01:24:23 at > java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) > Mar 09 01:24:23 at > java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) > Mar 09 01:24:23 at > java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) > Mar 09 01:24:23 > {code} > The other test failures of this build were also caused by the same test: > * > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=2c3cbe13-dee0-5837-cf47-3053da9a8a78=b78d9d30-509a-5cea-1fef-db7abaa325ae=8349 > * > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=a596f69e-60d2-5a4b-7d39-dc69e4cdaed3=712ade8c-ca16-5b76-3acd-14df33bc1cb1=8209 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project
[ https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833154#comment-17833154 ] Matthias Pohl edited comment on FLINK-34989 at 4/2/24 12:18 PM: This Jira issue is about adding job concurrency support. Ideally, we should make it configurable in an easy way and set it to a concurrency level of at most 20 as requested by Apache Infra. This affects the nightly builds, which run per branch with 5 different test profiles, each test profile occupying 11 runners (10 stages + a short-running license check) in parallel. Generally, we should make CI more selective anyway. Apache Infra constantly criticizes projects for running heavy-load CI on changes like simple doc changes (see [here|https://infra.apache.org/github-actions-secrets.html]). was (Author: mapohl): This Jira issue is about adding job concurrency support. Ideally, we should make it configurable in an easy way and set it to a concurrency level >20 as requested by Apache Infra. This affects the nightly builds which run per branch with 5 different test profiles and each test profile having 11 runners (10 stages + a short-running license check) being occupied in parallel. Generally, we should make CI be more selective anyway. Apache Infra constantly criticizes projects to run heavy-load CI for things like simple doc changes. > Apache Infra requests to reduce the runner usage for a project > -- > > Key: FLINK-34989 > URL: https://issues.apache.org/jira/browse/FLINK-34989 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Major > Labels: pull-request-available > > The GitHub Actions CI utilizes runners that are hosted by Apache Infra right > now. These runners are limited. 
The runner usage can be monitored via the > following links: > * [Flink-specific > report|https://infra-reports.apache.org/#ghactions=flink=168] > (needs ASF committer rights) This project-specific report can only be > modified through the HTTP GET parameters of the URL. > * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF > membership) > There was a policy change announced recently: > {quote} > Policy change on use of GitHub Actions > Due to misconfigurations in their builds, some projects have been using > unsupportable numbers of GitHub Actions. As part of fixing this situation, > Infra has added a 'resource use' section to the policy on GitHub Actions. > This section of the policy will come into effect on April 20, 2024: > All workflows MUST have a job concurrency level less than or equal to 20. > This means a workflow cannot have more than 20 jobs running at the same time > across all matrices. > All workflows SHOULD have a job concurrency level less than or equal to 15. > Just because 20 is the max, doesn't mean you should strive for 20. > The average number of minutes a project uses per calendar week MUST NOT > exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 > hours). > The average number of minutes a project uses in any consecutive five-day > period MUST NOT exceed the equivalent of 30 full-time runners (216,000 > minutes, or 3,600 hours). > Projects whose builds consistently cross the maximum use limits will lose > their access to GitHub Actions until they fix their build configurations. > The full policy is at > https://infra.apache.org/github-actions-policy.html. > {quote} > Currently (last week of March 2024) Flink was ranked at #19 of projects that > used the Apache Infra runner resources the most which doesn't seem too bad. > This contained not only Apache Flink but also the Kubernetes operator, > connectors and other resources. 
According to [this > source|https://infra.apache.org/github-actions-secrets.html] Apache Infra > manages 180 runners right now. -- This message was sent by Atlassian Jira (v8.20.10#820010)
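The policy's limits can be sanity-checked with a quick runner-minutes calculation (a small illustrative sketch; note that 25 full-time runners over 7 days is actually 252,000 minutes, so the policy text's "250,000 minutes" is a rounding of the 4,200-hour figure):

```python
MINUTES_PER_DAY = 24 * 60

# Weekly cap: the equivalent of 25 full-time runners over 7 days.
weekly_minutes = 25 * 7 * MINUTES_PER_DAY
print(weekly_minutes, weekly_minutes / 60)  # 252000 minutes, 4200.0 hours

# Rolling cap: 30 full-time runners over any consecutive 5-day window.
five_day_minutes = 30 * 5 * MINUTES_PER_DAY
print(five_day_minutes, five_day_minutes / 60)  # 216000 minutes, 3600.0 hours
```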
[jira] [Updated] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project
[ https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34989: -- Description: The GitHub Actions CI utilizes runners that are hosted by Apache Infra right now. These runners are limited. The runner usage can be monitored via the following links: * [Flink-specific report|https://infra-reports.apache.org/#ghactions=flink=168] (needs ASF committer rights) This project-specific report can only be modified through the HTTP GET parameters of the URL. * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF membership) There was a policy change announced recently: {quote} Policy change on use of GitHub Actions Due to misconfigurations in their builds, some projects have been using unsupportable numbers of GitHub Actions. As part of fixing this situation, Infra has added a 'resource use' section to the policy on GitHub Actions. This section of the policy will come into effect on April 20, 2024: All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max, doesn't mean you should strive for 20. The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours). The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). Projects whose builds consistently cross the maximum use limits will lose their access to GitHub Actions until they fix their build configurations. The full policy is at https://infra.apache.org/github-actions-policy.html. 
{quote} Currently (last week of March 2024) Flink was ranked at #19 of projects that used the Apache Infra runner resources the most which doesn't seem too bad. This contained not only Apache Flink but also the Kubernetes operator, connectors and other resources. According to [this source|https://infra.apache.org/github-actions-secrets.html] Apache Infra manages 180 runners right now. was: The GitHub Actions CI utilizes runners that are hosted by Apache Infra right now. These runners are limited. The runner usage can be monitored via the following links: * [Flink-specific report|https://infra-reports.apache.org/#ghactions=flink=168] (needs ASF committer rights) This project-specific report can only be modified through the HTTP GET parameters of the URL. * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF membership) There was a policy change announced recently: {quote} Policy change on use of GitHub Actions Due to misconfigurations in their builds, some projects have been using unsupportable numbers of GitHub Actions. As part of fixing this situation, Infra has added a 'resource use' section to the policy on GitHub Actions. This section of the policy will come into effect on April 20, 2024: All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max, doesn't mean you should strive for 20. The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours). The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). 
Projects whose builds consistently cross the maximum use limits will lose their access to GitHub Actions until they fix their build configurations. The full policy is at https://infra.apache.org/github-actions-policy.html. {quote} Currently (last week of March 2024) Flink was ranked at #19 of projects that used the Apache Infra runner resources the most which doesn't seem too bad. This contained not only Apache Flink but also the Kubernetes operator, connectors and other resources. > Apache Infra requests to reduce the runner usage for a project > -- > > Key: FLINK-34989 > URL: https://issues.apache.org/jira/browse/FLINK-34989 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Major > Labels: pull-request-available > > The GitHub Actions CI utilizes runners that are hosted by Apache Infra right > now. These runners are limited. The runner usage can be monitored via the >
[jira] [Commented] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project
[ https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833155#comment-17833155 ] Matthias Pohl commented on FLINK-34989: --- For this issue, we should keep in mind that it only affects the non-ephemeral runners. FLINK-34331 works on enabling ephemeral runners for Apache Flink. Ephemeral runners would allow us to donate project-specific runners, i.e. someone could donate hardware to give Flink its own runners and not have to worry too much about blocking other projects with CI. > Apache Infra requests to reduce the runner usage for a project > -- > > Key: FLINK-34989 > URL: https://issues.apache.org/jira/browse/FLINK-34989 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Major > > The GitHub Actions CI utilizes runners that are hosted by Apache Infra right > now. These runners are limited. The runner usage can be monitored via the > following links: > * [Flink-specific > report|https://infra-reports.apache.org/#ghactions=flink=168] > (needs ASF committer rights) This project-specific report can only be > modified through the HTTP GET parameters of the URL. > * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF > membership) > There was a policy change announced recently: > {quote} > Policy change on use of GitHub Actions > Due to misconfigurations in their builds, some projects have been using > unsupportable numbers of GitHub Actions. As part of fixing this situation, > Infra has added a 'resource use' section to the policy on GitHub Actions. > This section of the policy will come into effect on April 20, 2024: > All workflows MUST have a job concurrency level less than or equal to 20. > This means a workflow cannot have more than 20 jobs running at the same time > across all matrices. > All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20. > The average number of minutes a project uses per calendar week MUST NOT > exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 > hours). > The average number of minutes a project uses in any consecutive five-day > period MUST NOT exceed the equivalent of 30 full-time runners (216,000 > minutes, or 3,600 hours). > Projects whose builds consistently cross the maximum use limits will lose > their access to GitHub Actions until they fix their build configurations. > The full policy is at > https://infra.apache.org/github-actions-policy.html. > {quote} > Currently (last week of March 2024) Flink was ranked at #19 of projects that > used the Apache Infra runner resources the most which doesn't seem too bad. > This contained not only Apache Flink but also the Kubernetes operator, > connectors and other resources. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34331) Enable Apache INFRA ephemeral runners for nightly builds
[ https://issues.apache.org/jira/browse/FLINK-34331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34331: -- Summary: Enable Apache INFRA ephemeral runners for nightly builds (was: Enable Apache INFRA runners for nightly builds) > Enable Apache INFRA ephemeral runners for nightly builds > > > Key: FLINK-34331 > URL: https://issues.apache.org/jira/browse/FLINK-34331 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Major > Labels: pull-request-available > > The nightly CI is currently still utilizing the GitHub runners. We want to > switch to Apache INFRA's ephemeral runners (see > [docs|https://cwiki.apache.org/confluence/display/INFRA/ASF+Infra+provided+self-hosted+runners]). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project
[ https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833154#comment-17833154 ] Matthias Pohl commented on FLINK-34989: --- This Jira issue is about adding job concurrency support. Ideally, we should make it configurable in an easy way and set it to a concurrency level >20 as requested by Apache Infra. This affects the nightly builds which run per branch with 5 different test profiles and each test profile having 11 runners (10 stages + a short-running license check) being occupied in parallel. Generally, we should make CI be more selective anyway. Apache Infra constantly criticizes projects to run heavy-load CI for things like simple doc changes. > Apache Infra requests to reduce the runner usage for a project > -- > > Key: FLINK-34989 > URL: https://issues.apache.org/jira/browse/FLINK-34989 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Major > > The GitHub Actions CI utilizes runners that are hosted by Apache Infra right > now. These runners are limited. The runner usage can be monitored via the > following links: > * [Flink-specific > report|https://infra-reports.apache.org/#ghactions=flink=168] > (needs ASF committer rights) This project-specific report can only be > modified through the HTTP GET parameters of the URL. > * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF > membership) > There was a policy change announced recently: > {quote} > Policy change on use of GitHub Actions > Due to misconfigurations in their builds, some projects have been using > unsupportable numbers of GitHub Actions. As part of fixing this situation, > Infra has added a 'resource use' section to the policy on GitHub Actions. > This section of the policy will come into effect on April 20, 2024: > All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time > across all matrices. > All workflows SHOULD have a job concurrency level less than or equal to 15. > Just because 20 is the max, doesn't mean you should strive for 20. > The average number of minutes a project uses per calendar week MUST NOT > exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 > hours). > The average number of minutes a project uses in any consecutive five-day > period MUST NOT exceed the equivalent of 30 full-time runners (216,000 > minutes, or 3,600 hours). > Projects whose builds consistently cross the maximum use limits will lose > their access to GitHub Actions until they fix their build configurations. > The full policy is at > https://infra.apache.org/github-actions-policy.html. > {quote} > Currently (last week of March 2024) Flink was ranked at #19 of projects that > used the Apache Infra runner resources the most which doesn't seem too bad. > This contained not only Apache Flink but also the Kubernetes operator, > connectors and other resources. -- This message was sent by Atlassian Jira (v8.20.10#820010)
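In GitHub Actions, the usual levers for capping parallel jobs are a workflow-level `concurrency` group (limits concurrent runs) and `strategy.max-parallel` (limits concurrent matrix jobs). A minimal sketch of the idea follows; the workflow name, profile list, and script path are illustrative placeholders, not Flink's actual CI configuration:

```yaml
# Illustrative only: bounds parallel jobs for a nightly test workflow.
name: flink-nightly
concurrency:
  group: nightly-${{ github.ref }}
  cancel-in-progress: false
jobs:
  test:
    strategy:
      # max-parallel keeps concurrent matrix jobs under Apache
      # Infra's required limit of 20 (ideally 15).
      max-parallel: 15
      matrix:
        profile: [default, adaptive-scheduler, java11, java17, hadoop313]
    runs-on: ubuntu-latest
    steps:
      - run: ./tools/ci/run_tests.sh ${{ matrix.profile }}  # hypothetical script
```

Making `max-parallel` configurable (e.g. via a workflow input) would let the limit be tuned without editing the workflow itself.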
[jira] [Commented] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project
[ https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833153#comment-17833153 ] Matthias Pohl commented on FLINK-34989: --- Here's a summary of the requirements and whether we meet them based on the most recent report:
|| Requirement || Flink CI ||
| Job concurrency level of 20 (or, better, 15) or below | (n) |
| Weekly usage must not exceed 25 full-time runners, i.e. 4,200 hours per 7 days | (y) |
| Usage in any consecutive 5-day period must not exceed 30 full-time runners, i.e. 3,600 hours | (y) |
> Apache Infra requests to reduce the runner usage for a project > -- > > Key: FLINK-34989 > URL: https://issues.apache.org/jira/browse/FLINK-34989 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Major > > The GitHub Actions CI utilizes runners that are hosted by Apache Infra right > now. These runners are limited. The runner usage can be monitored via the > following links: > * [Flink-specific > report|https://infra-reports.apache.org/#ghactions=flink=168] > (needs ASF committer rights) This project-specific report can only be > modified through the HTTP GET parameters of the URL. > * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF > membership) > There was a policy change announced recently: > {quote} > Policy change on use of GitHub Actions > Due to misconfigurations in their builds, some projects have been using > unsupportable numbers of GitHub Actions. As part of fixing this situation, > Infra has added a 'resource use' section to the policy on GitHub Actions. > This section of the policy will come into effect on April 20, 2024: > All workflows MUST have a job concurrency level less than or equal to 20. > This means a workflow cannot have more than 20 jobs running at the same time > across all matrices. > All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20. > The average number of minutes a project uses per calendar week MUST NOT > exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 > hours). > The average number of minutes a project uses in any consecutive five-day > period MUST NOT exceed the equivalent of 30 full-time runners (216,000 > minutes, or 3,600 hours). > Projects whose builds consistently cross the maximum use limits will lose > their access to GitHub Actions until they fix their build configurations. > The full policy is at > https://infra.apache.org/github-actions-policy.html. > {quote} > Currently (last week of March 2024) Flink was ranked at #19 of projects that > used the Apache Infra runner resources the most which doesn't seem too bad. > This contained not only Apache Flink but also the Kubernetes operator, > connectors and other resources. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project
[ https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34989: -- Description: The GitHub Actions CI utilizes runners that are hosted by Apache Infra right now. These runners are limited. The runner usage can be monitored via the following links: * [Flink-specific report|https://infra-reports.apache.org/#ghactions=flink=168] (needs ASF committer rights) This project-specific report can only be modified through the HTTP GET parameters of the URL. * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF membership) There was a policy change announced recently: {quote} Policy change on use of GitHub Actions Due to misconfigurations in their builds, some projects have been using unsupportable numbers of GitHub Actions. As part of fixing this situation, Infra has added a 'resource use' section to the policy on GitHub Actions. This section of the policy will come into effect on April 20, 2024: All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max, doesn't mean you should strive for 20. The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours). The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). Projects whose builds consistently cross the maximum use limits will lose their access to GitHub Actions until they fix their build configurations. The full policy is at https://infra.apache.org/github-actions-policy.html. 
{quote} Currently (last week of March 2024) Flink was ranked at #19 of projects that used the Apache Infra runner resources the most which doesn't seem too bad. This contained not only Apache Flink but also the Kubernetes operator, connectors and other resources. was: The GitHub Actions CI utilizes runners that are hosted by Apache Infra right now. These runners are limited. The runner usage can be monitored via the following links: * [Flink-specific report|https://infra-reports.apache.org/#ghactions=flink=168] (needs ASF committer rights) This project-specific report can only be modified through the HTTP GET parameters of the URL. * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF membership) There was a policy change announced recently: {quote} Policy change on use of GitHub Actions Due to misconfigurations in their builds, some projects have been using unsupportable numbers of GitHub Actions. As part of fixing this situation, Infra has added a 'resource use' section to the policy on GitHub Actions. This section of the policy will come into effect on April 20, 2024: All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max, doesn't mean you should strive for 20. The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours). The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). Projects whose builds consistently cross the maximum use limits will lose their access to GitHub Actions until they fix their build configurations. The full policy is at https://infra.apache.org/github-actions-policy.html. 
{quote} > Apache Infra requests to reduce the runner usage for a project > -- > > Key: FLINK-34989 > URL: https://issues.apache.org/jira/browse/FLINK-34989 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Major > > The GitHub Actions CI utilizes runners that are hosted by Apache Infra right > now. These runners are limited. The runner usage can be monitored via the > following links: > * [Flink-specific > report|https://infra-reports.apache.org/#ghactions=flink=168] > (needs ASF committer rights) This project-specific report can only be > modified through the HTTP GET parameters of the URL. > * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF > membership) > There was a policy change announced recently: > {quote} > Policy change on use of GitHub Actions > Due to
[jira] [Updated] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project
[ https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34989: -- Description: The GitHub Actions CI utilizes runners that are hosted by Apache Infra right now. These runners are limited. The runner usage can be monitored via the following links: * [Flink-specific report|https://infra-reports.apache.org/#ghactions=flink=168] (needs ASF committer rights) This project-specific report can only be modified through the HTTP GET parameters of the URL. * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF membership) There was a policy change announced recently: {quote} Policy change on use of GitHub Actions Due to misconfigurations in their builds, some projects have been using unsupportable numbers of GitHub Actions. As part of fixing this situation, Infra has added a 'resource use' section to the policy on GitHub Actions. This section of the policy will come into effect on April 20, 2024: All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices. All workflows SHOULD have a job concurrency level less than or equal to 15. Just because 20 is the max, doesn't mean you should strive for 20. The average number of minutes a project uses per calendar week MUST NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 hours). The average number of minutes a project uses in any consecutive five-day period MUST NOT exceed the equivalent of 30 full-time runners (216,000 minutes, or 3,600 hours). Projects whose builds consistently cross the maximum use limits will lose their access to GitHub Actions until they fix their build configurations. The full policy is at https://infra.apache.org/github-actions-policy.html. {quote} was: The GitHub Actions CI utilizes runners that are hosted by Apache Infra right now. These runners are limited. 
The runner usage can be monitored via the following links: * [Flink-specific report|https://infra-reports.apache.org/#ghactions=flink=168] (needs ASF committer rights) This project-specific report can only be modified through the HTTP GET parameters of the URL. * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF membership) > Apache Infra requests to reduce the runner usage for a project > -- > > Key: FLINK-34989 > URL: https://issues.apache.org/jira/browse/FLINK-34989 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Major > > The GitHub Actions CI utilizes runners that are hosted by Apache Infra right > now. These runners are limited. The runner usage can be monitored via the > following links: > * [Flink-specific > report|https://infra-reports.apache.org/#ghactions=flink=168] > (needs ASF committer rights) This project-specific report can only be > modified through the HTTP GET parameters of the URL. > * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF > membership) > There was a policy change announced recently: > {quote} > Policy change on use of GitHub Actions > Due to misconfigurations in their builds, some projects have been using > unsupportable numbers of GitHub Actions. As part of fixing this situation, > Infra has added a 'resource use' section to the policy on GitHub Actions. > This section of the policy will come into effect on April 20, 2024: > All workflows MUST have a job concurrency level less than or equal to 20. > This means a workflow cannot have more than 20 jobs running at the same time > across all matrices. > All workflows SHOULD have a job concurrency level less than or equal to 15. > Just because 20 is the max, doesn't mean you should strive for 20. 
> The average number of minutes a project uses per calendar week MUST NOT > exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 > hours). > The average number of minutes a project uses in any consecutive five-day > period MUST NOT exceed the equivalent of 30 full-time runners (216,000 > minutes, or 3,600 hours). > Projects whose builds consistently cross the maximum use limits will lose > their access to GitHub Actions until they fix their build configurations. > The full policy is at > https://infra.apache.org/github-actions-policy.html. > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34937) Apache Infra GHA policy update
[ https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833149#comment-17833149 ] Matthias Pohl commented on FLINK-34937: --- I moved the runner usage discussion into FLINK-34989 > Apache Infra GHA policy update > -- > > Key: FLINK-34937 > URL: https://issues.apache.org/jira/browse/FLINK-34937 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Major > > There is a policy update [announced in the infra > ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which > asked Apache projects to limit the number of runners per job. Additionally, > the [GHA policy|https://infra.apache.org/github-actions-policy.html] is > referenced which I wasn't aware of when working on the action workflow. > This issue is about applying the policy to the Flink GHA workflows. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project
Matthias Pohl created FLINK-34989: - Summary: Apache Infra requests to reduce the runner usage for a project Key: FLINK-34989 URL: https://issues.apache.org/jira/browse/FLINK-34989 Project: Flink Issue Type: Sub-task Components: Build System / CI Affects Versions: 1.18.1, 1.19.0, 1.20.0 Reporter: Matthias Pohl The GitHub Actions CI utilizes runners that are hosted by Apache Infra right now. These runners are limited. The runner usage can be monitored via the following links: * [Flink-specific report|https://infra-reports.apache.org/#ghactions=flink=168] (needs ASF committer rights) This project-specific report can only be modified through the HTTP GET parameters of the URL. * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF membership) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34427) FineGrainedSlotManagerTest fails fatally (exit code 239)
[ https://issues.apache.org/jira/browse/FLINK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833098#comment-17833098 ] Matthias Pohl commented on FLINK-34427: --- Copied over from FLINK-33416: * https://github.com/XComp/flink/actions/runs/6472726326/job/17575765131 * 1.19: https://github.com/apache/flink/actions/runs/8467681781/job/23199435037#step:10:8909 > FineGrainedSlotManagerTest fails fatally (exit code 239) > > > Key: FLINK-34427 > URL: https://issues.apache.org/jira/browse/FLINK-34427 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Critical > Labels: pull-request-available, test-stability > > https://github.com/apache/flink/actions/runs/7866453350/job/21460921911#step:10:8959 > {code} > Error: 02:28:53 02:28:53.220 [ERROR] Process Exit Code: 239 > Error: 02:28:53 02:28:53.220 [ERROR] Crashed tests: > Error: 02:28:53 02:28:53.220 [ERROR] > org.apache.flink.runtime.resourcemanager.ResourceManagerTaskExecutorTest > Error: 02:28:53 02:28:53.220 [ERROR] > org.apache.maven.surefire.booter.SurefireBooterForkException: > ExecutionException The forked VM terminated without properly saying goodbye. > VM crash or System.exit called? 
> Error: 02:28:53 02:28:53.220 [ERROR] Command was /bin/sh -c cd > '/root/flink/flink-runtime' && > '/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java' '-XX:+UseG1GC' '-Xms256m' > '-XX:+IgnoreUnrecognizedVMOptions' > '--add-opens=java.base/java.util=ALL-UNNAMED' > '--add-opens=java.base/java.lang=ALL-UNNAMED' > '--add-opens=java.base/java.net=ALL-UNNAMED' > '--add-opens=java.base/java.io=ALL-UNNAMED' > '--add-opens=java.base/java.util.concurrent=ALL-UNNAMED' '-Xmx768m' '-jar' > '/root/flink/flink-runtime/target/surefire/surefirebooter-20240212022332296_94.jar' > '/root/flink/flink-runtime/target/surefire' > '2024-02-12T02-21-39_495-jvmRun3' 'surefire-20240212022332296_88tmp' > 'surefire_26-20240212022332296_91tmp' > Error: 02:28:53 02:28:53.220 [ERROR] Error occurred in starting fork, check > output in log > Error: 02:28:53 02:28:53.220 [ERROR] Process Exit Code: 239 > Error: 02:28:53 02:28:53.220 [ERROR] Crashed tests: > Error: 02:28:53 02:28:53.221 [ERROR] > org.apache.flink.runtime.resourcemanager.ResourceManagerTaskExecutorTest > Error: 02:28:53 02:28:53.221 [ERROR] at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:456) > [...] > {code} > The fatal error is triggered most likely within the > {{FineGrainedSlotManagerTest}}: > {code} > 02:26:39,362 [ pool-643-thread-1] ERROR > org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: > Thread 'pool-643-thread-1' produced an uncaught exception. Stopping the > process... 
> java.util.concurrent.CompletionException: > java.util.concurrent.RejectedExecutionException: Task > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@4bbc0b10 > rejected from > java.util.concurrent.ScheduledThreadPoolExecutor@7a45cd9a[Shutting down, pool > size = 1, active threads = 1, queued tasks = 1, completed tasks = 194] > at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) > ~[?:1.8.0_392] > at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) > ~[?:1.8.0_392] > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:838) > ~[?:1.8.0_392] > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) > ~[?:1.8.0_392] > at > java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:851) > ~[?:1.8.0_392] > at > java.util.concurrent.CompletableFuture.handleAsync(CompletableFuture.java:2178) > ~[?:1.8.0_392] > at > org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer.allocateSlot(DefaultSlotStatusSyncer.java:138) > ~[classes/:?] > at > org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.allocateSlotsAccordingTo(FineGrainedSlotManager.java:722) > ~[classes/:?] > at > org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.checkResourceRequirements(FineGrainedSlotManager.java:645) > ~[classes/:?] > at > org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.lambda$null$12(FineGrainedSlotManager.java:603) > ~[classes/:?] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > [?:1.8.0_392] > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > [?:1.8.0_392] > at >
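The {{RejectedExecutionException}} in the trace above is the JVM's generic behavior when a task is handed to an executor that has already started shutting down. A minimal standalone sketch (illustrative names, not Flink code) reproduces the mechanism:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RejectedSchedulingSketch {

    // Shut an executor down and then try to schedule a delayed task on it,
    // mirroring the delayed checkResourceRequirements task in the trace above.
    static String scheduleAfterShutdown() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.shutdown();
        try {
            executor.schedule(() -> { }, 50, TimeUnit.MILLISECONDS);
            return "scheduled";
        } catch (RejectedExecutionException e) {
            // In the test, this exception escapes into a CompletableFuture callback,
            // reaches the FatalExitExceptionHandler, and kills the forked JVM.
            return "rejected";
        }
    }

    public static void main(String[] args) {
        System.out.println(scheduleAfterShutdown()); // prints "rejected"
    }
}
```

This matches the "Shutting down, pool size = 1, active threads = 1, queued tasks = 1" executor state in the trace and suggests a race between test teardown (executor shutdown) and a still-pending delayed check.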
[jira] [Closed] (FLINK-33416) FineGrainedSlotManagerTest failed with fatal error
[ https://issues.apache.org/jira/browse/FLINK-33416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl closed FLINK-33416. - Resolution: Duplicate This issue is addressed in FLINK-34427. I'm closing FLINK-33416 in favor of FLINK-34427 because the investigation happened there. > FineGrainedSlotManagerTest failed with fatal error > -- > > Key: FLINK-33416 > URL: https://issues.apache.org/jira/browse/FLINK-33416 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Reporter: Matthias Pohl >Priority: Major > Labels: github-actions, test-stability > > In FLINK-33245, we reported an error of the > {{ZooKeeperLeaderElectionConnectionHandlingTest}} failure due to a fatal > error. The corresponding build is [this > one|https://github.com/XComp/flink/actions/runs/6472726326/job/17575765131]. > But the stacktrace indicates that it's actually > {{FineGrainedSlotManagerTest}} which ran before the ZK-related test: > {code} > Test > org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest.testSlotAllocationAccordingToStrategyResult[testSlotAllocationAccordingToStrategyResult()] > successfully run. > > 19:30:11,463 [ pool-752-thread-1] ERROR > org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: > Thread 'pool-752-thread-1' produced an uncaught exception. Stopping the > process... > java.util.concurrent.CompletionException: > java.util.concurrent.RejectedExecutionException: Task > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1201ef67[Not > completed, task = > java.util.concurrent.Executors$RunnableAdapter@1ea6ccfa[Wrapped task = > java.util.concurrent.CompletableFuture$UniHandle@36f84d94]] rejected from > java.util.concurrent.ScheduledThreadPoolExecutor@4642c78d[Shutting down, pool > size = 1, active threads = 1, queued tasks = 1, completed tasks = 194] > at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) > ~[?:?] 
> at > java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:951) > ~[?:?] > at > java.util.concurrent.CompletableFuture.handleAsync(CompletableFuture.java:2276) > ~[?:?] > at > org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer.allocateSlot(DefaultSlotStatusSyncer.java:138) > ~[classes/:?] > at > org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.allocateSlotsAccordingTo(FineGrainedSlotManager.java:722) > ~[classes/:?] > at > org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.checkResourceRequirements(FineGrainedSlotManager.java:645) > ~[classes/:?] > at > org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.lambda$checkResourceRequirementsWithDelay$12(FineGrainedSlotManager.java:603) > ~[classes/:?] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] > at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > [?:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at java.lang.Thread.run(Thread.java:829) [?:?] > Caused by: java.util.concurrent.RejectedExecutionException: Task > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1201ef67[Not > completed, task = > java.util.concurrent.Executors$RunnableAdapter@1ea6ccfa[Wrapped task = > java.util.concurrent.CompletableFuture$UniHandle@36f84d94]] rejected from > java.util.concurrent.ScheduledThreadPoolExecutor@4642c78d[Shutting down, pool > size = 1, active threads = 1, queued tasks = 1, completed tasks = 194] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055) > ~[?:?] 
> at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) > ~[?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340) > ~[?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562) > ~[?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:705) > ~[?:?] > at > java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:687) > ~[?:?] > at >
[jira] [Comment Edited] (FLINK-34988) Class loading issues in JDK17 and JDK21
[ https://issues.apache.org/jira/browse/FLINK-34988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833095#comment-17833095 ] Matthias Pohl edited comment on FLINK-34988 at 4/2/24 10:07 AM: It's most likely caused by FLINK-34548 based on the git history between the most recent successful nightly run on master [20240331.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58645=results] (based on {{3841f062}}) and [20240402.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=results] (based on {{d271495c}}): {code} $ git log 3841f062..d271495c --oneline d271495c5be [hotfix] Fix compile error in DataStreamV2SinkTransformation 28762497bdf [FLINK-34548][API] Supports sink-v2 Sink 056660e0b69 [FLINK-34548][API] Supports FLIP-27 Source ceafa5a5705 [FLINK-34548][API] Implement datastream 4f71c5b4660 [FLINK-34548][API] Implement process function's underlying operators e1147ca7e39 [FLINK-34548][API] Introduce ExecutionEnvironment 9fa74a8a706 [FLINK-34548][API] Introduce stream interface and move KeySelector to flink-core-api cedbcce6eff [FLINK-34548][API] Introduce variants of ProcessFunction 13cfaa76b5e [FLINK-34548][API] Introduce ProcessFunction and RuntimeContext related interfaces 13790e03207 [FLINK-34548][API] Move Function interface to flink-core-api 59525e460af [FLINK-34548][API] Create flink-core-api module and let flink-core depend on it 5b2e923be0a [FLINK-34548][API] Initialize the datastream v2 related modules {code} was (Author: mapohl): It's most likely caused by FLINK-34548 based on the git history between the most recent successful nightly run on master [20240331.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58645=results] (based on {{3841f062}}) and [20240402.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=results] (based on {{d271495c}}): {code} $ git log 3841f062..d271495c5be34f4e4a518207ca7716f4e8907e5f --oneline d271495c5be 
[hotfix] Fix compile error in DataStreamV2SinkTransformation 28762497bdf [FLINK-34548][API] Supports sink-v2 Sink 056660e0b69 [FLINK-34548][API] Supports FLIP-27 Source ceafa5a5705 [FLINK-34548][API] Implement datastream 4f71c5b4660 [FLINK-34548][API] Implement process function's underlying operators e1147ca7e39 [FLINK-34548][API] Introduce ExecutionEnvironment 9fa74a8a706 [FLINK-34548][API] Introduce stream interface and move KeySelector to flink-core-api cedbcce6eff [FLINK-34548][API] Introduce variants of ProcessFunction 13cfaa76b5e [FLINK-34548][API] Introduce ProcessFunction and RuntimeContext related interfaces 13790e03207 [FLINK-34548][API] Move Function interface to flink-core-api 59525e460af [FLINK-34548][API] Create flink-core-api module and let flink-core depend on it 5b2e923be0a [FLINK-34548][API] Initialize the datastream v2 related modules {code} > Class loading issues in JDK17 and JDK21 > --- > > Key: FLINK-34988 > URL: https://issues.apache.org/jira/browse/FLINK-34988 > Project: Flink > Issue Type: Bug > Components: API / DataStream >Affects Versions: 1.20.0 >Reporter: Matthias Pohl >Priority: Major > Labels: test-stability > > * JDK 17 (core; NoClassDefFoundError caused by ExceptionInInitializerError): > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=675bf62c-8558-587e-2555-dcad13acefb5=5878eed3-cc1e-5b12-1ed0-9e7139ce0992=12942 > * JDK 17 (misc; ExceptionInInitializerError): > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d871f0ce-7328-5d00-023b-e7391f5801c8=77cbea27-feb9-5cf5-53f7-3267f9f9c6b6=22548 > * JDK 21 (core; same as above): > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d06b80b4-9e88-5d40-12a2-18072cf60528=609ecd5a-3f6e-5d0c-2239-2096b155a4d0=12963 > * JDK 21 (misc; same as above): > 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=59a2b95a-736b-5c46-b3e0-cee6e587fd86=c301da75-e699-5c06-735f-778207c16f50=22506 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34988) Class loading issues in JDK17 and JDK21
[ https://issues.apache.org/jira/browse/FLINK-34988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833095#comment-17833095 ] Matthias Pohl commented on FLINK-34988: --- It's most likely caused by FLINK-34548 based on the git history between the most recent successful nightly run on master [20240331.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58645=results] (based on {{3841f062}}) and [20240402.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=results] (based on {{d271495c}}): {code} $ git log 3841f062..d271495c5be34f4e4a518207ca7716f4e8907e5f --oneline d271495c5be [hotfix] Fix compile error in DataStreamV2SinkTransformation 28762497bdf [FLINK-34548][API] Supports sink-v2 Sink 056660e0b69 [FLINK-34548][API] Supports FLIP-27 Source ceafa5a5705 [FLINK-34548][API] Implement datastream 4f71c5b4660 [FLINK-34548][API] Implement process function's underlying operators e1147ca7e39 [FLINK-34548][API] Introduce ExecutionEnvironment 9fa74a8a706 [FLINK-34548][API] Introduce stream interface and move KeySelector to flink-core-api cedbcce6eff [FLINK-34548][API] Introduce variants of ProcessFunction 13cfaa76b5e [FLINK-34548][API] Introduce ProcessFunction and RuntimeContext related interfaces 13790e03207 [FLINK-34548][API] Move Function interface to flink-core-api 59525e460af [FLINK-34548][API] Create flink-core-api module and let flink-core depend on it 5b2e923be0a [FLINK-34548][API] Initialize the datastream v2 related modules {code} > Class loading issues in JDK17 and JDK21 > --- > > Key: FLINK-34988 > URL: https://issues.apache.org/jira/browse/FLINK-34988 > Project: Flink > Issue Type: Bug > Components: API / DataStream >Affects Versions: 1.20.0 >Reporter: Matthias Pohl >Priority: Major > Labels: test-stability > > * JDK 17 (core; NoClassDefFoundError caused by ExceptionInInitializerError): > 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=675bf62c-8558-587e-2555-dcad13acefb5=5878eed3-cc1e-5b12-1ed0-9e7139ce0992=12942 > * JDK 17 (misc; ExceptionInInitializerError): > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d871f0ce-7328-5d00-023b-e7391f5801c8=77cbea27-feb9-5cf5-53f7-3267f9f9c6b6=22548 > * JDK 21 (core; same as above): > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d06b80b4-9e88-5d40-12a2-18072cf60528=609ecd5a-3f6e-5d0c-2239-2096b155a4d0=12963 > * JDK 21 (misc; same as above): > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=59a2b95a-736b-5c46-b3e0-cee6e587fd86=c301da75-e699-5c06-735f-778207c16f50=22506 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-34988) Class loading issues in JDK17 and JDK21
Matthias Pohl created FLINK-34988: - Summary: Class loading issues in JDK17 and JDK21 Key: FLINK-34988 URL: https://issues.apache.org/jira/browse/FLINK-34988 Project: Flink Issue Type: Bug Components: API / DataStream Affects Versions: 1.20.0 Reporter: Matthias Pohl * JDK 17 (core; NoClassDefFoundError caused by ExceptionInInitializerError): https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=675bf62c-8558-587e-2555-dcad13acefb5=5878eed3-cc1e-5b12-1ed0-9e7139ce0992=12942 * JDK 17 (misc; ExceptionInInitializerError): https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d871f0ce-7328-5d00-023b-e7391f5801c8=77cbea27-feb9-5cf5-53f7-3267f9f9c6b6=22548 * JDK 21 (core; same as above): https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d06b80b4-9e88-5d40-12a2-18072cf60528=609ecd5a-3f6e-5d0c-2239-2096b155a4d0=12963 * JDK 21 (misc; same as above): https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=59a2b95a-736b-5c46-b3e0-cee6e587fd86=c301da75-e699-5c06-735f-778207c16f50=22506 -- This message was sent by Atlassian Jira (v8.20.10#820010)
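A "NoClassDefFoundError caused by ExceptionInInitializerError" is standard JVM behavior when a static initializer fails: the first reference to the class throws {{ExceptionInInitializerError}}, and every subsequent reference fails with {{NoClassDefFoundError}} because the class is marked erroneous. A minimal sketch (illustrative class, unrelated to the actual Flink code under investigation):

```java
public class InitFailureSketch {

    // A class whose static initializer fails on first use.
    static class Broken {
        static final int VALUE = compute(); // not a compile-time constant, so access triggers init
        static int compute() { throw new IllegalStateException("boom"); }
    }

    // First access: the failing static initializer surfaces as ExceptionInInitializerError.
    static String firstLoad() {
        try {
            return String.valueOf(Broken.VALUE);
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    // Later accesses: the class is already marked erroneous, so the JVM
    // throws NoClassDefFoundError instead of re-running the initializer.
    static String secondLoad() {
        try {
            return String.valueOf(Broken.VALUE);
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLoad());  // prints "ExceptionInInitializerError"
        System.out.println(secondLoad()); // prints "NoClassDefFoundError"
    }
}
```

So in the linked builds, the interesting error is the first {{ExceptionInInitializerError}}; the later {{NoClassDefFoundError}} occurrences are follow-on symptoms.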
[jira] [Updated] (FLINK-33816) SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due to async checkpoint triggering not being completed
[ https://issues.apache.org/jira/browse/FLINK-33816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-33816: -- Fix Version/s: 1.19.1 > SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due to > async checkpoint triggering not being completed > - > > Key: FLINK-33816 > URL: https://issues.apache.org/jira/browse/FLINK-33816 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Checkpointing, Runtime / Coordination >Affects Versions: 1.19.0 >Reporter: Matthias Pohl >Assignee: jiabao.sun >Priority: Major > Labels: github-actions, pull-request-available, test-stability > Fix For: 1.20.0, 1.19.1 > > Attachments: screenshot-1.png > > > [https://github.com/XComp/flink/actions/runs/7182604625/job/19559947894#step:12:9430] > {code:java} > Error: 14:39:01 14:39:01.930 [ERROR] Tests run: 16, Failures: 1, Errors: 0, > Skipped: 0, Time elapsed: 1.878 s <<< FAILURE! - in > org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest > 9426Error: 14:39:01 14:39:01.930 [ERROR] > org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain > Time elapsed: 0.034 s <<< FAILURE! 
{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33816) SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due to async checkpoint triggering not being completed
[ https://issues.apache.org/jira/browse/FLINK-33816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833056#comment-17833056 ] Matthias Pohl commented on FLINK-33816: --- master: [5aebb04b3055fbec6a74eaf4226c4a88d3fd2d6e|https://github.com/apache/flink/commit/5aebb04b3055fbec6a74eaf4226c4a88d3fd2d6e] 1.19: [ece4faee055b3797b39e9c0b55f3e94a3db2f912|https://github.com/apache/flink/commit/ece4faee055b3797b39e9c0b55f3e94a3db2f912] > SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due to > async checkpoint triggering not being completed > - > > Key: FLINK-33816 > URL: https://issues.apache.org/jira/browse/FLINK-33816 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Checkpointing, Runtime / Coordination >Affects Versions: 1.19.0 >Reporter: Matthias Pohl >Assignee: jiabao.sun >Priority: Major > Labels: github-actions, pull-request-available, test-stability > Fix For: 1.20.0 > > Attachments: screenshot-1.png > > > [https://github.com/XComp/flink/actions/runs/7182604625/job/19559947894#step:12:9430] > {code:java} > Error: 14:39:01 14:39:01.930 [ERROR] Tests run: 16, Failures: 1, Errors: 0, > Skipped: 0, Time elapsed: 1.878 s <<< FAILURE! - in > org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest > 9426Error: 14:39:01 14:39:01.930 [ERROR] > org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain > Time elapsed: 0.034 s <<< FAILURE! 
> 9427Dec 12 14:39:01 org.opentest4j.AssertionFailedError: > 9428Dec 12 14:39:01 > 9429Dec 12 14:39:01 Expecting value to be true but was false > 9430Dec 12 14:39:01 at > java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62) > 9431Dec 12 14:39:01 at > java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502) > 9432Dec 12 14:39:01 at > org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain(SourceStreamTaskTest.java:710) > 9433Dec 12 14:39:01 at > java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) > 9434Dec 12 14:39:01 at > java.base/java.lang.reflect.Method.invoke(Method.java:580) > [...] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
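The assertion failure above ("Expecting value to be true but was false") has the typical shape of a race where the test asserts on a flag before the asynchronous checkpoint triggering has finished. The general fix pattern, sketched with a plain {{CompletableFuture}} (illustrative stand-in, not the actual test code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class AsyncTriggerSketch {

    // Illustrative stand-in for an asynchronously completed "savepoint triggered" result.
    static CompletableFuture<Boolean> triggerSavepointAsync() {
        return CompletableFuture.supplyAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(20); // simulated asynchronous triggering work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return true;
        });
    }

    public static void main(String[] args) {
        CompletableFuture<Boolean> triggered = triggerSavepointAsync();
        // Racy: triggered.getNow(false) could still observe false at this point.
        // Stable: block until the asynchronous triggering has actually completed.
        System.out.println(triggered.join()); // prints "true"
    }
}
```

Waiting on the triggering future before asserting removes the race without changing what is being verified.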
[jira] [Commented] (FLINK-34953) Add github ci for flink-web to auto commit build files
[ https://issues.apache.org/jira/browse/FLINK-34953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833036#comment-17833036 ] Matthias Pohl commented on FLINK-34953: --- Hi [~gongzhongqiang], it sounds like we have already reached consensus on this matter. But you could bring this up on the dev ML to check whether there are any objections to this approach before going ahead with this ticket, so that it has proper backing from the community. > Add github ci for flink-web to auto commit build files > -- > > Key: FLINK-34953 > URL: https://issues.apache.org/jira/browse/FLINK-34953 > Project: Flink > Issue Type: Improvement > Components: Project Website >Reporter: Zhongqiang Gong >Priority: Minor > Labels: website > > Currently, https://github.com/apache/flink-web commits build files from local > builds. So I want to use GitHub CI to build the docs and commit them. > > Changes: > * Add a website build check for PRs > * Auto-build and commit build files after a PR is merged to `asf-site` > * Optional: this CI can be triggered manually -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34961) GitHub Actions runner statistics can be monitored per workflow name
[ https://issues.apache.org/jira/browse/FLINK-34961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34961: -- Labels: starter (was: ) > GitHub Actions runner statistics can be monitored per workflow name > -- > > Key: FLINK-34961 > URL: https://issues.apache.org/jira/browse/FLINK-34961 > Project: Flink > Issue Type: Improvement > Components: Build System / CI >Reporter: Matthias Pohl >Priority: Major > Labels: starter > > Apache Infra allows the monitoring of runner usage per workflow (see [report > for > Flink|https://infra-reports.apache.org/#ghactions=flink=168=10]; > only accessible with Apache committer rights). They accumulate the data by > workflow name. The Flink space has multiple repositories that use the generic > workflow name {{CI}}. That makes the differentiation in the report harder. > This Jira issue is about identifying all Flink-related projects with a CI > workflow (Kubernetes operator and the JDBC connector were identified, for > instance) and adding a more distinct name. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-34961) GitHub Actions statistics can be monitored per workflow name
Matthias Pohl created FLINK-34961: - Summary: GitHub Actions statistics can be monitored per workflow name Key: FLINK-34961 URL: https://issues.apache.org/jira/browse/FLINK-34961 Project: Flink Issue Type: Improvement Components: Build System / CI Reporter: Matthias Pohl Apache Infra allows the monitoring of runner usage per workflow (see [report for Flink|https://infra-reports.apache.org/#ghactions=flink=168=10]; only accessible with Apache committer rights). They accumulate the data by workflow name. The Flink space has multiple repositories that use the generic workflow name {{CI}}. That makes the differentiation in the report harder. This Jira issue is about identifying all Flink-related projects with a CI workflow (Kubernetes operator and the JDBC connector were identified, for instance) and adding a more distinct name. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34961) GitHub Actions runner statistics can be monitored per workflow name
[ https://issues.apache.org/jira/browse/FLINK-34961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34961: -- Summary: GitHub Actions runner statistics can be monitored per workflow name (was: GitHub Actions statistics can be monitored per workflow name) > GitHub Actions runner statistics can be monitored per workflow name > -- > > Key: FLINK-34961 > URL: https://issues.apache.org/jira/browse/FLINK-34961 > Project: Flink > Issue Type: Improvement > Components: Build System / CI >Reporter: Matthias Pohl >Priority: Major > > Apache Infra allows the monitoring of runner usage per workflow (see [report > for > Flink|https://infra-reports.apache.org/#ghactions=flink=168=10]; > only accessible with Apache committer rights). They accumulate the data by > workflow name. The Flink space has multiple repositories that use the generic > workflow name {{CI}}. That makes the differentiation in the report harder. > This Jira issue is about identifying all Flink-related projects with a CI > workflow (Kubernetes operator and the JDBC connector were identified, for > instance) and adding a more distinct name. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34937) Apache Infra GHA policy update
[ https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831844#comment-17831844 ] Matthias Pohl commented on FLINK-34937: --- Looks like Flink ranks 19th in terms of runner minutes used over the past 7 days: [Flink-specific report|https://infra-reports.apache.org/#ghactions=flink=168] (needs ASF committer rights) [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF membership) > Apache Infra GHA policy update > -- > > Key: FLINK-34937 > URL: https://issues.apache.org/jira/browse/FLINK-34937 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Major > > There is a policy update [announced in the infra > ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which > asked Apache projects to limit the number of runners per job. Additionally, > the [GHA policy|https://infra.apache.org/github-actions-policy.html] is > referenced, which I wasn't aware of when working on the action workflow. > This issue is about applying the policy to the Flink GHA workflows. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (FLINK-34933) JobMasterServiceLeadershipRunnerTest#testResultFutureCompletionOfOutdatedLeaderIsIgnored isn't implemented properly
[ https://issues.apache.org/jira/browse/FLINK-34933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl resolved FLINK-34933. --- Fix Version/s: 1.18.2 1.20.0 1.19.1 Resolution: Fixed master: [1668a07276929416469392a35a77ba7699aac30b|https://github.com/apache/flink/commit/1668a07276929416469392a35a77ba7699aac30b] 1.19: [c11656a2406f07e2ae7cd6f80c46afb14385ee0e|https://github.com/apache/flink/commit/c11656a2406f07e2ae7cd6f80c46afb14385ee0e] 1.18: [94d1363c27e26fc8313721e138c7b4de744ca69e|https://github.com/apache/flink/commit/94d1363c27e26fc8313721e138c7b4de744ca69e] > JobMasterServiceLeadershipRunnerTest#testResultFutureCompletionOfOutdatedLeaderIsIgnored > isn't implemented properly > --- > > Key: FLINK-34933 > URL: https://issues.apache.org/jira/browse/FLINK-34933 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.17.2, 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Major > Labels: pull-request-available > Fix For: 1.18.2, 1.20.0, 1.19.1 > > > {{testResultFutureCompletionOfOutdatedLeaderIsIgnored}} doesn't test the > desired behavior: The {{TestingJobMasterService#closeAsync()}} callback > throws an {{UnsupportedOperationException}} by default which prevents the > test from properly finalizing the leadership revocation. > The test is still passing because the test checks implicitly for this error. > Instead, we should verify that the runner's resultFuture doesn't complete > until the runner is closed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
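The check described above boils down to an ordering assertion on a future's lifecycle: the result future must stay pending through leadership revocation and only complete on close. A minimal sketch of that pattern ({{Runner}} here is a hypothetical stand-in, not the actual {{JobMasterServiceLeadershipRunner}}):

```java
import java.util.concurrent.CompletableFuture;

public class RunnerLifecycleSketch {

    // Hypothetical runner: revoking leadership must NOT complete the result
    // future; only closing the runner may complete it.
    static final class Runner {
        final CompletableFuture<Void> resultFuture = new CompletableFuture<>();

        void revokeLeadership() {
            // intentionally leaves resultFuture pending
        }

        void close() {
            resultFuture.complete(null);
        }
    }

    public static void main(String[] args) {
        Runner runner = new Runner();
        runner.revokeLeadership();
        System.out.println(runner.resultFuture.isDone()); // prints "false": still pending
        runner.close();
        System.out.println(runner.resultFuture.isDone()); // prints "true": completed by close
    }
}
```

Asserting both states (pending after revocation, done after close) verifies the ordering directly instead of relying on an implicitly checked exception.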
[jira] [Resolved] (FLINK-33376) Extend Curator config option for Zookeeper configuration
[ https://issues.apache.org/jira/browse/FLINK-33376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl resolved FLINK-33376. --- Fix Version/s: 1.20.0 Release Note: Adds support for the following curator parameters: high-availability.zookeeper.client.authorization (curator parameter: authorization), high-availability.zookeeper.client.max-close-wait (curator parameter: maxCloseWaitMs), high-availability.zookeeper.client.simulated-session-expiration-percent (curator parameter: simulatedSessionExpirationPercent) Resolution: Fixed master: [83f82ab0c865a4fa9e119c96e11e0fb3df4a5ecd|https://github.com/apache/flink/commit/83f82ab0c865a4fa9e119c96e11e0fb3df4a5ecd] > Extend Curator config option for Zookeeper configuration > > > Key: FLINK-33376 > URL: https://issues.apache.org/jira/browse/FLINK-33376 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Reporter: Oleksandr Nitavskyi >Assignee: Oleksandr Nitavskyi >Priority: Major > Labels: pull-request-available > Fix For: 1.20.0 > > > In certain cases ZooKeeper requires additional authentication information. > For example, a list of valid [names for > ensemble|https://zookeeper.apache.org/doc/r3.8.0/zookeeperAdmin.html#:~:text=for%20secure%20authentication.-,zookeeper.ensembleAuthName,-%3A%20(Java%20system%20property] > in order to prevent accidentally connecting to the wrong ensemble. > Curator allows adding an additional AuthInfo object for such configuration. Thus > it would be useful to add an additional Map property which would allow > passing AuthInfo objects during Curator client creation. > *Acceptance Criteria:* For Flink users it is possible to configure the auth info > list for the Curator framework client. -- This message was sent by Atlassian Jira (v8.20.10#820010)
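Based on the option keys listed in the release note, such a setup could look like the following configuration fragment. The values are purely illustrative placeholders; in particular, the exact value syntax of the authorization map is an assumption here and would need to be checked against the Flink documentation:

```yaml
# Illustrative flink-conf.yaml fragment; values are placeholders, not recommendations.
high-availability.zookeeper.client.authorization: digest:username:password
high-availability.zookeeper.client.max-close-wait: 30 s
high-availability.zookeeper.client.simulated-session-expiration-percent: 100
```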
[jira] [Updated] (FLINK-33376) Extend Curator config option for Zookeeper configuration
[ https://issues.apache.org/jira/browse/FLINK-33376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-33376: -- Release Note: Adds support for the following curator parameters: high-availability.zookeeper.client.authorization (corresponding curator parameter: authorization), high-availability.zookeeper.client.max-close-wait (corresponding curator parameter: maxCloseWaitMs), high-availability.zookeeper.client.simulated-session-expiration-percent (corresponding curator parameter: simulatedSessionExpirationPercent). (was: Adds support for the following curator parameters: high-availability.zookeeper.client.authorization (curator parameter: authorization), high-availability.zookeeper.client.max-close-wait (curator parameter: maxCloseWaitMs), high-availability.zookeeper.client.simulated-session-expiration-percent (curator parameter: simulatedSessionExpirationPercent)) > Extend Curator config option for Zookeeper configuration > > > Key: FLINK-33376 > URL: https://issues.apache.org/jira/browse/FLINK-33376 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Reporter: Oleksandr Nitavskyi >Assignee: Oleksandr Nitavskyi >Priority: Major > Labels: pull-request-available > Fix For: 1.20.0 > > > In certain cases ZooKeeper requires additional authentication information. > For example, a list of valid [names for > ensemble|https://zookeeper.apache.org/doc/r3.8.0/zookeeperAdmin.html#:~:text=for%20secure%20authentication.-,zookeeper.ensembleAuthName,-%3A%20(Java%20system%20property] > in order to prevent accidentally connecting to the wrong ensemble. > Curator allows adding an additional AuthInfo object for such configuration. Thus > it would be useful to add an additional Map property which would allow > passing AuthInfo objects during Curator client creation. > *Acceptance Criteria:* For Flink users it is possible to configure the auth info > list for the Curator framework client. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (FLINK-34953) Add github ci for flink-web to auto commit build files
[ https://issues.apache.org/jira/browse/FLINK-34953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl reopened FLINK-34953: --- > Add github ci for flink-web to auto commit build files > -- > > Key: FLINK-34953 > URL: https://issues.apache.org/jira/browse/FLINK-34953 > Project: Flink > Issue Type: Improvement > Components: Project Website >Reporter: Zhongqiang Gong >Priority: Minor > Labels: website > > Currently, https://github.com/apache/flink-web commits build files from local > builds. So I want to use GitHub CI to build the docs and commit them. > > Changes: > * Add a website build check for PRs > * Auto-build and commit build files after a PR is merged to `asf-site` > * Optional: this CI can be triggered manually -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-34953) Add github ci for flink-web to auto commit build files
[ https://issues.apache.org/jira/browse/FLINK-34953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831665#comment-17831665 ] Matthias Pohl edited comment on FLINK-34953 at 3/28/24 9:52 AM: I guess we could do it. The [GitHub Actions Policy|https://infra.apache.org/github-actions-policy.html] excludes non-released artifacts like websites from the restriction: {quote}Automated services such as GitHub Actions (and Jenkins, BuildBot, etc.) MAY work on website content and other non-released data such as documentation and convenience binaries. Automated services MUST NOT push data to a repository or branch that is subject to official release as a software package by the project, unless the project secures specific prior authorization of the workflow from Infrastructure. {quote} Not sure whether they updated that one recently. Or do you have another source which is stricter, [~martijnvisser] ? was (Author: mapohl): I guess we could do it. The [GitHub Actions Policy|https://infra.apache.org/github-actions-policy.html] excludes non-released artifacts like website from the restriction: {quote}Automated services such as GitHub Actions (and Jenkins, BuildBot, etc.) MAY work on website content and other non-released data such as documentation and convenience binaries. Automated services MUST NOT push data to a repository or branch that is subject to official release as a software package by the project, unless the project secures specific prior authorization of the workflow from Infrastructure. {quote} Not sure whether they updated that one recently. Or do you have another source which is stricter, [~martijnvisser] ? 
> Add github ci for flink-web to auto commit build files > -- > > Key: FLINK-34953 > URL: https://issues.apache.org/jira/browse/FLINK-34953 > Project: Flink > Issue Type: Improvement > Components: Project Website >Reporter: Zhongqiang Gong >Priority: Minor > Labels: website > > Currently, https://github.com/apache/flink-web commits build files from local > builds, so I want to use GitHub CI to build the docs and commit them. > > Changes: > * Add a website build check for PRs > * Automatically build and commit the build files after a PR is merged to `asf-site` > * Optional: this CI can also be triggered manually -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34953) Add github ci for flink-web to auto commit build files
[ https://issues.apache.org/jira/browse/FLINK-34953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831665#comment-17831665 ] Matthias Pohl commented on FLINK-34953: --- I guess we could do it. The [GitHub Actions Policy|https://infra.apache.org/github-actions-policy.html] excludes non-released artifacts like websites from the restriction: {quote}Automated services such as GitHub Actions (and Jenkins, BuildBot, etc.) MAY work on website content and other non-released data such as documentation and convenience binaries. Automated services MUST NOT push data to a repository or branch that is subject to official release as a software package by the project, unless the project secures specific prior authorization of the workflow from Infrastructure. {quote} Not sure whether they updated that one recently. Or do you have another source which is stricter, [~martijnvisser] ? > Add github ci for flink-web to auto commit build files > -- > > Key: FLINK-34953 > URL: https://issues.apache.org/jira/browse/FLINK-34953 > Project: Flink > Issue Type: Improvement > Components: Project Website >Reporter: Zhongqiang Gong >Priority: Minor > Labels: website > > Currently, https://github.com/apache/flink-web commits build files from local > builds, so I want to use GitHub CI to build the docs and commit them. > > Changes: > * Add a website build check for PRs > * Automatically build and commit the build files after a PR is merged to `asf-site` > * Optional: this CI can also be triggered manually -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34937) Apache Infra GHA policy update
[ https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831659#comment-17831659 ] Matthias Pohl commented on FLINK-34937: --- Let's check https://github.com/assignUser/stash (which is provided by [~assignuser] from the Apache Arrow project and promoted in Apache Infra's roundtable group) to see whether our CI can benefit from it. > Apache Infra GHA policy update > -- > > Key: FLINK-34937 > URL: https://issues.apache.org/jira/browse/FLINK-34937 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Major > > There is a policy update [announced in the infra > ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which > asked Apache projects to limit the number of runners per job. Additionally, > the [GHA policy|https://infra.apache.org/github-actions-policy.html] is > referenced which I wasn't aware of when working on the action workflow. > This issue is about applying the policy to the Flink GHA workflows. -- This message was sent by Atlassian Jira (v8.20.10#820010)
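One common way to keep the number of concurrently held runners down, as the quoted policy asks, is a workflow-level {{concurrency}} group. The fragment below is a generic, hypothetical sketch of that pattern, not Flink's actual workflow configuration:

```yaml
# Hypothetical GitHub Actions fragment: group runs per workflow and branch,
# and cancel superseded in-flight runs so the project does not hold more
# runners than the Apache Infra policy allows.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```

Whether cancelling in-progress runs is acceptable depends on the workflow; for release branches a project may prefer queuing over cancellation.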
[jira] [Assigned] (FLINK-34551) Align retry mechanisms of FutureUtils
[ https://issues.apache.org/jira/browse/FLINK-34551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl reassigned FLINK-34551: - Assignee: Matthias Pohl (was: Kumar Mallikarjuna) > Align retry mechanisms of FutureUtils > - > > Key: FLINK-34551 > URL: https://issues.apache.org/jira/browse/FLINK-34551 > Project: Flink > Issue Type: Technical Debt > Components: API / Core >Affects Versions: 1.20.0 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Major > Labels: pull-request-available > > The retry mechanisms of FutureUtils include quite a bit of redundant code > which makes it hard to understand and to extend. The logic should be aligned > properly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34551) Align retry mechanisms of FutureUtils
[ https://issues.apache.org/jira/browse/FLINK-34551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831657#comment-17831657 ] Matthias Pohl commented on FLINK-34551: --- The intention of this ticket came from FLINK-34227 where I wanted to add logic for retrying forever. I managed to split the {{retrySuccessfulOperationWithDelay}} in FLINK-34227 in a way now that I didn't generate too much additional redundant code. I created FLINK-34551 as a follow-up anyway because I noticed that {{retrySuccessfulOperationWithDelay}} and {{retryOperation}} share some common logic and that we could improve the way how these methods decide on which executor to run the {{operation}} on (scheduledExecutor vs calling thread). Your current proposal has still redundant code. We would need to iterate over the change a bit more and discuss the contract of these methods in more detail. But unfortunately, I am gone for quite a bit soon. So, I would not be able to help you. Additionally, it's not a high-priority task right. I'm wondering whether we should unassign the task again. I want to avoid that you spend time on it and then get stuck because of missing feedback from my side. I should have considered it yesterday already. Sorry for that. > Align retry mechanisms of FutureUtils > - > > Key: FLINK-34551 > URL: https://issues.apache.org/jira/browse/FLINK-34551 > Project: Flink > Issue Type: Technical Debt > Components: API / Core >Affects Versions: 1.20.0 >Reporter: Matthias Pohl >Assignee: Kumar Mallikarjuna >Priority: Major > Labels: pull-request-available > > The retry mechanisms of FutureUtils include quite a bit of redundant code > which makes it hard to understand and to extend. The logic should be aligned > properly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
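The kind of shared retry core alluded to in the discussion above could be sketched roughly as follows. This is NOT Flink's actual {{FutureUtils}} API; the class, method names, and signatures below are hypothetical, and a real version would additionally thread through the delay strategy and the executor choice (scheduledExecutor vs. calling thread) that the comment mentions:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical sketch of a single retry core that both retryOperation and
// retrySuccessfulOperationWithDelay could delegate to, instead of each
// carrying its own copy of the retry loop.
public class RetrySketch {

    public static <T> CompletableFuture<T> retry(
            Supplier<CompletableFuture<T>> operation,
            int retries,
            Predicate<Throwable> retryable) {
        return operation
                .get()
                .handle((result, error) -> {
                    if (error == null) {
                        return CompletableFuture.completedFuture(result);
                    }
                    if (retries > 0 && retryable.test(unwrap(error))) {
                        // Recurse with one attempt fewer; a delay/executor
                        // hook would be threaded through here in a real version.
                        return retry(operation, retries - 1, retryable);
                    }
                    CompletableFuture<T> failed = new CompletableFuture<>();
                    failed.completeExceptionally(error);
                    return failed;
                })
                .thenCompose(f -> f);
    }

    private static Throwable unwrap(Throwable t) {
        return (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        String result =
                retry(
                        () -> {
                            attempts[0]++;
                            return attempts[0] < 3
                                    ? CompletableFuture.<String>failedFuture(
                                            new RuntimeException("transient"))
                                    : CompletableFuture.completedFuture("ok");
                        },
                        5,
                        t -> true)
                .join();
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

A "retry forever" variant then becomes a thin wrapper that passes an unbounded attempt budget, rather than a separately maintained loop.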
[jira] [Comment Edited] (FLINK-34937) Apache Infra GHA policy update
[ https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831422#comment-17831422 ] Matthias Pohl edited comment on FLINK-34937 at 3/27/24 3:45 PM: We should pin all actions (i.e. use the git SHA rather than a version tag) for external actions (anything other than {{actions/\*}}, {{github/\*}} and {{apache/\*}} prefixed actions). That's not the case right now. was (Author: mapohl): We should pin all actions (i.e. use the git SHA rather than a version tag) for external actions (anything other than {{actions/*}}, {{github/*}} and {{apache/*}} prefixed actions). That's not the case right now. > Apache Infra GHA policy update > -- > > Key: FLINK-34937 > URL: https://issues.apache.org/jira/browse/FLINK-34937 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Major > > There is a policy update [announced in the infra > ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which > asked Apache projects to limit the number of runners per job. Additionally, > the [GHA policy|https://infra.apache.org/github-actions-policy.html] is > referenced which I wasn't aware of when working on the action workflow. > This issue is about applying the policy to the Flink GHA workflows. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34937) Apache Infra GHA policy update
[ https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831422#comment-17831422 ] Matthias Pohl commented on FLINK-34937: --- We should pin all actions (i.e. use the git SHA rather than a version tag) for external actions (anything other than {{actions/*}}, {{github/*}} and {{apache/*}} prefixed actions). That's not the case right now. > Apache Infra GHA policy update > -- > > Key: FLINK-34937 > URL: https://issues.apache.org/jira/browse/FLINK-34937 > Project: Flink > Issue Type: Sub-task > Components: Build System / CI >Affects Versions: 1.19.0, 1.18.1, 1.20.0 >Reporter: Matthias Pohl >Priority: Major > > There is a policy update [announced in the infra > ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which > asked Apache projects to limit the number of runners per job. Additionally, > the [GHA policy|https://infra.apache.org/github-actions-policy.html] is > referenced which I wasn't aware of when working on the action workflow. > This issue is about applying the policy to the Flink GHA workflows. -- This message was sent by Atlassian Jira (v8.20.10#820010)
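The pinning described above can be illustrated with a short workflow fragment. This is a hypothetical example; the external action name and the commit SHA are placeholders, not a real repository:

```yaml
steps:
  # Actions under actions/, github/ and apache/ prefixes may keep version tags:
  - uses: actions/checkout@v4

  # External actions should be pinned to a full commit SHA instead of a
  # mutable tag. Repository and SHA below are made up for the example;
  # the trailing comment records which release the SHA corresponds to.
  - uses: some-org/some-action@5c0c454c4d6a2e2e6cd6e7f5f6a9f1b2c3d4e5f6  # v1.2.3
```

Pinning to a SHA protects against a tag being force-pushed to point at different (possibly malicious) code after review.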
[jira] [Resolved] (FLINK-34419) flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21
[ https://issues.apache.org/jira/browse/FLINK-34419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl resolved FLINK-34419. --- Resolution: Fixed > flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21 > --- > > Key: FLINK-34419 > URL: https://issues.apache.org/jira/browse/FLINK-34419 > Project: Flink > Issue Type: Technical Debt > Components: Build System / CI >Reporter: Matthias Pohl >Assignee: Muhammet Orazov >Priority: Major > Labels: pull-request-available, starter > > [.github/workflows/snapshot.yml|https://github.com/apache/flink-docker/blob/master/.github/workflows/snapshot.yml#L40] > needs to be updated: JDK 17 support was added in 1.18 (FLINK-15736). JDK 21 > support was added in 1.19 (FLINK-33163) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-34419) flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21
[ https://issues.apache.org/jira/browse/FLINK-34419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831391#comment-17831391 ] Matthias Pohl edited comment on FLINK-34419 at 3/27/24 2:56 PM: master: 9e0041a2c9dace4bf3f32815e3e24e24385b179b dev-master: 1460077743b29e17edd0a2d7efd3897fa097988d dev-1.19: 67d7c46ed382a665e941f0cf1f1606d10f87dee5 dev-1.18: d93d911b015e535fc2b6f1426c3b36229ff3d02a was (Author: mapohl): master: 9e0041a2c9dace4bf3f32815e3e24e24385b179b dev-master: tba dev-1.19: tba dev-1.18: tba > flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21 > --- > > Key: FLINK-34419 > URL: https://issues.apache.org/jira/browse/FLINK-34419 > Project: Flink > Issue Type: Technical Debt > Components: Build System / CI >Reporter: Matthias Pohl >Assignee: Muhammet Orazov >Priority: Major > Labels: pull-request-available, starter > > [.github/workflows/snapshot.yml|https://github.com/apache/flink-docker/blob/master/.github/workflows/snapshot.yml#L40] > needs to be updated: JDK 17 support was added in 1.18 (FLINK-15736). JDK 21 > support was added in 1.19 (FLINK-33163) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34419) flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21
[ https://issues.apache.org/jira/browse/FLINK-34419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831391#comment-17831391 ] Matthias Pohl commented on FLINK-34419: --- master: 9e0041a2c9dace4bf3f32815e3e24e24385b179b dev-master: tba dev-1.19: tba dev-1.18: tba > flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21 > --- > > Key: FLINK-34419 > URL: https://issues.apache.org/jira/browse/FLINK-34419 > Project: Flink > Issue Type: Technical Debt > Components: Build System / CI >Reporter: Matthias Pohl >Assignee: Muhammet Orazov >Priority: Major > Labels: pull-request-available, starter > > [.github/workflows/snapshot.yml|https://github.com/apache/flink-docker/blob/master/.github/workflows/snapshot.yml#L40] > needs to be updated: JDK 17 support was added in 1.18 (FLINK-15736). JDK 21 > support was added in 1.19 (FLINK-33163) -- This message was sent by Atlassian Jira (v8.20.10#820010)