[jira] [Assigned] (FLINK-35552) Move CheckpointStatsTracker out of ExecutionGraph into Scheduler

2024-06-19 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-35552:
-

Assignee: Matthias Pohl

> Move CheckpointStatsTracker out of ExecutionGraph into Scheduler
> 
>
> Key: FLINK-35552
> URL: https://issues.apache.org/jira/browse/FLINK-35552
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> The scheduler needs to know about the CheckpointStatsTracker to allow 
> listening to checkpoint failures and completion.
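
The listening relationship described above can be sketched as a plain observer pattern. This is a hypothetical illustration of the shape of the change, not Flink's actual API; all interface and class names below are invented for the sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a stats tracker the scheduler can subscribe to
// for checkpoint completion/failure events.
public class CheckpointEventsSketch {

    interface CheckpointListener {
        void onCompleted(long checkpointId);
        void onFailed(long checkpointId, Throwable cause);
    }

    static class StatsTracker {
        private final List<CheckpointListener> listeners = new ArrayList<>();

        void register(CheckpointListener l) {
            listeners.add(l);
        }

        // Called by the checkpointing machinery; fans the event out to listeners.
        void reportCompleted(long id) {
            for (CheckpointListener l : listeners) l.onCompleted(id);
        }

        void reportFailed(long id, Throwable cause) {
            for (CheckpointListener l : listeners) l.onFailed(id, cause);
        }
    }

    public static void main(String[] args) {
        StatsTracker tracker = new StatsTracker();
        final int[] completed = {0};
        final int[] failed = {0};
        // The "scheduler" registers itself to observe checkpoint outcomes.
        tracker.register(new CheckpointListener() {
            public void onCompleted(long id) { completed[0]++; }
            public void onFailed(long id, Throwable cause) { failed[0]++; }
        });
        tracker.reportCompleted(1L);
        tracker.reportFailed(2L, new RuntimeException("checkpoint declined"));
        System.out.println(completed[0] + " completed, " + failed[0] + " failed");
    }
}
```

Moving the tracker to the scheduler makes this kind of subscription possible without the scheduler reaching into the ExecutionGraph.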



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-35551) Introduces RescaleManager#onTrigger endpoint

2024-06-19 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-35551:
-

Assignee: Matthias Pohl

> Introduces RescaleManager#onTrigger endpoint
> 
>
> Key: FLINK-35551
> URL: https://issues.apache.org/jira/browse/FLINK-35551
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> The new endpoint would allow us to separate observing change events from 
> actually triggering the rescale operation.





[jira] [Assigned] (FLINK-35553) Integrate newly added trigger interface with checkpointing

2024-06-19 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-35553:
-

Assignee: Matthias Pohl

> Integrate newly added trigger interface with checkpointing
> --
>
> Key: FLINK-35553
> URL: https://issues.apache.org/jira/browse/FLINK-35553
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> This connects the newly introduced trigger logic (FLINK-35551) with the 
> {{CheckpointStatsTracker}}.





[jira] [Comment Edited] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager

2024-06-19 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856230#comment-17856230
 ] 

Matthias Pohl edited comment on FLINK-35639 at 6/19/24 9:53 AM:


[~chesnay] pointed me to the actual issue. I was initially wondering why the 
change in FLINK-32570 was "overlooked" by our {{japicmp}} checks. The 
problem is that you're not following the supported process (as 
documented in the [Flink 
docs|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#restarting-streaming-applications]).
 That results in incompatibilities with internal APIs (the constructor in 
question is package-private). Please use savepoints to migrate jobs. There are 
other internal APIs (the JobGraph itself isn't a stable API, either) that might 
cause problems in your upgrade process.
 # Create a savepoint of the job in the old version.
 # Start the Flink cluster with the upgraded Flink version.
 # Submit the job using the created savepoint to restart the job using the job 
client of the new Flink binaries (to allow for proper JobGraph creation).


was (Author: mapohl):
[~chesnay] pointed me to the actual issue because I was wondering why the 
change in FLINK-32570 was actually "overlooked" by our {{japicmp}} checks. The 
problem is that you're actually not following the supported process (as 
documented in the [Flink 
docs|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#restarting-streaming-applications]).
 That results in incompatibilities of internal APIs (the constructor in 
question is package-private). Please use savepoints to migrate jobs. There are 
other internal APIs (the JobGraph itself isn't a stable API, either) that might 
cause problems in your upgrade process.
 # Create a savepoint of the job in the old version.
 # Start the Flink cluster with the upgraded Flink version.
 # Submit the job using the created savepoint to restart the job using the job 
client of the new Flink binaries (to allow for proper JobGraph creation).

> upgrade to 1.19 with job in HA state with restart strategy crashes job manager
> --
>
> Key: FLINK-35639
> URL: https://issues.apache.org/jira/browse/FLINK-35639
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.20.0, 1.19.1
> Environment: Download the 1.18 and 1.19 binary releases. Add the 
> following to flink-1.19.0/conf/config.yaml and 
> flink-1.18.1/conf/flink-conf.yaml:
> {code:yaml}
> high-availability: zookeeper
> high-availability.zookeeper.quorum: localhost
> high-availability.storageDir: file:///tmp/flink/recovery
> {code}
> Launch zookeeper: {{docker run --network host zookeeper:latest}}
> Launch the 1.18 task manager: {{./flink-1.18.1/bin/taskmanager.sh start-foreground}}
> Launch the 1.18 job manager: {{./flink-1.18.1/bin/jobmanager.sh start-foreground}}
> Launch the following job:
> {code:java}
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.api.common.functions.FlatMapFunction;
> import org.apache.flink.util.Collector;
> import org.apache.flink.api.common.restartstrategy.RestartStrategies;
> import org.apache.flink.api.common.time.Time;
> import java.util.concurrent.TimeUnit;
>
> public class FlinkJob {
>     public static void main(String[] args) throws Exception {
>         final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>         env.setRestartStrategy(
>             RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.of(20, TimeUnit.SECONDS)));
>         env.fromElements("Hello World", "Hello Flink")
>             .flatMap(new LineSplitter())
>             .groupBy(0)
>             .sum(1)
>             .print();
>     }
>
>     public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
>         @Override
>         public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
>             for (String word : value.split(" ")) {
>                 try {
>                     Thread.sleep(12);
>                 } catch (InterruptedException e) {
>                     e.printStackTrace();
>                 }
>                 out.collect(new Tuple2<>(word, 1));
>             }
>         }
>     }
> }
> {code}
> The accompanying pom.xml (flattened in the original report) declares flink-java 
> and flink-streaming-java as dependencies (flink.version 1.18.1, Java 1.8), 
> maven-compiler-plugin 3.8.1, and maven-jar-plugin 3.1.0 with main class FlinkJob 
> and classpath prefix lib/.
> Launch the job: {{./flink-1.18.1/bin/flink run ../flink-job/target/myflinkjob-1.0-SNAPSHOT.jar}}
> Job has been submitted with JobID 5f0898c964a93a47aa480427f3e2c6c0
> Kill the job manager and task manager, then launch the 1.19.0 job manager: 
> {{./flink-1.19.0/bin/jobmanager.sh start-foreground}}
> Root cause: It looks like the type of 
> delayBetweenAttemptsInterval was changed in 1.19 
> 

[jira] [Updated] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager

2024-06-19 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-35639:
--
Priority: Major  (was: Blocker)

> upgrade to 1.19 with job in HA state with restart strategy crashes job manager
> --
>
> Key: FLINK-35639
> URL: https://issues.apache.org/jira/browse/FLINK-35639
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.20.0, 1.19.1
> Environment: Download the 1.18 and 1.19 binary releases. Add the 
> following to flink-1.19.0/conf/config.yaml and 
> flink-1.18.1/conf/flink-conf.yaml:
> {code:yaml}
> high-availability: zookeeper
> high-availability.zookeeper.quorum: localhost
> high-availability.storageDir: file:///tmp/flink/recovery
> {code}
> Launch zookeeper: {{docker run --network host zookeeper:latest}}
> Launch the 1.18 task manager: {{./flink-1.18.1/bin/taskmanager.sh start-foreground}}
> Launch the 1.18 job manager: {{./flink-1.18.1/bin/jobmanager.sh start-foreground}}
> Launch the following job:
> {code:java}
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.api.common.functions.FlatMapFunction;
> import org.apache.flink.util.Collector;
> import org.apache.flink.api.common.restartstrategy.RestartStrategies;
> import org.apache.flink.api.common.time.Time;
> import java.util.concurrent.TimeUnit;
>
> public class FlinkJob {
>     public static void main(String[] args) throws Exception {
>         final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>         env.setRestartStrategy(
>             RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.of(20, TimeUnit.SECONDS)));
>         env.fromElements("Hello World", "Hello Flink")
>             .flatMap(new LineSplitter())
>             .groupBy(0)
>             .sum(1)
>             .print();
>     }
>
>     public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
>         @Override
>         public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
>             for (String word : value.split(" ")) {
>                 try {
>                     Thread.sleep(12);
>                 } catch (InterruptedException e) {
>                     e.printStackTrace();
>                 }
>                 out.collect(new Tuple2<>(word, 1));
>             }
>         }
>     }
> }
> {code}
> The accompanying pom.xml (flattened in the original report) declares flink-java 
> and flink-streaming-java as dependencies (flink.version 1.18.1, Java 1.8), 
> maven-compiler-plugin 3.8.1, and maven-jar-plugin 3.1.0 with main class FlinkJob 
> and classpath prefix lib/.
> Launch the job: {{./flink-1.18.1/bin/flink run ../flink-job/target/myflinkjob-1.0-SNAPSHOT.jar}}
> Job has been submitted with JobID 5f0898c964a93a47aa480427f3e2c6c0
> Kill the job manager and task manager, then launch the 1.19.0 job manager: 
> {{./flink-1.19.0/bin/jobmanager.sh start-foreground}}
> Root cause: It looks like the type of delayBetweenAttemptsInterval was changed in 1.19 
> (https://github.com/apache/flink/pull/22984/files#diff-d174f32ffdea69de610c4f37c545bd22a253b9846434f83397f1bbc2aaa399faR239),
> introducing an incompatibility which is not handled by Flink 1.19. In my 
> opinion, the job manager should not crash when starting in that case. 
>Reporter: yazgoo
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> When trying to upgrade a flink cluster from 1.18 to 1.19, with a 1.18 job in 
> zookeeper HA state, I have a jobmanager crash with a ClassCastException, see 
> log below  
>  
> {code:java}
> 2024-06-18 16:58:14,401 ERROR 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error 
> occurred in the cluster entrypoint. org.apache.flink.util.FlinkException: 
> JobMaster for job 5f0898c964a93a47aa480427f3e2c6c0 failed.     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:1484)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:775)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:738)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$7(Dispatcher.java:693)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934) 
> ~[?:?]     at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
>  ~[?:?]     at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)
>  ~[?:?]     at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:451)
>  ~[flink-rpc-akka84eb9e64-a1ce-450c-ad53-d9fa579b67e1.jar:1.19.0]     at 
> 

[jira] [Closed] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager

2024-06-19 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl closed FLINK-35639.
-
Resolution: Not A Problem

I'm closing this issue and the related PRs because we don't actually support 
this kind of version upgrade in general. Fixing the {{RestartStrategy}} issue 
wouldn't necessarily solve the overall problem.
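
The failure mode behind this ticket (a field's declared type changing between the version that wrote the HA state and the version reading it) can be reproduced in miniature with plain Java serialization. The class and key names below are illustrative stand-ins, not Flink's internals:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

public class StateTypeChangeDemo {

    // "Old version" writes its state with the delay stored as a Long.
    static byte[] writeOldState() throws Exception {
        Map<String, Object> state = new HashMap<>();
        state.put("delayBetweenAttemptsInterval", 20_000L);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(state);
        }
        return buf.toByteArray();
    }

    // "New version" reads the same bytes but now expects a String:
    // the cast fails at runtime, analogous to the JobManager crash in the report.
    static String readWithChangedType(byte[] bytes) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            @SuppressWarnings("unchecked")
            Map<String, Object> restored = (Map<String, Object>) in.readObject();
            try {
                return (String) restored.get("delayBetweenAttemptsInterval");
            } catch (ClassCastException e) {
                return "ClassCastException";
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readWithChangedType(writeOldState()));
    }
}
```

Which is why the recommended path is not to deserialize old internal state at all, but to take a savepoint and resubmit the job with the new version's client.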

> upgrade to 1.19 with job in HA state with restart strategy crashes job manager
> --
>
> Key: FLINK-35639
> URL: https://issues.apache.org/jira/browse/FLINK-35639
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.20.0, 1.19.1
> Environment: Download the 1.18 and 1.19 binary releases. Add the 
> following to flink-1.19.0/conf/config.yaml and 
> flink-1.18.1/conf/flink-conf.yaml:
> {code:yaml}
> high-availability: zookeeper
> high-availability.zookeeper.quorum: localhost
> high-availability.storageDir: file:///tmp/flink/recovery
> {code}
> Launch zookeeper: {{docker run --network host zookeeper:latest}}
> Launch the 1.18 task manager: {{./flink-1.18.1/bin/taskmanager.sh start-foreground}}
> Launch the 1.18 job manager: {{./flink-1.18.1/bin/jobmanager.sh start-foreground}}
> Launch the following job:
> {code:java}
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.api.common.functions.FlatMapFunction;
> import org.apache.flink.util.Collector;
> import org.apache.flink.api.common.restartstrategy.RestartStrategies;
> import org.apache.flink.api.common.time.Time;
> import java.util.concurrent.TimeUnit;
>
> public class FlinkJob {
>     public static void main(String[] args) throws Exception {
>         final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>         env.setRestartStrategy(
>             RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.of(20, TimeUnit.SECONDS)));
>         env.fromElements("Hello World", "Hello Flink")
>             .flatMap(new LineSplitter())
>             .groupBy(0)
>             .sum(1)
>             .print();
>     }
>
>     public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
>         @Override
>         public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
>             for (String word : value.split(" ")) {
>                 try {
>                     Thread.sleep(12);
>                 } catch (InterruptedException e) {
>                     e.printStackTrace();
>                 }
>                 out.collect(new Tuple2<>(word, 1));
>             }
>         }
>     }
> }
> {code}
> The accompanying pom.xml (flattened in the original report) declares flink-java 
> and flink-streaming-java as dependencies (flink.version 1.18.1, Java 1.8), 
> maven-compiler-plugin 3.8.1, and maven-jar-plugin 3.1.0 with main class FlinkJob 
> and classpath prefix lib/.
> Launch the job: {{./flink-1.18.1/bin/flink run ../flink-job/target/myflinkjob-1.0-SNAPSHOT.jar}}
> Job has been submitted with JobID 5f0898c964a93a47aa480427f3e2c6c0
> Kill the job manager and task manager, then launch the 1.19.0 job manager: 
> {{./flink-1.19.0/bin/jobmanager.sh start-foreground}}
> Root cause: It looks like the type of delayBetweenAttemptsInterval was changed in 1.19 
> (https://github.com/apache/flink/pull/22984/files#diff-d174f32ffdea69de610c4f37c545bd22a253b9846434f83397f1bbc2aaa399faR239),
> introducing an incompatibility which is not handled by Flink 1.19. In my 
> opinion, the job manager should not crash when starting in that case. 
>Reporter: yazgoo
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
>
> When trying to upgrade a flink cluster from 1.18 to 1.19, with a 1.18 job in 
> zookeeper HA state, I have a jobmanager crash with a ClassCastException, see 
> log below  
>  
> {code:java}
> 2024-06-18 16:58:14,401 ERROR 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error 
> occurred in the cluster entrypoint. org.apache.flink.util.FlinkException: 
> JobMaster for job 5f0898c964a93a47aa480427f3e2c6c0 failed.     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:1484)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:775)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:738)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$7(Dispatcher.java:693)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934) 
> ~[?:?]     at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
>  ~[?:?]     at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)
>  ~[?:?]     at 
> 

[jira] [Comment Edited] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager

2024-06-19 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856230#comment-17856230
 ] 

Matthias Pohl edited comment on FLINK-35639 at 6/19/24 9:47 AM:


[~chesnay] pointed me to the actual issue because I was wondering why the 
change in FLINK-32570 was actually "overlooked" by our {{japicmp}} checks. The 
problem is that you're actually not following the supported process (as 
documented in the [Flink 
docs|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#restarting-streaming-applications]).
 That results in incompatibilities of internal APIs (the constructor in 
question is package-private). Please use savepoints to migrate jobs. There are 
other internal APIs (the JobGraph itself isn't a stable API, either) that might 
cause problems in your upgrade process.
 # Create a savepoint of the job in the old version.
 # Start the Flink cluster with the upgraded Flink version.
 # Submit the job using the created savepoint to restart the job using the job 
client of the new Flink binaries (to allow for proper JobGraph creation).


was (Author: mapohl):
[~chesnay] pointed me to the actual issue because I was wondering why the 
change in FLINK-32570 was actually "overlooked" by our {{japicmp}} checks. The 
problem is that you're actually not following the supported process (as 
documented in the [Flink 
docs|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#restarting-streaming-applications]).
 That results in incompatibilities of internal APIs (the constructor in 
question is package-private). Please use savepoints to migrate jobs. There are 
other internal APIs (the JobGraph itself isn't a stable API, either) that might 
cause problems in your upgrade process.
 # Create a savepoint of the job in the old version.
 # Start the Flink cluster with the upgraded Flink version.
 # Submit the job using the created savepoint to restart the job.

> upgrade to 1.19 with job in HA state with restart strategy crashes job manager
> --
>
> Key: FLINK-35639
> URL: https://issues.apache.org/jira/browse/FLINK-35639
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.20.0, 1.19.1
> Environment: Download the 1.18 and 1.19 binary releases. Add the 
> following to flink-1.19.0/conf/config.yaml and 
> flink-1.18.1/conf/flink-conf.yaml:
> {code:yaml}
> high-availability: zookeeper
> high-availability.zookeeper.quorum: localhost
> high-availability.storageDir: file:///tmp/flink/recovery
> {code}
> Launch zookeeper: {{docker run --network host zookeeper:latest}}
> Launch the 1.18 task manager: {{./flink-1.18.1/bin/taskmanager.sh start-foreground}}
> Launch the 1.18 job manager: {{./flink-1.18.1/bin/jobmanager.sh start-foreground}}
> Launch the following job:
> {code:java}
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.api.common.functions.FlatMapFunction;
> import org.apache.flink.util.Collector;
> import org.apache.flink.api.common.restartstrategy.RestartStrategies;
> import org.apache.flink.api.common.time.Time;
> import java.util.concurrent.TimeUnit;
>
> public class FlinkJob {
>     public static void main(String[] args) throws Exception {
>         final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>         env.setRestartStrategy(
>             RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.of(20, TimeUnit.SECONDS)));
>         env.fromElements("Hello World", "Hello Flink")
>             .flatMap(new LineSplitter())
>             .groupBy(0)
>             .sum(1)
>             .print();
>     }
>
>     public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
>         @Override
>         public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
>             for (String word : value.split(" ")) {
>                 try {
>                     Thread.sleep(12);
>                 } catch (InterruptedException e) {
>                     e.printStackTrace();
>                 }
>                 out.collect(new Tuple2<>(word, 1));
>             }
>         }
>     }
> }
> {code}
> The accompanying pom.xml (flattened in the original report) declares flink-java 
> and flink-streaming-java as dependencies (flink.version 1.18.1, Java 1.8), 
> maven-compiler-plugin 3.8.1, and maven-jar-plugin 3.1.0 with main class FlinkJob 
> and classpath prefix lib/.
> Launch the job: {{./flink-1.18.1/bin/flink run ../flink-job/target/myflinkjob-1.0-SNAPSHOT.jar}}
> Job has been submitted with JobID 5f0898c964a93a47aa480427f3e2c6c0
> Kill the job manager and task manager, then launch the 1.19.0 job manager: 
> {{./flink-1.19.0/bin/jobmanager.sh start-foreground}}
> Root cause: It looks like the type of delayBetweenAttemptsInterval was changed in 1.19 
> (https://github.com/apache/flink/pull/22984/files#diff-d174f32ffdea69de610c4f37c545bd22a253b9846434f83397f1bbc2aaa399faR239),
> introducing an incompatibility which is 

[jira] [Commented] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager

2024-06-19 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856230#comment-17856230
 ] 

Matthias Pohl commented on FLINK-35639:
---

[~chesnay] pointed me to the actual issue because I was wondering why the 
change in FLINK-32570 was actually "overlooked" by our {{japicmp}} checks. The 
problem is that you're actually not following the supported process (as 
documented in the [Flink 
docs|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#restarting-streaming-applications]).
 That results in incompatibilities of internal APIs (the constructor in 
question is package-private). Please use savepoints to migrate jobs. There are 
other internal APIs (the JobGraph itself isn't a stable API, either) that might 
cause problems in your upgrade process.
 # Create a savepoint of the job in the old version.
 # Start the Flink cluster with the upgraded Flink version.
 # Submit the job using the created savepoint to restart the job.

> upgrade to 1.19 with job in HA state with restart strategy crashes job manager
> --
>
> Key: FLINK-35639
> URL: https://issues.apache.org/jira/browse/FLINK-35639
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.20.0, 1.19.1
> Environment: Download the 1.18 and 1.19 binary releases. Add the 
> following to flink-1.19.0/conf/config.yaml and 
> flink-1.18.1/conf/flink-conf.yaml:
> {code:yaml}
> high-availability: zookeeper
> high-availability.zookeeper.quorum: localhost
> high-availability.storageDir: file:///tmp/flink/recovery
> {code}
> Launch zookeeper: {{docker run --network host zookeeper:latest}}
> Launch the 1.18 task manager: {{./flink-1.18.1/bin/taskmanager.sh start-foreground}}
> Launch the 1.18 job manager: {{./flink-1.18.1/bin/jobmanager.sh start-foreground}}
> Launch the following job:
> {code:java}
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.api.common.functions.FlatMapFunction;
> import org.apache.flink.util.Collector;
> import org.apache.flink.api.common.restartstrategy.RestartStrategies;
> import org.apache.flink.api.common.time.Time;
> import java.util.concurrent.TimeUnit;
>
> public class FlinkJob {
>     public static void main(String[] args) throws Exception {
>         final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>         env.setRestartStrategy(
>             RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.of(20, TimeUnit.SECONDS)));
>         env.fromElements("Hello World", "Hello Flink")
>             .flatMap(new LineSplitter())
>             .groupBy(0)
>             .sum(1)
>             .print();
>     }
>
>     public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
>         @Override
>         public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
>             for (String word : value.split(" ")) {
>                 try {
>                     Thread.sleep(12);
>                 } catch (InterruptedException e) {
>                     e.printStackTrace();
>                 }
>                 out.collect(new Tuple2<>(word, 1));
>             }
>         }
>     }
> }
> {code}
> The accompanying pom.xml (flattened in the original report) declares flink-java 
> and flink-streaming-java as dependencies (flink.version 1.18.1, Java 1.8), 
> maven-compiler-plugin 3.8.1, and maven-jar-plugin 3.1.0 with main class FlinkJob 
> and classpath prefix lib/.
> Launch the job: {{./flink-1.18.1/bin/flink run ../flink-job/target/myflinkjob-1.0-SNAPSHOT.jar}}
> Job has been submitted with JobID 5f0898c964a93a47aa480427f3e2c6c0
> Kill the job manager and task manager, then launch the 1.19.0 job manager: 
> {{./flink-1.19.0/bin/jobmanager.sh start-foreground}}
> Root cause: It looks like the type of delayBetweenAttemptsInterval was changed in 1.19 
> (https://github.com/apache/flink/pull/22984/files#diff-d174f32ffdea69de610c4f37c545bd22a253b9846434f83397f1bbc2aaa399faR239),
> introducing an incompatibility which is not handled by Flink 1.19. In my 
> opinion, the job manager should not crash when starting in that case. 
>Reporter: yazgoo
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
>
> When trying to upgrade a flink cluster from 1.18 to 1.19, with a 1.18 job in 
> zookeeper HA state, I have a jobmanager crash with a ClassCastException, see 
> log below  
>  
> {code:java}
> 2024-06-18 16:58:14,401 ERROR 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error 
> occurred in the cluster entrypoint. org.apache.flink.util.FlinkException: 
> JobMaster for job 5f0898c964a93a47aa480427f3e2c6c0 failed.     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:1484)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:775)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> 

[jira] [Assigned] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager

2024-06-18 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-35639:
-

Assignee: Matthias Pohl

> upgrade to 1.19 with job in HA state with restart strategy crashes job manager
> --
>
> Key: FLINK-35639
> URL: https://issues.apache.org/jira/browse/FLINK-35639
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.19.1
> Environment: Download the 1.18 and 1.19 binary releases. Add the 
> following to flink-1.19.0/conf/config.yaml and 
> flink-1.18.1/conf/flink-conf.yaml:
> {code:yaml}
> high-availability: zookeeper
> high-availability.zookeeper.quorum: localhost
> high-availability.storageDir: file:///tmp/flink/recovery
> {code}
> Launch zookeeper: {{docker run --network host zookeeper:latest}}
> Launch the 1.18 task manager: {{./flink-1.18.1/bin/taskmanager.sh start-foreground}}
> Launch the 1.18 job manager: {{./flink-1.18.1/bin/jobmanager.sh start-foreground}}
> Launch the following job:
> {code:java}
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.api.common.functions.FlatMapFunction;
> import org.apache.flink.util.Collector;
> import org.apache.flink.api.common.restartstrategy.RestartStrategies;
> import org.apache.flink.api.common.time.Time;
> import java.util.concurrent.TimeUnit;
>
> public class FlinkJob {
>     public static void main(String[] args) throws Exception {
>         final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>         env.setRestartStrategy(
>             RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.of(20, TimeUnit.SECONDS)));
>         env.fromElements("Hello World", "Hello Flink")
>             .flatMap(new LineSplitter())
>             .groupBy(0)
>             .sum(1)
>             .print();
>     }
>
>     public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
>         @Override
>         public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
>             for (String word : value.split(" ")) {
>                 try {
>                     Thread.sleep(12);
>                 } catch (InterruptedException e) {
>                     e.printStackTrace();
>                 }
>                 out.collect(new Tuple2<>(word, 1));
>             }
>         }
>     }
> }
> {code}
> The accompanying pom.xml (flattened in the original report) declares flink-java 
> and flink-streaming-java as dependencies (flink.version 1.18.1, Java 1.8), 
> maven-compiler-plugin 3.8.1, and maven-jar-plugin 3.1.0 with main class FlinkJob 
> and classpath prefix lib/.
> Launch the job: {{./flink-1.18.1/bin/flink run ../flink-job/target/myflinkjob-1.0-SNAPSHOT.jar}}
> Job has been submitted with JobID 5f0898c964a93a47aa480427f3e2c6c0
> Kill the job manager and task manager, then launch the 1.19.0 job manager: 
> {{./flink-1.19.0/bin/jobmanager.sh start-foreground}}
> Root cause: It looks like the type of delayBetweenAttemptsInterval was changed in 1.19 
> (https://github.com/apache/flink/pull/22984/files#diff-d174f32ffdea69de610c4f37c545bd22a253b9846434f83397f1bbc2aaa399faR239),
> introducing an incompatibility which is not handled by Flink 1.19. In my 
> opinion, the job manager should not crash when starting in that case. 
>Reporter: yazgoo
>Assignee: Matthias Pohl
>Priority: Major
>
> When trying to upgrade a flink cluster from 1.18 to 1.19, with a 1.18 job in 
> zookeeper HA state, I have a jobmanager crash with a ClassCastException, see 
> log below  
>  
> {code:java}
> 2024-06-18 16:58:14,401 ERROR 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error 
> occurred in the cluster entrypoint. org.apache.flink.util.FlinkException: 
> JobMaster for job 5f0898c964a93a47aa480427f3e2c6c0 failed.     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:1484)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:775)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:738)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$7(Dispatcher.java:693)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934) 
> ~[?:?]     at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
>  ~[?:?]     at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)
>  ~[?:?]     at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:451)
>  ~[flink-rpc-akka84eb9e64-a1ce-450c-ad53-d9fa579b67e1.jar:1.19.0]     at 
> org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>  ~[flink-dist-1.19.0.jar:1.19.0]     

[jira] [Comment Edited] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager

2024-06-18 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855979#comment-17855979
 ] 

Matthias Pohl edited comment on FLINK-35639 at 6/18/24 3:42 PM:


I guess you're right. Thanks for reporting this. It looks like an error was made when deprecating the {{@PublicEvolving}} API of {{{}RestartStrategies#FixedDelayRestartStrategyConfiguration{}}}. I will raise the priority of this issue to Blocker because it also affects 1.20.


was (Author: mapohl):
I guess, you're right. Looks like there was an error being made when 
deprecating the {{@PublicEvolving}} API of 
{{{}RestartStrategies#FixedDelayRestartStrategyConfiguration{}}}. I will raise 
the priority for this one to blocker because it also affects 1.20.

> upgrade to 1.19 with job in HA state with restart strategy crashes job manager
> --
>
> Key: FLINK-35639
> URL: https://issues.apache.org/jira/browse/FLINK-35639
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.20.0, 1.19.1
> Environment: Download the 1.18 and 1.19 binary releases. Add the 
> following to flink-1.19.0/conf/config.yaml and 
> flink-1.18.1/conf/flink-conf.yaml:
> {code:yaml}
> high-availability: zookeeper
> high-availability.zookeeper.quorum: localhost
> high-availability.storageDir: file:///tmp/flink/recovery
> {code}
> Launch ZooKeeper: docker run --network host zookeeper:latest
> Launch the 1.18 task manager: ./flink-1.18.1/bin/taskmanager.sh start-foreground
> Launch the 1.18 job manager: ./flink-1.18.1/bin/jobmanager.sh start-foreground
> Launch the following job:
> {code:java}
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.api.common.functions.FlatMapFunction;
> import org.apache.flink.util.Collector;
> import org.apache.flink.api.common.restartstrategy.RestartStrategies;
> import org.apache.flink.api.common.time.Time;
> import java.util.concurrent.TimeUnit;
>
> public class FlinkJob {
>     public static void main(String[] args) throws Exception {
>         final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>         env.setRestartStrategy(
>                 RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.of(20, TimeUnit.SECONDS)));
>         env.fromElements("Hello World", "Hello Flink")
>                 .flatMap(new LineSplitter())
>                 .groupBy(0)
>                 .sum(1)
>                 .print();
>     }
>
>     public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
>         @Override
>         public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
>             for (String word : value.split(" ")) {
>                 try {
>                     Thread.sleep(12);
>                 } catch (InterruptedException e) {
>                     e.printStackTrace();
>                 }
>                 out.collect(new Tuple2<>(word, 1));
>             }
>         }
>     }
> }
> {code}
> The job is built with a pom.xml declaring flink-java and flink-streaming-java 
> dependencies (flink.version 1.18.1, Java 1.8), maven-compiler-plugin 3.8.1, and 
> maven-jar-plugin 3.1.0 (main class FlinkJob, classpath prefix lib/).
> Launch the job: ./flink-1.18.1/bin/flink run 
> ../flink-job/target/myflinkjob-1.0-SNAPSHOT.jar
> Job has been submitted with JobID 5f0898c964a93a47aa480427f3e2c6c0
> Kill the job manager and task manager, then launch the 1.19.0 job manager: 
> ./flink-1.19.0/bin/jobmanager.sh start-foreground
> Root cause
> ==========
> It looks like the type of delayBetweenAttemptsInterval was changed in 1.19 
> (https://github.com/apache/flink/pull/22984/files#diff-d174f32ffdea69de610c4f37c545bd22a253b9846434f83397f1bbc2aaa399faR239),
> introducing an incompatibility which is not handled by Flink 1.19. In my 
> opinion, the job manager should not crash when starting in that case. 
>Reporter: yazgoo
>Assignee: Matthias Pohl
>Priority: Blocker
>
> When trying to upgrade a flink cluster from 1.18 to 1.19, with a 1.18 job in 
> zookeeper HA state, I have a jobmanager crash with a ClassCastException, see 
> log below  
>  
> {code:java}
> 2024-06-18 16:58:14,401 ERROR 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error 
> occurred in the cluster entrypoint. org.apache.flink.util.FlinkException: 
> JobMaster for job 5f0898c964a93a47aa480427f3e2c6c0 failed.     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:1484)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:775)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:738)
>  ~[flink-dist-1.19.0.jar:1.19.0]     at 
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$7(Dispatcher.java:693)
>  
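The root cause described above — a serialized field's type changing between releases while old bytes are still sitting in HA state — surfaces on recovery as a cast failure. A minimal, self-contained analogy using plain Java serialization (Integer and Duration are placeholder types chosen for illustration, not the actual Flink classes involved):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.time.Duration;

public class StateTypeChangeDemo {
    // Serialize and deserialize a value, as HA recovery does with stored job state.
    static Object roundTrip(Object value) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // The "old" version wrote the delay with one type (Integer stands in here)...
        Object restored = roundTrip(Integer.valueOf(20));
        try {
            // ...while the "new" version expects another (Duration stands in here).
            Duration delay = (Duration) restored;
            System.out.println("delay = " + delay);
        } catch (ClassCastException e) {
            // Deserialization itself succeeds; the crash happens at the cast.
            System.out.println("ClassCastException: " + e.getMessage());
        }
    }
}
```

The point of the sketch: Java deserialization faithfully restores the old object, so the failure only appears when the new code casts it to the new type — matching the ClassCastException in the reported log.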

[jira] [Updated] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager

2024-06-18 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-35639:
--
Priority: Blocker  (was: Major)


[jira] [Updated] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager

2024-06-18 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-35639:
--
Affects Version/s: 1.20.0


[jira] [Commented] (FLINK-35639) upgrade to 1.19 with job in HA state with restart strategy crashes job manager

2024-06-18 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855979#comment-17855979
 ] 

Matthias Pohl commented on FLINK-35639:
---

I guess you're right. It looks like an error was made when deprecating the {{@PublicEvolving}} API of {{{}RestartStrategies#FixedDelayRestartStrategyConfiguration{}}}. I will raise the priority of this issue to Blocker because it also affects 1.20.
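The incompatibility stems from the programmatic {{RestartStrategies}} configuration being serialized into the recovered job state. A possible workaround (untested here, and only applicable if the job does not set the strategy in code) is to configure the restart strategy through the cluster configuration instead, using the documented keys; the values below mirror the reporter's job:

```yaml
restart-strategy.type: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647
restart-strategy.fixed-delay.delay: 20 s
```

Config-based strategies are resolved by the running cluster rather than stored with the job, which may sidestep the cross-version deserialization problem.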


[jira] [Comment Edited] (FLINK-35601) InitOutputPathTest.testErrorOccursUnSynchronized failed due to NoSuchFieldException

2024-06-16 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855405#comment-17855405
 ] 

Matthias Pohl edited comment on FLINK-35601 at 6/16/24 1:29 PM:


This seems to be caused by the recently added [PR 
#24881|https://github.com/apache/flink/pull/24881], which is connected to 
FLINK-25537. I haven't been able to reproduce the error of this Jira issue (I 
haven't tried it with JDK 17), but running the test repeatedly eventually causes 
it to time out (tried 3x; the test ran into a deadlock (?) after 512, 618, and 
3724 repetitions). It might be worth reverting the changes of PR #24881.


was (Author: mapohl):
This seems to be caused by recently added [PR 
#24481|https://github.com/apache/flink/pull/24881] that is connected to 
FLINK-25537. I haven't been able to reproduce the error of this Jira issue but 
running the test repeatedly causes the test to eventually timeout (tried 3x 
where the test ran into a deadlock (?) after 512, 618 and 3724 repetitions). It 
might be worth reverting the changes of PR #24481

> InitOutputPathTest.testErrorOccursUnSynchronized failed due to 
> NoSuchFieldException
> ---
>
> Key: FLINK-35601
> URL: https://issues.apache.org/jira/browse/FLINK-35601
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Major
>
> {code:java}
> Jun 14 02:17:56 02:17:56.037 [ERROR] 
> org.apache.flink.core.fs.InitOutputPathTest.testErrorOccursUnSynchronized -- 
> Time elapsed: 0.021 s <<< ERROR!
> Jun 14 02:17:56 java.lang.NoSuchFieldException: modifiers
> Jun 14 02:17:56   at 
> java.base/java.lang.Class.getDeclaredField(Class.java:2610)
> Jun 14 02:17:56   at 
> org.apache.flink.core.fs.InitOutputPathTest.testErrorOccursUnSynchronized(InitOutputPathTest.java:59)
> Jun 14 02:17:56   at 
> java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Jun 14 02:17:56   at 
> java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
> Jun 14 02:17:56   at 
> java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
> Jun 14 02:17:56   at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
> Jun 14 02:17:56   at 
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
> Jun 14 02:17:56   at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
> Jun 14 02:17:56   at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60259=logs=675bf62c-8558-587e-2555-dcad13acefb5=5878eed3-cc1e-5b12-1ed0-9e7139ce0992=6491



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
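The {{java.lang.NoSuchFieldException: modifiers}} in the report is characteristic of newer JDKs: since JDK 12, reflective access to internal fields of {{java.lang.reflect.Field}} (including {{modifiers}}) is filtered out, so the reflection trick used by the test fails. A small editorial probe (not Flink code) that shows the version-dependent behavior:

```java
import java.lang.reflect.Field;

public class ModifiersFieldProbe {
    /**
     * Returns "present" if Field.modifiers is reflectively visible (JDK 8-11),
     * or "filtered" if the JDK hides it (JDK 12+), which is what the failing
     * test runs into.
     */
    static String probeModifiersField() {
        try {
            Field f = Field.class.getDeclaredField("modifiers");
            return "present";
        } catch (NoSuchFieldException e) {
            return "filtered";
        }
    }

    public static void main(String[] args) {
        System.out.println("Field.modifiers is " + probeModifiersField());
    }
}
```

On a JDK where the field is filtered, tests relying on mutating {{Field.modifiers}} need a different approach (or the change should be reverted, as suggested above).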


[jira] [Commented] (FLINK-35601) InitOutputPathTest.testErrorOccursUnSynchronized failed due to NoSuchFieldException

2024-06-16 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855406#comment-17855406
 ] 

Matthias Pohl commented on FLINK-35601:
---

[~gongzhongqiang] can you have a look?




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35601) InitOutputPathTest.testErrorOccursUnSynchronized failed due to NoSuchFieldException

2024-06-16 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855405#comment-17855405
 ] 

Matthias Pohl commented on FLINK-35601:
---

This seems to be caused by the recently added [PR 
#24881|https://github.com/apache/flink/pull/24881], which is connected to 
FLINK-25537. I haven't been able to reproduce the error of this Jira issue, but 
running the test repeatedly eventually causes it to time out (tried 3x; the test 
ran into a deadlock (?) after 512, 618, and 3724 repetitions). It might be worth 
reverting the changes of PR #24881.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost

2024-06-14 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854944#comment-17854944
 ] 

Matthias Pohl commented on FLINK-35042:
---

I noticed that the build failure in the description is unrelated to FLINK-34150 
because it appeared on April 8, 2024, whereas FLINK-34150 was only merged on May 
10, 2024.

But the build failure I shared might be related. So it could be that these are 
actually two different issues.

> Streaming File Sink s3 end-to-end test failed as TM lost
> 
>
> Key: FLINK-35042
> URL: https://issues.apache.org/jira/browse/FLINK-35042
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Major
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58782=logs=fb37c667-81b7-5c22-dd91-846535e99a97=011e961e-597c-5c96-04fe-7941c8b83f23=14344
> FAIL 'Streaming File Sink s3 end-to-end test' failed after 15 minutes and 20 
> seconds! Test exited with exit code 1
> I have checked the JM log; it seems that a TaskManager is no longer reachable:
> {code:java}
> 2024-04-08T01:12:04.3922210Z Apr 08 01:12:04 2024-04-08 00:58:15,517 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Sink: 
> Unnamed (4/4) 
> (14b44f534745ffb2f1ef03fca34f7f0d_0a448493b4782967b150582570326227_3_0) 
> switched from RUNNING to FAILED on localhost:44987-47f5af @ localhost 
> (dataPort=34489).
> 2024-04-08T01:12:04.3924522Z Apr 08 01:12:04 
> org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id 
> localhost:44987-47f5af is no longer reachable.
> 2024-04-08T01:12:04.3925421Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1511)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926185Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.DefaultHeartbeatMonitor.reportHeartbeatRpcFailure(DefaultHeartbeatMonitor.java:126)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926925Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.runIfHeartbeatMonitorExists(HeartbeatManagerImpl.java:275)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3929898Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.reportHeartbeatTargetUnreachable(HeartbeatManagerImpl.java:267)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3930692Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.handleHeartbeatRpcFailure(HeartbeatManagerImpl.java:262)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3931442Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.lambda$handleHeartbeatRpc$0(HeartbeatManagerImpl.java:248)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3931917Z Apr 08 01:12:04  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>  ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3934759Z Apr 08 01:12:04  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>  ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3935252Z Apr 08 01:12:04  at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>  ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3935989Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:460)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3936731Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3938103Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:460)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3942549Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:225)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3945371Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor.handleRpcMessage(FencedPekkoRpcActor.java:88)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3946244Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:174)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3946960Z Apr 08 01:12:04 
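The trace above shows the heartbeat RPC's completion callback marking the target unreachable when the RPC fails ({{reportHeartbeatRpcFailure}} invoked from a {{CompletableFuture}} {{whenComplete}} stage). A generic sketch of that pattern with hypothetical names — not the actual Flink implementation:

```java
import java.util.concurrent.CompletableFuture;

public class HeartbeatRpcSketch {
    static String lastEvent = "none";

    // Mimics the pattern in the trace: when the heartbeat RPC future fails,
    // the target is reported unreachable instead of the failure propagating.
    static void handleHeartbeatRpc(CompletableFuture<Void> rpcFuture, String targetId) {
        rpcFuture.whenComplete((ok, failure) -> {
            if (failure != null) {
                lastEvent = "unreachable:" + targetId;  // triggers failover for the lost TM
            } else {
                lastEvent = "alive:" + targetId;
            }
        });
    }

    public static void main(String[] args) {
        // Simulate an RPC that fails because the TaskManager process died.
        CompletableFuture<Void> failed = new CompletableFuture<>();
        failed.completeExceptionally(new RuntimeException("Recipient unreachable"));
        handleHeartbeatRpc(failed, "localhost:44987-47f5af");
        System.out.println(lastEvent); // unreachable:localhost:44987-47f5af
    }
}
```

This is why a crashed or wedged TaskManager surfaces in the JobManager log as a heartbeat-driven {{JobMasterException}} rather than a direct connection error.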

[jira] [Comment Edited] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost

2024-06-14 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854944#comment-17854944
 ] 

Matthias Pohl edited comment on FLINK-35042 at 6/14/24 6:37 AM:


I noticed that the build failure in the description is unrelated to FLINK-34150 
because it appeared on April 8, 2024, whereas FLINK-34150 was only merged on May 
10, 2024.

But the build failure I shared might be related. So it could be that these are 
actually two different issues.


was (Author: mapohl):
I noticed that the build failure in the description is unrelated to FLINK-34150 
because it appeared on April 8, 2024 whereas FLINK-24150 only was merged on May 
10, 2024.

But the build failure I shared might be related. So, it could be that these two 
are actually two different issues.

> Streaming File Sink s3 end-to-end test failed as TM lost
> 
>
> Key: FLINK-35042
> URL: https://issues.apache.org/jira/browse/FLINK-35042
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Major
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58782=logs=fb37c667-81b7-5c22-dd91-846535e99a97=011e961e-597c-5c96-04fe-7941c8b83f23=14344
> FAIL 'Streaming File Sink s3 end-to-end test' failed after 15 minutes and 20 
> seconds! Test exited with exit code 1
> I have checked the JM log, it seems that a taskmanager is no longer reachable:
> {code:java}
> 2024-04-08T01:12:04.3922210Z Apr 08 01:12:04 2024-04-08 00:58:15,517 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Sink: 
> Unnamed (4/4) 
> (14b44f534745ffb2f1ef03fca34f7f0d_0a448493b4782967b150582570326227_3_0) 
> switched from RUNNING to FAILED on localhost:44987-47f5af @ localhost 
> (dataPort=34489).
> 2024-04-08T01:12:04.3924522Z Apr 08 01:12:04 
> org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id 
> localhost:44987-47f5af is no longer reachable.
> 2024-04-08T01:12:04.3925421Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1511)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926185Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.DefaultHeartbeatMonitor.reportHeartbeatRpcFailure(DefaultHeartbeatMonitor.java:126)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926925Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.runIfHeartbeatMonitorExists(HeartbeatManagerImpl.java:275)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3929898Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.reportHeartbeatTargetUnreachable(HeartbeatManagerImpl.java:267)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3930692Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.handleHeartbeatRpcFailure(HeartbeatManagerImpl.java:262)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3931442Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.lambda$handleHeartbeatRpc$0(HeartbeatManagerImpl.java:248)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3931917Z Apr 08 01:12:04  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>  ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3934759Z Apr 08 01:12:04  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>  ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3935252Z Apr 08 01:12:04  at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>  ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3935989Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:460)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3936731Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3938103Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:460)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3942549Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:225)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3945371Z Apr 08 01:12:04  at 
> 

[jira] [Commented] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost

2024-06-13 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854800#comment-17854800
 ] 

Matthias Pohl commented on FLINK-35042:
---

This is different from what [~Weijie Guo] observed in his build, where the job 
never observed the expected second TM restart:
{code:java}
Apr 08 00:57:27 Submitting job.
Apr 08 00:57:39 Job (d0bec02a7136e671f764bba2938933db) is not yet running.
Apr 08 00:57:49 Job (d0bec02a7136e671f764bba2938933db) is running.
Apr 08 00:57:49 Waiting for job (d0bec02a7136e671f764bba2938933db) to have at 
least 3 completed checkpoints ...
Apr 08 00:58:03 Killing TM
Apr 08 00:58:04 TaskManager 138601 killed.
Apr 08 00:58:04 Starting TM
Apr 08 00:58:06 [INFO] 3 instance(s) of taskexecutor are already running on 
fv-az68-869.
Apr 08 00:58:06 Starting taskexecutor daemon on host fv-az68-869.
Apr 08 00:58:06 Waiting for restart to happen
Apr 08 00:58:06 Still waiting for restarts. Expected: 1 Current: 0
Apr 08 00:58:11 Still waiting for restarts. Expected: 1 Current: 0
Apr 08 00:58:16 Still waiting for restarts. Expected: 1 Current: 0
Apr 08 00:58:21 Killing 2 TMs
Apr 08 00:58:21 TaskManager 141400 killed.
Apr 08 00:58:21 TaskManager 139144 killed.
Apr 08 00:58:21 Starting 2 TMs
Apr 08 00:58:24 [INFO] 2 instance(s) of taskexecutor are already running on 
fv-az68-869.
Apr 08 00:58:24 Starting taskexecutor daemon on host fv-az68-869.
Apr 08 00:58:29 [INFO] 3 instance(s) of taskexecutor are already running on 
fv-az68-869.
Apr 08 00:58:29 Starting taskexecutor daemon on host fv-az68-869.
Apr 08 00:58:29 Waiting for restart to happen
Apr 08 00:58:29 Still waiting for restarts. Expected: 2 Current: 1
Apr 08 00:58:34 Still waiting for restarts. Expected: 2 Current: 1
Apr 08 00:58:39 Still waiting for restarts. Expected: 2 Current: 1
[...]
Apr 08 01:11:56 Still waiting for restarts. Expected: 2 Current: 1
Apr 08 01:12:01 Still waiting for restarts. Expected: 2 Current: 1
Apr 08 01:12:04 Test (pid: 136749) did not finish after 900 seconds.{code}

> Streaming File Sink s3 end-to-end test failed as TM lost
> 
>
> Key: FLINK-35042
> URL: https://issues.apache.org/jira/browse/FLINK-35042
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Major
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58782=logs=fb37c667-81b7-5c22-dd91-846535e99a97=011e961e-597c-5c96-04fe-7941c8b83f23=14344
> FAIL 'Streaming File Sink s3 end-to-end test' failed after 15 minutes and 20 
> seconds! Test exited with exit code 1
> I have checked the JM log, it seems that a taskmanager is no longer reachable:
> {code:java}
> 2024-04-08T01:12:04.3922210Z Apr 08 01:12:04 2024-04-08 00:58:15,517 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Sink: 
> Unnamed (4/4) 
> (14b44f534745ffb2f1ef03fca34f7f0d_0a448493b4782967b150582570326227_3_0) 
> switched from RUNNING to FAILED on localhost:44987-47f5af @ localhost 
> (dataPort=34489).
> 2024-04-08T01:12:04.3924522Z Apr 08 01:12:04 
> org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id 
> localhost:44987-47f5af is no longer reachable.
> 2024-04-08T01:12:04.3925421Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1511)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926185Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.DefaultHeartbeatMonitor.reportHeartbeatRpcFailure(DefaultHeartbeatMonitor.java:126)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926925Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.runIfHeartbeatMonitorExists(HeartbeatManagerImpl.java:275)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3929898Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.reportHeartbeatTargetUnreachable(HeartbeatManagerImpl.java:267)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3930692Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.handleHeartbeatRpcFailure(HeartbeatManagerImpl.java:262)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3931442Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.lambda$handleHeartbeatRpc$0(HeartbeatManagerImpl.java:248)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3931917Z Apr 08 01:12:04  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>  ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3934759Z Apr 08 01:12:04  at 
> 

[jira] [Commented] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost

2024-06-13 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854801#comment-17854801
 ] 

Matthias Pohl commented on FLINK-35042:
---

I'm linking FLINK-34150 because we refactored the test to rely on MinIO rather 
than the AWS S3 backend.

> Streaming File Sink s3 end-to-end test failed as TM lost
> 
>
> Key: FLINK-35042
> URL: https://issues.apache.org/jira/browse/FLINK-35042
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Major
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58782=logs=fb37c667-81b7-5c22-dd91-846535e99a97=011e961e-597c-5c96-04fe-7941c8b83f23=14344
> FAIL 'Streaming File Sink s3 end-to-end test' failed after 15 minutes and 20 
> seconds! Test exited with exit code 1
> I have checked the JM log, it seems that a taskmanager is no longer reachable:
> {code:java}
> 2024-04-08T01:12:04.3922210Z Apr 08 01:12:04 2024-04-08 00:58:15,517 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Sink: 
> Unnamed (4/4) 
> (14b44f534745ffb2f1ef03fca34f7f0d_0a448493b4782967b150582570326227_3_0) 
> switched from RUNNING to FAILED on localhost:44987-47f5af @ localhost 
> (dataPort=34489).
> 2024-04-08T01:12:04.3924522Z Apr 08 01:12:04 
> org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id 
> localhost:44987-47f5af is no longer reachable.
> 2024-04-08T01:12:04.3925421Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1511)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926185Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.DefaultHeartbeatMonitor.reportHeartbeatRpcFailure(DefaultHeartbeatMonitor.java:126)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926925Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.runIfHeartbeatMonitorExists(HeartbeatManagerImpl.java:275)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3929898Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.reportHeartbeatTargetUnreachable(HeartbeatManagerImpl.java:267)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3930692Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.handleHeartbeatRpcFailure(HeartbeatManagerImpl.java:262)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3931442Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.lambda$handleHeartbeatRpc$0(HeartbeatManagerImpl.java:248)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3931917Z Apr 08 01:12:04  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>  ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3934759Z Apr 08 01:12:04  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>  ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3935252Z Apr 08 01:12:04  at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>  ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3935989Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:460)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3936731Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3938103Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:460)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3942549Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:225)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3945371Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor.handleRpcMessage(FencedPekkoRpcActor.java:88)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3946244Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:174)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3946960Z Apr 08 01:12:04  at 
> org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) 
> [flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3947664Z 

[jira] [Comment Edited] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost

2024-06-13 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854789#comment-17854789
 ] 

Matthias Pohl edited comment on FLINK-35042 at 6/13/24 4:16 PM:


[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60237=logs=ef799394-2d67-5ff4-b2e5-410b80c9c0af=9e5768bc-daae-5f5f-1861-e58617922c7a=9817]

For that one, it looks like the test never reached the expected number of processed values:
{code:java}
Jun 13 13:04:25 Waiting for Dispatcher REST endpoint to come up...
Jun 13 13:04:26 Dispatcher REST endpoint is up.
Jun 13 13:04:28 [INFO] 1 instance(s) of taskexecutor are already running on 
fv-az209-180.
Jun 13 13:04:28 Starting taskexecutor daemon on host fv-az209-180.
Jun 13 13:04:32 [INFO] 2 instance(s) of taskexecutor are already running on 
fv-az209-180.
Jun 13 13:04:32 Starting taskexecutor daemon on host fv-az209-180.
Jun 13 13:04:37 [INFO] 3 instance(s) of taskexecutor are already running on 
fv-az209-180.
Jun 13 13:04:37 Starting taskexecutor daemon on host fv-az209-180.
Jun 13 13:04:37 Submitting job.
Jun 13 13:04:57 Job (be9bc06a08a4c0fc3bf2c9e1c92219d4) is running.
Jun 13 13:04:57 Waiting for job (be9bc06a08a4c0fc3bf2c9e1c92219d4) to have at 
least 3 completed checkpoints ...
Jun 13 13:05:06 Killing TM
Jun 13 13:05:06 TaskManager 122377 killed.
Jun 13 13:05:06 Starting TM
Jun 13 13:05:08 [INFO] 3 instance(s) of taskexecutor are already running on 
fv-az209-180.
Jun 13 13:05:08 Starting taskexecutor daemon on host fv-az209-180.
Jun 13 13:05:08 Waiting for restart to happen
Jun 13 13:05:08 Still waiting for restarts. Expected: 1 Current: 0
Jun 13 13:05:13 Still waiting for restarts. Expected: 1 Current: 0
Jun 13 13:05:18 Still waiting for restarts. Expected: 1 Current: 0
Jun 13 13:05:23 Killing 2 TMs
Jun 13 13:05:24 TaskManager 121771 killed.
Jun 13 13:05:24 TaskManager 122908 killed.
Jun 13 13:05:24 Starting 2 TMs
Jun 13 13:05:26 [INFO] 2 instance(s) of taskexecutor are already running on 
fv-az209-180.
Jun 13 13:05:26 Starting taskexecutor daemon on host fv-az209-180.
Jun 13 13:05:31 [INFO] 3 instance(s) of taskexecutor are already running on 
fv-az209-180.
Jun 13 13:05:31 Starting taskexecutor daemon on host fv-az209-180.
Jun 13 13:05:31 Waiting for restart to happen
Jun 13 13:05:31 Still waiting for restarts. Expected: 2 Current: 1
Jun 13 13:05:36 Still waiting for restarts. Expected: 2 Current: 1
Jun 13 13:05:41 Waiting until all values have been produced
Jun 13 13:05:43 Number of produced values 0/6
[...] {code}


was (Author: mapohl):
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60237=logs=ef799394-2d67-5ff4-b2e5-410b80c9c0af=9e5768bc-daae-5f5f-1861-e58617922c7a=9817

> Streaming File Sink s3 end-to-end test failed as TM lost
> 
>
> Key: FLINK-35042
> URL: https://issues.apache.org/jira/browse/FLINK-35042
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Major
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58782=logs=fb37c667-81b7-5c22-dd91-846535e99a97=011e961e-597c-5c96-04fe-7941c8b83f23=14344
> FAIL 'Streaming File Sink s3 end-to-end test' failed after 15 minutes and 20 
> seconds! Test exited with exit code 1
> I have checked the JM log, it seems that a taskmanager is no longer reachable:
> {code:java}
> 2024-04-08T01:12:04.3922210Z Apr 08 01:12:04 2024-04-08 00:58:15,517 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Sink: 
> Unnamed (4/4) 
> (14b44f534745ffb2f1ef03fca34f7f0d_0a448493b4782967b150582570326227_3_0) 
> switched from RUNNING to FAILED on localhost:44987-47f5af @ localhost 
> (dataPort=34489).
> 2024-04-08T01:12:04.3924522Z Apr 08 01:12:04 
> org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id 
> localhost:44987-47f5af is no longer reachable.
> 2024-04-08T01:12:04.3925421Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1511)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926185Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.DefaultHeartbeatMonitor.reportHeartbeatRpcFailure(DefaultHeartbeatMonitor.java:126)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926925Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.runIfHeartbeatMonitorExists(HeartbeatManagerImpl.java:275)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3929898Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.reportHeartbeatTargetUnreachable(HeartbeatManagerImpl.java:267)
>  

[jira] [Commented] (FLINK-35042) Streaming File Sink s3 end-to-end test failed as TM lost

2024-06-13 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854789#comment-17854789
 ] 

Matthias Pohl commented on FLINK-35042:
---

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60237=logs=ef799394-2d67-5ff4-b2e5-410b80c9c0af=9e5768bc-daae-5f5f-1861-e58617922c7a=9817

> Streaming File Sink s3 end-to-end test failed as TM lost
> 
>
> Key: FLINK-35042
> URL: https://issues.apache.org/jira/browse/FLINK-35042
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Major
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58782=logs=fb37c667-81b7-5c22-dd91-846535e99a97=011e961e-597c-5c96-04fe-7941c8b83f23=14344
> FAIL 'Streaming File Sink s3 end-to-end test' failed after 15 minutes and 20 
> seconds! Test exited with exit code 1
> I have checked the JM log, it seems that a taskmanager is no longer reachable:
> {code:java}
> 2024-04-08T01:12:04.3922210Z Apr 08 01:12:04 2024-04-08 00:58:15,517 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Sink: 
> Unnamed (4/4) 
> (14b44f534745ffb2f1ef03fca34f7f0d_0a448493b4782967b150582570326227_3_0) 
> switched from RUNNING to FAILED on localhost:44987-47f5af @ localhost 
> (dataPort=34489).
> 2024-04-08T01:12:04.3924522Z Apr 08 01:12:04 
> org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id 
> localhost:44987-47f5af is no longer reachable.
> 2024-04-08T01:12:04.3925421Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1511)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926185Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.DefaultHeartbeatMonitor.reportHeartbeatRpcFailure(DefaultHeartbeatMonitor.java:126)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3926925Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.runIfHeartbeatMonitorExists(HeartbeatManagerImpl.java:275)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3929898Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.reportHeartbeatTargetUnreachable(HeartbeatManagerImpl.java:267)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3930692Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.handleHeartbeatRpcFailure(HeartbeatManagerImpl.java:262)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3931442Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.lambda$handleHeartbeatRpc$0(HeartbeatManagerImpl.java:248)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3931917Z Apr 08 01:12:04  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>  ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3934759Z Apr 08 01:12:04  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>  ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3935252Z Apr 08 01:12:04  at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>  ~[?:1.8.0_402]
> 2024-04-08T01:12:04.3935989Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:460)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3936731Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>  ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3938103Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:460)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3942549Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:225)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3945371Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor.handleRpcMessage(FencedPekkoRpcActor.java:88)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3946244Z Apr 08 01:12:04  at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:174)
>  ~[flink-rpc-akka9681a48a-ca1a-45b0-bb71-4bdb5d2aed93.jar:1.20-SNAPSHOT]
> 2024-04-08T01:12:04.3946960Z Apr 08 01:12:04  at 
> org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) 
> 

[jira] [Commented] (FLINK-35549) FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing for the AdaptiveScheduler

2024-06-07 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853218#comment-17853218
 ] 

Matthias Pohl commented on FLINK-35549:
---

cc [~fanrui] I added the code change for FLIP-461. ...if you want to have a 
look.

> FLIP-461: Synchronize rescaling with checkpoint creation to minimize 
> reprocessing for the AdaptiveScheduler
> ---
>
> Key: FLINK-35549
> URL: https://issues.apache.org/jira/browse/FLINK-35549
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>
> This is the umbrella issue for implementing 
> [FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-35550) Introduce new component RescaleManager

2024-06-07 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-35550:
-

Assignee: Matthias Pohl

> Introduce new component RescaleManager
> --
>
> Key: FLINK-35550
> URL: https://issues.apache.org/jira/browse/FLINK-35550
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>
> The goal here is to collect the rescaling logic in a single component to 
> improve testability.





[jira] [Assigned] (FLINK-35549) FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing for the AdaptiveScheduler

2024-06-07 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-35549:
-

Assignee: Matthias Pohl

> FLIP-461: Synchronize rescaling with checkpoint creation to minimize 
> reprocessing for the AdaptiveScheduler
> ---
>
> Key: FLINK-35549
> URL: https://issues.apache.org/jira/browse/FLINK-35549
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>
> This is the umbrella issue for implementing 
> [FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]





[jira] [Updated] (FLINK-35553) Integrate newly added trigger interface with checkpointing

2024-06-07 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-35553:
--
Description: This connects the newly introduced trigger logic (FLINK-35551) 
with the {{CheckpointStatsTracker}}  (was: This connects the newly introduced 
trigger logic (FLINK-35551) with the newly added checkpoint lifecycle listening 
feature (FLINK-35552).)

> Integrate newly added trigger interface with checkpointing
> --
>
> Key: FLINK-35553
> URL: https://issues.apache.org/jira/browse/FLINK-35553
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Reporter: Matthias Pohl
>Priority: Major
>
> This connects the newly introduced trigger logic (FLINK-35551) with the 
> {{CheckpointStatsTracker}}





[jira] [Created] (FLINK-35553) Integrate newly added trigger interface with checkpointing

2024-06-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-35553:
-

 Summary: Integrate newly added trigger interface with checkpointing
 Key: FLINK-35553
 URL: https://issues.apache.org/jira/browse/FLINK-35553
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing, Runtime / Coordination
Reporter: Matthias Pohl


This connects the newly introduced trigger logic (FLINK-35551) with the newly 
added checkpoint lifecycle listening feature (FLINK-35552).





[jira] [Created] (FLINK-35552) Move CheckpointStatsTracker out of ExecutionGraph into Scheduler

2024-06-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-35552:
-

 Summary: Move CheckpointStatsTracker out of ExecutionGraph into 
Scheduler
 Key: FLINK-35552
 URL: https://issues.apache.org/jira/browse/FLINK-35552
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing, Runtime / Coordination
Reporter: Matthias Pohl


The scheduler needs to know about the CheckpointStatsTracker to allow listening 
to checkpoint failures and completion.
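The kind of listener wiring this would enable can be sketched as follows. This is a minimal illustration only: all names here are hypothetical and do not match Flink's actual {{CheckpointStatsTracker}} API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these names are illustrative, not Flink's actual API.
interface CheckpointStatsListener {
    void onCompletedCheckpoint(long checkpointId);

    void onFailedCheckpoint(long checkpointId);
}

// Stand-in for a stats tracker owned directly by the scheduler, so the
// scheduler can register listeners for checkpoint completion and failure.
class CheckpointStatsTrackerSketch {
    private final List<CheckpointStatsListener> listeners = new ArrayList<>();

    void register(CheckpointStatsListener listener) {
        listeners.add(listener);
    }

    void reportCompletedCheckpoint(long checkpointId) {
        // Fan the completion event out to all registered listeners.
        for (CheckpointStatsListener l : listeners) {
            l.onCompletedCheckpoint(checkpointId);
        }
    }

    void reportFailedCheckpoint(long checkpointId) {
        // Fan the failure event out to all registered listeners.
        for (CheckpointStatsListener l : listeners) {
            l.onFailedCheckpoint(checkpointId);
        }
    }
}
```

With the tracker living in the scheduler rather than the {{ExecutionGraph}}, the scheduler itself could be one of these listeners.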





[jira] [Created] (FLINK-35551) Introduces RescaleManager#onTrigger endpoint

2024-06-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-35551:
-

 Summary: Introduces RescaleManager#onTrigger endpoint
 Key: FLINK-35551
 URL: https://issues.apache.org/jira/browse/FLINK-35551
 Project: Flink
  Issue Type: Sub-task
Reporter: Matthias Pohl


The new endpoint would allow us to separate observing change events from 
actually triggering the rescale operation.
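The intended separation can be sketched like this. Note this is a rough illustration under assumed semantics; the names and the actual FLIP-461 interfaces may differ.

```java
// Hypothetical sketch: names are illustrative, not the actual FLIP-461 API.
// Observing a change only records it; the rescale itself happens when an
// external trigger (e.g. a completed checkpoint) calls onTrigger().
class RescaleManagerSketch {

    private boolean changeObserved = false;
    private int rescaleCount = 0;

    // Called when a change event is observed (e.g. new slots became available).
    void onChange() {
        changeObserved = true;
    }

    // Called by the trigger source; rescales only if a change was observed.
    void onTrigger() {
        if (changeObserved) {
            rescaleCount++;
            changeObserved = false;
        }
    }

    int getRescaleCount() {
        return rescaleCount;
    }
}
```

Decoupling the two calls lets the trigger source (such as checkpoint completion) decide *when* to act on previously observed changes.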





[jira] [Created] (FLINK-35550) Introduce new component RescaleManager

2024-06-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-35550:
-

 Summary: Introduce new component RescaleManager
 Key: FLINK-35550
 URL: https://issues.apache.org/jira/browse/FLINK-35550
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Matthias Pohl


The goal here is to collect the rescaling logic in a single component to 
improve testability.





[jira] [Updated] (FLINK-35551) Introduces RescaleManager#onTrigger endpoint

2024-06-07 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-35551:
--
Component/s: Runtime / Coordination

> Introduces RescaleManager#onTrigger endpoint
> 
>
> Key: FLINK-35551
> URL: https://issues.apache.org/jira/browse/FLINK-35551
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Matthias Pohl
>Priority: Major
>
> The new endpoint would allow us to separate observing change events from 
> actually triggering the rescale operation.





[jira] [Created] (FLINK-35549) FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing for the AdaptiveScheduler

2024-06-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-35549:
-

 Summary: FLIP-461: Synchronize rescaling with checkpoint creation 
to minimize reprocessing for the AdaptiveScheduler
 Key: FLINK-35549
 URL: https://issues.apache.org/jira/browse/FLINK-35549
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing, Runtime / Coordination
Affects Versions: 1.20.0
Reporter: Matthias Pohl


This is the umbrella issue for implementing 
[FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]





[jira] [Comment Edited] (FLINK-35035) Reduce job pause time when cluster resources are expanded in adaptive mode

2024-06-06 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852748#comment-17852748
 ] 

Matthias Pohl edited comment on FLINK-35035 at 6/6/24 11:47 AM:


Thanks for the pointer, [~dmvk]. We looked into this issue while working on 
[FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]
 (which is kind of related) and plan to do a follow-up FLIP that will align the 
resource controlling mechanism of the {{{}AdaptiveScheduler{}}}'s 
{{WaitingForResources}} and {{Executing}} states.

Currently, we have parameters intervening in the rescaling in different places 
([j.a.scaling-interval.min|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-min],
 
[j.a.scaling-interval.max|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-max]
 being utilized in {{Executing}} and 
[j.a.resource-stabilization-timeout|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout]
 being utilized in {{{}WaitingForResources){}}}. Having a 
{{resource-stabilization}} phase in {{Executing}} should resolve the problem 
described in this Jira issue here.


was (Author: mapohl):
Thanks for the pointer, [~dmvk]. We looked into this issue while working on 
[FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]
 (which is kind of related) and plan to do a follow-up FLIP that will align the 
resource controlling mechanism of the {{AdaptiveScheduler}}'s 
{{WaitingForResources}} and {{Executing}} states. 

Currently, we have parameters intervening in the rescaling in different places 
([j.a.scaling-interval.min|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-min],
 
[j.a.scaling-interval.max|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-max]
 being utilized in {{Executing}} and 
[j.a.resource-stabilization-timeout|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout)
 being utilized in {{WaitingForResources}}). Having a 
{{resource-stabilization}} phase in {{Executing}} should resolve the problem 
described in this Jira issue here.

> Reduce job pause time when cluster resources are expanded in adaptive mode
> --
>
> Key: FLINK-35035
> URL: https://issues.apache.org/jira/browse/FLINK-35035
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Task
>Affects Versions: 1.19.0
>Reporter: yuanfenghu
>Priority: Minor
>
> When 'jobmanager.scheduler = adaptive', job graph changes triggered by 
> cluster expansion cause long-term task stagnation. We should reduce this 
> impact.
> As an example:
> I have a job graph: [v1 (maxp=10, minp=1)] -> [v2 (maxp=10, minp=1)]
> When my cluster has 5 slots, the job will be executed as [v1 p5]->[v2 p5].
> When I add slots, the job graph changes are triggered by
> org.apache.flink.runtime.scheduler.adaptive.ResourceListener#onNewResourcesAvailable.
> However, the five new slots I added were not discovered at the same time (for 
> convenience, I assume that a taskmanager has one slot). No matter what 
> environment we add them in, we cannot guarantee that the new slots are added 
> all at once, so onNewResourcesAvailable triggers repeatedly.
> If there is an interval between each new slot being added, the job graph will 
> keep changing during this period. What I hope for is a configurable 
> stabilization time for cluster resources: job graph changes should only be 
> triggered after the number of cluster slots has been stable for a certain 
> period of time, to avoid this situation.
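The stabilization behavior requested above can be sketched as a small debounce helper. This is a hypothetical illustration (class and method names are made up, not Flink code):

```java
// Hypothetical sketch: restart a stabilization window on every resource
// arrival, and only report "stable" once no new resources showed up for the
// configured timeout. A scheduler could trigger job graph changes only then.
class StabilizationTracker {
    private final long stabilizationTimeoutMs;
    private long lastChangeTimestampMs;

    StabilizationTracker(long stabilizationTimeoutMs) {
        this.stabilizationTimeoutMs = stabilizationTimeoutMs;
    }

    /** Called whenever a new slot/resource shows up. */
    void onNewResourcesAvailable(long nowMs) {
        lastChangeTimestampMs = nowMs;  // restart the stabilization window
    }

    /** True once no new resources arrived for the full timeout. */
    boolean isStable(long nowMs) {
        return nowMs - lastChangeTimestampMs >= stabilizationTimeoutMs;
    }
}

public class StabilizationSketch {
    public static void main(String[] args) {
        StabilizationTracker tracker = new StabilizationTracker(1000);
        tracker.onNewResourcesAvailable(0);    // first slot arrives
        tracker.onNewResourcesAvailable(400);  // another slot; window restarts
        System.out.println(tracker.isStable(900));   // false: stable for only 500 ms
        System.out.println(tracker.isStable(1500));  // true: 1100 ms without changes
    }
}
```

Each arrival restarts the window, so a burst of slot registrations collapses into a single rescale decision once the cluster has settled.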





[jira] [Commented] (FLINK-35035) Reduce job pause time when cluster resources are expanded in adaptive mode

2024-06-06 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852748#comment-17852748
 ] 

Matthias Pohl commented on FLINK-35035:
---

Thanks for the pointer, [~dmvk]. We looked into this issue while working on 
[FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]
 (which is kind of related) and plan to do a follow-up FLIP that will align the 
resource controlling mechanism of the {{AdaptiveScheduler}}'s 
{{WaitingForResources}} and {{Executing}} states. 

Currently, we have parameters intervening in the rescaling in different places 
([j.a.scaling-interval.min|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-min],
 
[j.a.scaling-interval.max|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-max]
 being utilized in {{Executing}} and 
[j.a.resource-stabilization-timeout|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout)
 being utilized in {{WaitingForResources}}). Having a 
{{resource-stabilization}} phase in {{Executing}} should resolve the problem 
described in this Jira issue here.

> Reduce job pause time when cluster resources are expanded in adaptive mode
> --
>
> Key: FLINK-35035
> URL: https://issues.apache.org/jira/browse/FLINK-35035
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Task
>Affects Versions: 1.19.0
>Reporter: yuanfenghu
>Priority: Minor
>
> When 'jobmanager.scheduler = adaptive', job graph changes triggered by 
> cluster expansion cause long-term task stagnation. We should reduce this 
> impact.
> As an example:
> I have a job graph: [v1 (maxp=10, minp=1)] -> [v2 (maxp=10, minp=1)]
> When my cluster has 5 slots, the job will be executed as [v1 p5]->[v2 p5].
> When I add slots, the job graph changes are triggered by
> org.apache.flink.runtime.scheduler.adaptive.ResourceListener#onNewResourcesAvailable.
> However, the five new slots I added were not discovered at the same time (for 
> convenience, I assume that a taskmanager has one slot). No matter what 
> environment we add them in, we cannot guarantee that the new slots are added 
> all at once, so onNewResourcesAvailable triggers repeatedly.
> If there is an interval between each new slot being added, the job graph will 
> keep changing during this period. What I hope for is a configurable 
> stabilization time for cluster resources: job graph changes should only be 
> triggered after the number of cluster slots has been stable for a certain 
> period of time, to avoid this situation.





[jira] [Commented] (FLINK-33278) RemotePekkoRpcActorTest.failsRpcResultImmediatelyIfRemoteRpcServiceIsNotAvailable fails on AZP

2024-06-06 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852700#comment-17852700
 ] 

Matthias Pohl commented on FLINK-33278:
---

What are you seeing on Flink 1.18.x? Are you referring to the test instability 
or some error?

I guess, we haven't continued working on the issue for now because that test 
instability was only observed once so far.

> RemotePekkoRpcActorTest.failsRpcResultImmediatelyIfRemoteRpcServiceIsNotAvailable
>  fails on AZP
> --
>
> Key: FLINK-33278
> URL: https://issues.apache.org/jira/browse/FLINK-33278
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / RPC
>Affects Versions: 1.19.0
>Reporter: Sergey Nuyanzin
>Priority: Critical
>  Labels: test-stability
> Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png
>
>
> This build 
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=53740=logs=0e7be18f-84f2-53f0-a32d-4a5e4a174679=7c1d86e3-35bd-5fd5-3b7c-30c126a78702=6563]
> fails as
> {noformat}
> Oct 15 01:02:20 Multiple Failures (1 failure)
> Oct 15 01:02:20 -- failure 1 --
> Oct 15 01:02:20 [Any cause is instance of class 'class 
> org.apache.flink.runtime.rpc.exceptions.RecipientUnreachableException'] 
> Oct 15 01:02:20 Expecting any element of:
> Oct 15 01:02:20   [java.util.concurrent.CompletionException: 
> java.util.concurrent.TimeoutException: Invocation of 
> [RemoteRpcInvocation(SerializedValueRespondingGateway.getSerializedValue())] 
> at recipient 
> [pekko.tcp://flink@localhost:38231/user/rpc/8c211f34-41e5-4efe-93bd-8eca6c590a7f]
>  timed out. This is usually caused by: 1) Pekko failed sending the message 
> silently, due to problems like oversized payload or serialization failures. 
> In that case, you should find detailed error information in the logs. 2) The 
> recipient needs more time for responding, due to problems like slow machines 
> or network jitters. In that case, you can try to increase pekko.ask.timeout.
> Oct 15 01:02:20   at 
> java.util.concurrent.CompletableFuture.reportJoin(CompletableFuture.java:375)
> Oct 15 01:02:20   at 
> java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
> Oct 15 01:02:20   at 
> org.apache.flink.runtime.rpc.pekko.RemotePekkoRpcActorTest.lambda$failsRpcResultImmediatelyIfRemoteRpcServiceIsNotAvailable$1(RemotePekkoRpcActorTest.java:168)
> Oct 15 01:02:20   ...(63 remaining lines not displayed - this can be 
> changed with Assertions.setMaxStackTraceElementsDisplayed),
> ...
> {noformat}





[jira] [Commented] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-06-05 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852519#comment-17852519
 ] 

Matthias Pohl commented on FLINK-34989:
---

Quote from today's Infra roundtable:
* The job concurrency policies are not enforced for now
* The FT runner policy items are monitored and enforced by Infra

> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions=flink=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to misconfigurations in their builds, some projects have been using 
> unsupportable numbers of GitHub Actions. As part of fixing this situation, 
> Infra has added a 'resource use' section to the policy on GitHub Actions. 
> This section of the policy will come into effect on April 20, 2024:
> All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> Projects whose builds consistently cross the maximum use limits will lose 
> their access to GitHub Actions until they fix their build configurations.
> The full policy is at  
> https://infra.apache.org/github-actions-policy.html.
> {quote}
> Currently (last week of March 2024) Flink was ranked #19 among projects that 
> used the Apache Infra runner resources the most, which doesn't seem too bad. 
> This contained not only Apache Flink but also the Kubernetes operator, 
> connectors and other resources. According to [this 
> source|https://infra.apache.org/github-actions-secrets.html] Apache Infra 
> manages 180 runners right now.
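As a quick sanity check of the quoted limits (a hypothetical helper, assuming "full-time" means 24 hours a day): 30 runners over 5 days is 216,000 minutes (3,600 hours), matching the policy text, while 25 runners over 7 days works out to 252,000 minutes (4,200 hours), so the quoted 250,000 minutes appears to be rounded.

```java
public class RunnerBudget {
    // Runner-minutes consumed by `runners` full-time runners over `days` days.
    static long runnerMinutes(int runners, int days) {
        return (long) runners * days * 24 * 60;
    }

    public static void main(String[] args) {
        // Five-day limit: 30 full-time runners.
        System.out.println(runnerMinutes(30, 5));       // 216000 minutes
        System.out.println(runnerMinutes(30, 5) / 60);  // 3600 hours
        // Weekly limit: 25 full-time runners.
        System.out.println(runnerMinutes(25, 7));       // 252000 minutes
        System.out.println(runnerMinutes(25, 7) / 60);  // 4200 hours
    }
}
```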





[jira] [Commented] (FLINK-35485) JobMaster failed with "the job xx has not been finished"

2024-05-30 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850600#comment-17850600
 ] 

Matthias Pohl commented on FLINK-35485:
---

Hi [~xccui], thanks for reporting the issue. Can you provide the JobManager 
logs for this case? That would help us get a better understanding of what was 
going on.

> JobMaster failed with "the job xx has not been finished"
> 
>
> Key: FLINK-35485
> URL: https://issues.apache.org/jira/browse/FLINK-35485
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.18.1
>Reporter: Xingcan Cui
>Priority: Major
>
> We ran a session cluster on K8s and used Flink SQL gateway to submit queries. 
> We hit the following rare exception once, which caused the job manager to restart.
> {code:java}
> org.apache.flink.util.FlinkException: JobMaster for job 
> 50d681ae1e8170f77b4341dda6aba9bc failed.
>   at 
> org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:1454)
>   at 
> org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:776)
>   at 
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$6(Dispatcher.java:698)
>   at java.base/java.util.concurrent.CompletableFuture.uniHandle(Unknown 
> Source)
>   at 
> java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(Unknown 
> Source)
>   at java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown 
> Source)
>   at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:451)
>   at 
> org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>   at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:451)
>   at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:218)
>   at 
> org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor.handleRpcMessage(FencedPekkoRpcActor.java:85)
>   at 
> org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:168)
>   at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33)
>   at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29)
>   at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
>   at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
>   at 
> org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
>   at org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547)
>   at org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545)
>   at 
> org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229)
>   at org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590)
>   at org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557)
>   at org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280)
>   at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241)
>   at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253)
>   at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
>   at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown 
> Source)
>   at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
>   at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
>   at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
> Caused by: org.apache.flink.runtime.jobmaster.JobNotFinishedException: The 
> job (50d681ae1e8170f77b4341dda6aba9bc) has not been finished.
>   at 
> org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.closeAsync(DefaultJobMasterServiceProcess.java:157)
>   at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.stopJobMasterServiceProcess(JobMasterServiceLeadershipRunner.java:431)
>   at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.callIfRunning(JobMasterServiceLeadershipRunner.java:476)
>   at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$stopJobMasterServiceProcessAsync$12(JobMasterServiceLeadershipRunner.java:407)
>   at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(Unknown 
> Source)
>   at java.base/java.util.concurrent.CompletableFuture.thenCompose(Unknown 
> Source)
>   at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.stopJobMasterServiceProcessAsync(JobMasterServiceLeadershipRunner.java:405)
>   at 
> 

[jira] [Commented] (FLINK-34513) GroupAggregateRestoreTest.testRestore fails

2024-05-29 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850387#comment-17850387
 ] 

Matthias Pohl commented on FLINK-34513:
---

* 1.20 (Java 8): 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=59935=logs=0c940707-2659-5648-cbe6-a1ad63045f0a=075c2716-8010-5565-fe08-3c4bb45824a4=11686

> GroupAggregateRestoreTest.testRestore fails
> ---
>
> Key: FLINK-34513
> URL: https://issues.apache.org/jira/browse/FLINK-34513
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / Planner
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Assignee: Bonnie Varghese
>Priority: Critical
>  Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57828=logs=26b84117-e436-5720-913e-3e280ce55cae=77cc7e77-39a0-5007-6d65-4137ac13a471=10881
> {code}
> Feb 24 01:12:01 01:12:01.384 [ERROR] Tests run: 10, Failures: 1, Errors: 0, 
> Skipped: 1, Time elapsed: 2.957 s <<< FAILURE! -- in 
> org.apache.flink.table.planner.plan.nodes.exec.stream.GroupAggregateRestoreTest
> Feb 24 01:12:01 01:12:01.384 [ERROR] 
> org.apache.flink.table.planner.plan.nodes.exec.stream.GroupAggregateRestoreTest.testRestore(TableTestProgram,
>  ExecNodeMetadata)[4] -- Time elapsed: 0.653 s <<< FAILURE!
> Feb 24 01:12:01 java.lang.AssertionError: 
> Feb 24 01:12:01 
> Feb 24 01:12:01 Expecting actual:
> Feb 24 01:12:01   ["+I[3, 1, 2, 8, 31, 10.0, 3]",
> Feb 24 01:12:01 "+I[2, 1, 4, 14, 42, 7.0, 6]",
> Feb 24 01:12:01 "+I[1, 1, 4, 12, 24, 6.0, 4]",
> Feb 24 01:12:01 "+U[2, 1, 4, 14, 57, 8.0, 7]",
> Feb 24 01:12:01 "+U[1, 1, 4, 12, 32, 6.0, 5]",
> Feb 24 01:12:01 "+I[7, 0, 1, 7, 7, 7.0, 1]",
> Feb 24 01:12:01 "+U[2, 1, 4, 14, 57, 7.0, 7]",
> Feb 24 01:12:01 "+U[1, 1, 4, 12, 32, 5.0, 5]",
> Feb 24 01:12:01 "+U[3, 1, 2, 8, 31, 9.0, 3]",
> Feb 24 01:12:01 "+U[7, 0, 1, 7, 7, 7.0, 2]"]
> Feb 24 01:12:01 to contain exactly in any order:
> Feb 24 01:12:01   ["+I[3, 1, 2, 8, 31, 10.0, 3]",
> Feb 24 01:12:01 "+I[2, 1, 4, 14, 42, 7.0, 6]",
> Feb 24 01:12:01 "+I[1, 1, 4, 12, 24, 6.0, 4]",
> Feb 24 01:12:01 "+U[2, 1, 4, 14, 57, 8.0, 7]",
> Feb 24 01:12:01 "+U[1, 1, 4, 12, 32, 6.0, 5]",
> Feb 24 01:12:01 "+U[3, 1, 2, 8, 31, 9.0, 3]",
> Feb 24 01:12:01 "+U[2, 1, 4, 14, 57, 7.0, 7]",
> Feb 24 01:12:01 "+I[7, 0, 1, 7, 7, 7.0, 2]",
> Feb 24 01:12:01 "+U[1, 1, 4, 12, 32, 5.0, 5]"]
> Feb 24 01:12:01 elements not found:
> Feb 24 01:12:01   ["+I[7, 0, 1, 7, 7, 7.0, 2]"]
> Feb 24 01:12:01 and elements not expected:
> Feb 24 01:12:01   ["+I[7, 0, 1, 7, 7, 7.0, 1]", "+U[7, 0, 1, 7, 7, 7.0, 2]"]
> Feb 24 01:12:01 
> Feb 24 01:12:01   at 
> org.apache.flink.table.planner.plan.nodes.exec.testutils.RestoreTestBase.testRestore(RestoreTestBase.java:313)
> Feb 24 01:12:01   at 
> java.base/java.lang.reflect.Method.invoke(Method.java:580)
> [...]
> {code}





[jira] [Commented] (FLINK-34582) release build tools lost the newly added py3.11 packages for mac

2024-05-24 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849185#comment-17849185
 ] 

Matthias Pohl commented on FLINK-34582:
---

You're checking [~hxb]'s fork where the {{master}} branch doesn't seem to be 
up-to-date. 
[apache/flink:flink-python/dev/build-wheels.sh|https://github.com/apache/flink/blob/master/flink-python/dev/build-wheels.sh#L19-L26]
 does, indeed, have 3.11 added to the python version list.

> release build tools lost the newly added py3.11 packages for mac
> 
>
> Key: FLINK-34582
> URL: https://issues.apache.org/jira/browse/FLINK-34582
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.19.0, 1.20.0
>Reporter: lincoln lee
>Assignee: Xingbo Huang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.19.0, 1.20.0
>
> Attachments: image-2024-03-07-10-39-49-341.png
>
>
> during 1.19.0-rc1 building binaries via 
> tools/releasing/create_binary_release.sh
> lost the newly added py3.11  2 packages for mac





[jira] [Commented] (FLINK-34672) HA deadlock between JobMasterServiceLeadershipRunner and DefaultLeaderElectionService

2024-05-22 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848648#comment-17848648
 ] 

Matthias Pohl commented on FLINK-34672:
---

I'm still trying to find a reviewer. It's on my plate. But it's not a blocker 
because the issue already existed in older versions of Flink:
{quote}
I also verified that this is not something that was introduced in Flink 1.18 
with the FLIP-285 changes. AFAIS, it can also happen in 1.17- (I didn't check 
the pre-FLINK-24038 code but only looked into release-1.17).
{quote}

> HA deadlock between JobMasterServiceLeadershipRunner and 
> DefaultLeaderElectionService
> -
>
> Key: FLINK-34672
> URL: https://issues.apache.org/jira/browse/FLINK-34672
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.17.2, 1.19.0, 1.18.1, 1.20.0
>Reporter: Chesnay Schepler
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> We recently observed a deadlock in the JM within the HA system.
> (see below for the thread dump)
> [~mapohl] and I looked a bit into it and there appears to be a race condition 
> when leadership is revoked while a JobMaster is being started.
> It appears to be caused by 
> {{JobMasterServiceLeadershipRunner#createNewJobMasterServiceProcess}} 
> forwarding futures while holding a lock; depending on whether the forwarded 
> future is already complete the next stage may or may not run while holding 
> that same lock.
> We haven't determined yet whether we should be holding that lock or not.
> {code}
> "DefaultLeaderElectionService-leadershipOperationExecutor-thread-1" #131 
> daemon prio=5 os_prio=0 cpu=157.44ms elapsed=78749.65s tid=0x7f531f43d000 
> nid=0x19d waiting for monitor entry  [0x7f53084fd000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfStateRunning(JobMasterServiceLeadershipRunner.java:462)
> - waiting to lock <0xf1c0e088> (a java.lang.Object)
> at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.revokeLeadership(JobMasterServiceLeadershipRunner.java:397)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.notifyLeaderContenderOfLeadershipLoss(DefaultLeaderElectionService.java:484)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1252/0x000840ddec40.accept(Unknown
>  Source)
> at java.util.HashMap.forEach(java.base@11.0.22/HashMap.java:1337)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.onRevokeLeadershipInternal(DefaultLeaderElectionService.java:452)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1251/0x000840dcf840.run(Unknown
>  Source)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.lambda$runInLeaderEventThread$3(DefaultLeaderElectionService.java:549)
> - locked <0xf0e3f4d8> (a java.lang.Object)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1075/0x000840c23040.run(Unknown
>  Source)
> at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(java.base@11.0.22/CompletableFuture.java:1736)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.22/ThreadPoolExecutor.java:1128)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.22/ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)
> {code}
> {code}
> "jobmanager-io-thread-1" #636 daemon prio=5 os_prio=0 cpu=125.56ms 
> elapsed=78699.01s tid=0x7f5321c6e800 nid=0x396 waiting for monitor entry  
> [0x7f530567d000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.hasLeadership(DefaultLeaderElectionService.java:366)
> - waiting to lock <0xf0e3f4d8> (a java.lang.Object)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElection.hasLeadership(DefaultLeaderElection.java:52)
> at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.isValidLeader(JobMasterServiceLeadershipRunner.java:509)
> at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$forwardIfValidLeader$15(JobMasterServiceLeadershipRunner.java:520)
> - locked <0xf1c0e088> (a java.lang.Object)
> at 
> 
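The hazard described in this thread (completing or forwarding a future while holding a lock, so that already-attached dependent stages run synchronously under that lock) can be reproduced in isolation. This is an illustrative sketch, not Flink code:

```java
import java.util.concurrent.CompletableFuture;

public class LockUnderCompletionSketch {
    static final Object lock = new Object();

    public static void main(String[] args) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        // The dependent stage is attached before completion, so complete()
        // will run it synchronously on the completing thread.
        future.thenRun(() ->
                System.out.println("dependent stage holds lock: "
                        + Thread.holdsLock(lock)));
        synchronized (lock) {
            // The stage above executes right here, still inside the lock;
            // if it needs a second lock held elsewhere, a deadlock can result.
            future.complete(null);
        }
    }
}
```

This prints that the dependent stage does hold the lock, which is exactly why forwarding futures while holding a lock is risky: whether the next stage runs under the lock depends on whether the forwarded future is already complete.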

[jira] [Assigned] (FLINK-20402) Migrate test_tpch.sh

2024-05-21 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-20402:
-

Assignee: Muhammet Orazov

> Migrate test_tpch.sh
> 
>
> Key: FLINK-20402
> URL: https://issues.apache.org/jira/browse/FLINK-20402
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / Ecosystem, Tests
>Reporter: Jark Wu
>Assignee: Muhammet Orazov
>Priority: Major
>






[jira] [Commented] (FLINK-20392) Migrating bash e2e tests to Java/Docker

2024-05-16 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846924#comment-17846924
 ] 

Matthias Pohl commented on FLINK-20392:
---

Sure, sounds reasonable. Feel free to update it.

> Migrating bash e2e tests to Java/Docker
> ---
>
> Key: FLINK-20392
> URL: https://issues.apache.org/jira/browse/FLINK-20392
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Test Infrastructure, Tests
>Reporter: Matthias Pohl
>Priority: Minor
>  Labels: auto-deprioritized-major, auto-deprioritized-minor, 
> starter
>
> This Jira issue serves as an umbrella ticket for single e2e test migration 
> tasks. This should enable us to migrate all bash-based e2e tests step-by-step.
> The goal is to utilize the e2e test framework (see 
> [flink-end-to-end-tests-common|https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-end-to-end-tests-common]).
>  Ideally, the test should use Docker containers as much as possible to 
> disconnect the execution from the environment. A good source to achieve that 
> is [testcontainers.org|https://www.testcontainers.org/].
> The related ML discussion is [Stop adding new bash-based e2e tests to 
> Flink|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stop-adding-new-bash-based-e2e-tests-to-Flink-td46607.html].





[jira] [Commented] (FLINK-20392) Migrating bash e2e tests to Java/Docker

2024-05-16 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846838#comment-17846838
 ] 

Matthias Pohl commented on FLINK-20392:
---

This discussion feels similar to our efforts around migrating to JUnit5 and 
AssertJ as the standard for JUnit tests. It cost (and is still costing) quite a 
bit of resources, with the risk of missing things when reviewing the tests.
That is why I still see value in just keeping both options around. That 
requires fewer resources and we're not losing much. The pros and cons are still 
a good guideline for developers to decide which technology to use if they 
are planning to create a new e2e test in Java. WDYT?

> Migrating bash e2e tests to Java/Docker
> ---
>
> Key: FLINK-20392
> URL: https://issues.apache.org/jira/browse/FLINK-20392
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Test Infrastructure, Tests
>Reporter: Matthias Pohl
>Priority: Minor
>  Labels: auto-deprioritized-major, auto-deprioritized-minor, 
> starter
>
> This Jira issue serves as an umbrella ticket for single e2e test migration 
> tasks. This should enable us to migrate all bash-based e2e tests step-by-step.
> The goal is to utilize the e2e test framework (see 
> [flink-end-to-end-tests-common|https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-end-to-end-tests-common]).
>  Ideally, the test should use Docker containers as much as possible to 
> disconnect the execution from the environment. A good source to achieve that 
> is [testcontainers.org|https://www.testcontainers.org/].
> The related ML discussion is [Stop adding new bash-based e2e tests to 
> Flink|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stop-adding-new-bash-based-e2e-tests-to-Flink-td46607.html].





[jira] [Commented] (FLINK-20392) Migrating bash e2e tests to Java/Docker

2024-05-15 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846540#comment-17846540
 ] 

Matthias Pohl commented on FLINK-20392:
---

Thanks for the write-up. I'm just wondering whether we gain anything from only 
allowing one of the two approaches. What about allowing both options?

> Migrating bash e2e tests to Java/Docker
> ---
>
> Key: FLINK-20392
> URL: https://issues.apache.org/jira/browse/FLINK-20392
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Test Infrastructure, Tests
>Reporter: Matthias Pohl
>Priority: Minor
>  Labels: auto-deprioritized-major, auto-deprioritized-minor, 
> starter
>
> This Jira issue serves as an umbrella ticket for single e2e test migration 
> tasks. This should enable us to migrate all bash-based e2e tests step-by-step.
> The goal is to utilize the e2e test framework (see 
> [flink-end-to-end-tests-common|https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-end-to-end-tests-common]).
>  Ideally, the test should use Docker containers as much as possible to 
> disconnect the execution from the environment. A good source to achieve that 
> is [testcontainers.org|https://www.testcontainers.org/].
> The related ML discussion is [Stop adding new bash-based e2e tests to 
> Flink|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stop-adding-new-bash-based-e2e-tests-to-Flink-td46607.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34324) s3_setup is called in test_file_sink.sh even if the common_s3.sh is not sourced

2024-05-10 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845238#comment-17845238
 ] 

Matthias Pohl edited comment on FLINK-34324 at 5/10/24 8:07 AM:


* master: 
[93526c2f3247598ce80854cf65dd4440eb5aaa43|https://github.com/apache/flink/commit/93526c2f3247598ce80854cf65dd4440eb5aaa43]
* 1.19: 
[8707c63ee147085671a9ae1b294854bac03fc914|https://github.com/apache/flink/commit/8707c63ee147085671a9ae1b294854bac03fc914]
* 1.18: 
[7d98ab060be82fe3684d15501b9eb83373303d18|https://github.com/apache/flink/commit/7d98ab060be82fe3684d15501b9eb83373303d18]


was (Author: mapohl):
* master
** 
[93526c2f3247598ce80854cf65dd4440eb5aaa43|https://github.com/apache/flink/commit/93526c2f3247598ce80854cf65dd4440eb5aaa43]
* 1.19
** 
[8707c63ee147085671a9ae1b294854bac03fc914|https://github.com/apache/flink/commit/8707c63ee147085671a9ae1b294854bac03fc914]
* 1.18
** 
[7d98ab060be82fe3684d15501b9eb83373303d18|https://github.com/apache/flink/commit/7d98ab060be82fe3684d15501b9eb83373303d18]

> s3_setup is called in test_file_sink.sh even if the common_s3.sh is not 
> sourced
> ---
>
> Key: FLINK-34324
> URL: https://issues.apache.org/jira/browse/FLINK-34324
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Hadoop Compatibility, Tests
>Affects Versions: 1.17.2, 1.19.0, 1.18.1
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> See example CI run from the FLINK-34150 PR:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56570&view=logs&j=af184cdd-c6d8-5084-0b69-7e9c67b35f7a&t=0f3adb59-eefa-51c6-2858-3654d9e0749d&l=3191
> {code}
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_file_sink.sh: 
> line 38: s3_setup: command not found
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (FLINK-34324) s3_setup is called in test_file_sink.sh even if the common_s3.sh is not sourced

2024-05-10 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-34324.
---
Fix Version/s: 1.18.2
   1.20.0
   1.19.1
   Resolution: Fixed

* master
** 
[93526c2f3247598ce80854cf65dd4440eb5aaa43|https://github.com/apache/flink/commit/93526c2f3247598ce80854cf65dd4440eb5aaa43]
* 1.19
** 
[8707c63ee147085671a9ae1b294854bac03fc914|https://github.com/apache/flink/commit/8707c63ee147085671a9ae1b294854bac03fc914]
* 1.18
** 
[7d98ab060be82fe3684d15501b9eb83373303d18|https://github.com/apache/flink/commit/7d98ab060be82fe3684d15501b9eb83373303d18]
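The underlying failure was test_file_sink.sh calling {{s3_setup}} unconditionally even when common_s3.sh had not been sourced. A defensive pattern for such optional helpers can be sketched in bash (illustrative only, not the exact fix from the commits above):

```shell
#!/usr/bin/env bash
# Guard an optional helper: only call s3_setup when it is actually defined,
# i.e. when common_s3.sh (or a similar helper script) was sourced beforehand.
call_if_defined() {
  local fn="$1"
  shift
  if declare -F "$fn" > /dev/null; then
    "$fn" "$@"
  else
    echo "Skipping ${fn}: not defined (helper script was not sourced)" >&2
  fi
}

call_if_defined s3_setup
```

With this guard, sourcing common_s3.sh enables the setup step, and scripts that never source it degrade to a logged skip instead of `command not found`.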

> s3_setup is called in test_file_sink.sh even if the common_s3.sh is not 
> sourced
> ---
>
> Key: FLINK-34324
> URL: https://issues.apache.org/jira/browse/FLINK-34324
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Hadoop Compatibility, Tests
>Affects Versions: 1.17.2, 1.19.0, 1.18.1
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> See example CI run from the FLINK-34150 PR:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56570&view=logs&j=af184cdd-c6d8-5084-0b69-7e9c67b35f7a&t=0f3adb59-eefa-51c6-2858-3654d9e0749d&l=3191
> {code}
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_file_sink.sh: 
> line 38: s3_setup: command not found
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-34937) Apache Infra GHA policy update

2024-05-02 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-34937:
-

Assignee: Matthias Pohl

> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> There is a policy update [announced in the infra 
> ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
> asked Apache projects to limit the number of runners per job. Additionally, 
> the [GHA policy|https://infra.apache.org/github-actions-policy.html] is 
> referenced which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-04 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833910#comment-17833910
 ] 

Matthias Pohl commented on FLINK-34989:
---

[~martijnvisser] pointed out that we might need to fix this in the connector 
repos as well.

> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions&project=flink&hours=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to misconfigurations in their builds, some projects have been using 
> unsupportable numbers of GitHub Actions. As part of fixing this situation, 
> Infra has added a 'resource use' section to the policy on GitHub Actions. 
> This section of the policy will come into effect on April 20, 2024:
> All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> Projects whose builds consistently cross the maximum use limits will lose 
> their access to GitHub Actions until they fix their build configurations.
> The full policy is at  
> https://infra.apache.org/github-actions-policy.html.
> {quote}
> Currently (last week of March 2024) Flink was ranked at #19 of projects that 
> used the Apache Infra runner resources the most, which doesn't seem too bad. 
> This contained not only Apache Flink but also the Kubernetes operator, 
> connectors and other resources. According to [this 
> source|https://infra.apache.org/github-actions-secrets.html] Apache Infra 
> manages 180 runners right now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (FLINK-34999) PR CI stopped operating

2024-04-04 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-34999.
---
Resolution: Fixed

Thanks for working on it. I verified that [PR 
CI|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] is 
picked up again. (y)

> PR CI stopped operating
> ---
>
> Key: FLINK-34999
> URL: https://issues.apache.org/jira/browse/FLINK-34999
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Blocker
>
> There are no [new PR CI 
> runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] 
> being picked up anymore. [Recently updated 
> PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not 
> picked up by the @flinkbot.
> In the meantime there was a notification sent from GitHub that the password 
> of the [@flinkbot|https://github.com/flinkbot] was reset for security 
> reasons. It's quite likely that these two events are related.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35005) SqlClientITCase Failed to build JobManager image

2024-04-04 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-35005:
--
Component/s: Test Infrastructure
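The {{500 Internal Server Error}} from the Docker registry in the log below is typically a transient registry-side failure, so wrapping the image pull in a retry with backoff is a common mitigation. A bash sketch (the command, attempt count, and delays are illustrative assumptions, not the project's actual fix):

```shell
#!/usr/bin/env bash
# Retry a flaky command with exponential backoff, e.g. a "docker pull" that
# intermittently fails with a transient registry error such as an HTTP 500.
retry_with_backoff() {
  local max_attempts="$1"; shift
  local delay=1
  local attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      return 0
    fi
    if (( attempt < max_attempts )); then
      echo "Attempt ${attempt}/${max_attempts} failed; retrying in ${delay}s" >&2
      sleep "${delay}"
      delay=$(( delay * 2 ))
    fi
  done
  echo "Command failed after ${max_attempts} attempts: $*" >&2
  return 1
}

# Illustrative usage:
#   retry_with_backoff 5 docker pull eclipse-temurin:21-jre-jammy
```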

> SqlClientITCase Failed to build JobManager image
> 
>
> Key: FLINK-35005
> URL: https://issues.apache.org/jira/browse/FLINK-35005
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Affects Versions: 1.20.0
>Reporter: Ryan Skraba
>Priority: Critical
>  Labels: test-stability
>
> jdk21 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58708&view=logs&j=dc1bf4ed-4646-531a-f094-e103042be549&t=fb3d654d-52f8-5b98-fe9d-b18dd2e2b790&l=15140
> {code}
> Apr 03 02:59:16 02:59:16.247 [INFO] 
> ---
> Apr 03 02:59:16 02:59:16.248 [INFO]  T E S T S
> Apr 03 02:59:16 02:59:16.248 [INFO] 
> ---
> Apr 03 02:59:17 02:59:17.841 [INFO] Running SqlClientITCase
> Apr 03 03:03:15   at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
> Apr 03 03:03:15   at 
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
> Apr 03 03:03:15   at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
> Apr 03 03:03:15   at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
> Apr 03 03:03:15 Caused by: 
> org.apache.flink.connector.testframe.container.ImageBuildException: Failed to 
> build image "flink-configured-jobmanager"
> Apr 03 03:03:15   at 
> org.apache.flink.connector.testframe.container.FlinkImageBuilder.build(FlinkImageBuilder.java:234)
> Apr 03 03:03:15   at 
> org.apache.flink.connector.testframe.container.FlinkTestcontainersConfigurator.configureJobManagerContainer(FlinkTestcontainersConfigurator.java:65)
> Apr 03 03:03:15   ... 12 more
> Apr 03 03:03:15 Caused by: java.lang.RuntimeException: 
> com.github.dockerjava.api.exception.DockerClientException: Could not build 
> image: Head 
> "https://registry-1.docker.io/v2/library/eclipse-temurin/manifests/21-jre-jammy":
>  received unexpected HTTP status: 500 Internal Server Error
> Apr 03 03:03:15   at 
> org.rnorth.ducttape.timeouts.Timeouts.callFuture(Timeouts.java:68)
> Apr 03 03:03:15   at 
> org.rnorth.ducttape.timeouts.Timeouts.getWithTimeout(Timeouts.java:43)
> Apr 03 03:03:15   at 
> org.testcontainers.utility.LazyFuture.get(LazyFuture.java:47)
> Apr 03 03:03:15   at 
> org.apache.flink.connector.testframe.container.FlinkImageBuilder.buildBaseImage(FlinkImageBuilder.java:255)
> Apr 03 03:03:15   at 
> org.apache.flink.connector.testframe.container.FlinkImageBuilder.build(FlinkImageBuilder.java:206)
> Apr 03 03:03:15   ... 13 more
> Apr 03 03:03:15 Caused by: 
> com.github.dockerjava.api.exception.DockerClientException: Could not build 
> image: Head 
> "https://registry-1.docker.io/v2/library/eclipse-temurin/manifests/21-jre-jammy":
>  received unexpected HTTP status: 500 Internal Server Error
> Apr 03 03:03:15   at 
> com.github.dockerjava.api.command.BuildImageResultCallback.getImageId(BuildImageResultCallback.java:78)
> Apr 03 03:03:15   at 
> com.github.dockerjava.api.command.BuildImageResultCallback.awaitImageId(BuildImageResultCallback.java:50)
> Apr 03 03:03:15   at 
> org.testcontainers.images.builder.ImageFromDockerfile.resolve(ImageFromDockerfile.java:159)
> Apr 03 03:03:15   at 
> org.testcontainers.images.builder.ImageFromDockerfile.resolve(ImageFromDockerfile.java:40)
> Apr 03 03:03:15   at 
> org.testcontainers.utility.LazyFuture.getResolvedValue(LazyFuture.java:19)
> Apr 03 03:03:15   at 
> org.testcontainers.utility.LazyFuture.get(LazyFuture.java:41)
> Apr 03 03:03:15   at 
> java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
> Apr 03 03:03:15   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
> Apr 03 03:03:15   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
> Apr 03 03:03:15   at java.base/java.lang.Thread.run(Thread.java:1583)
> Apr 03 03:03:15 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35004) SqlGatewayE2ECase could not start container

2024-04-04 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-35004:
--
Component/s: Test Infrastructure

> SqlGatewayE2ECase could not start container
> ---
>
> Key: FLINK-35004
> URL: https://issues.apache.org/jira/browse/FLINK-35004
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Affects Versions: 1.20.0
>Reporter: Ryan Skraba
>Priority: Critical
>  Labels: github-actions, test-stability
>
> 1.20, jdk17: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58708&view=logs&j=e8e46ef5-75cc-564f-c2bd-1797c35cbebe&t=60c49903-2505-5c25-7e46-de91b1737bea&l=15078
> There is an error: "Process failed due to timeout" in 
> {{SqlGatewayE2ECase.testSqlClientExecuteStatement}}.  In the maven logs, we 
> can see:
> {code:java}
> 02:57:26,979 [main] INFO  tc.prestodb/hdp2.6-hive:10  
>  [] - Image prestodb/hdp2.6-hive:10 pull took 
> PT43.59218S
> 02:57:26,991 [main] INFO  tc.prestodb/hdp2.6-hive:10  
>  [] - Creating container for image: 
> prestodb/hdp2.6-hive:10
> 02:57:27,032 [main] INFO  tc.prestodb/hdp2.6-hive:10  
>  [] - Container prestodb/hdp2.6-hive:10 is starting: 
> 162069678c7d03252a42ed81ca43e1911ca7357c476a4a5de294ffe55bd83145
> 02:57:42,846 [main] INFO  tc.prestodb/hdp2.6-hive:10  
>  [] - Container prestodb/hdp2.6-hive:10 started in 
> PT15.855339866S
> 02:57:53,447 [main] ERROR tc.prestodb/hdp2.6-hive:10  
>  [] - Could not start container
> java.lang.RuntimeException: java.net.SocketTimeoutException: timeout
>   at 
> org.apache.flink.table.gateway.containers.HiveContainer.containerIsStarted(HiveContainer.java:94)
>  ~[test-classes/:?]
>   at 
> org.testcontainers.containers.GenericContainer.containerIsStarted(GenericContainer.java:723)
>  ~[testcontainers-1.19.1.jar:1.19.1]
>   at 
> org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:543)
>  ~[testcontainers-1.19.1.jar:1.19.1]
>   at 
> org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:354)
>  ~[testcontainers-1.19.1.jar:1.19.1]
>   at 
> org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81)
>  ~[duct-tape-1.0.8.jar:?]
>   at 
> org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:344)
>  ~[testcontainers-1.19.1.jar:1.19.1]
>   at 
> org.apache.flink.table.gateway.containers.HiveContainer.doStart(HiveContainer.java:69)
>  ~[test-classes/:?]
>   at 
> org.testcontainers.containers.GenericContainer.start(GenericContainer.java:334)
>  ~[testcontainers-1.19.1.jar:1.19.1]
>   at 
> org.testcontainers.containers.GenericContainer.starting(GenericContainer.java:1144)
>  ~[testcontainers-1.19.1.jar:1.19.1]
>   at 
> org.testcontainers.containers.FailureDetectingExternalResource$1.evaluate(FailureDetectingExternalResource.java:28)
>  ~[testcontainers-1.19.1.jar:1.19.1]
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20) 
> ~[junit-4.13.2.jar:4.13.2]
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
> ~[junit-4.13.2.jar:4.13.2]
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:413) 
> ~[junit-4.13.2.jar:4.13.2]
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137) 
> ~[junit-4.13.2.jar:4.13.2]
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:115) 
> ~[junit-4.13.2.jar:4.13.2]
>   at 
> org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:42)
>  ~[junit-vintage-engine-5.10.1.jar:5.10.1]
>   at 
> org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:80)
>  ~[junit-vintage-engine-5.10.1.jar:5.10.1]
>   at 
> org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:72) 
> ~[junit-vintage-engine-5.10.1.jar:5.10.1]
>   at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:198)
>  ~[junit-platform-launcher-1.10.1.jar:1.10.1]
>   at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:169)
>  ~[junit-platform-launcher-1.10.1.jar:1.10.1]
>   at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:93)
>  ~[junit-platform-launcher-1.10.1.jar:1.10.1]
>   at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:58)
>  ~[junit-platform-launcher-1.10.1.jar:1.10.1]
>   at 
> 

[jira] [Resolved] (FLINK-35000) PullRequest template doesn't use the correct format to refer to the testing code convention

2024-04-03 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-35000.
---
Fix Version/s: 1.18.2
   1.20.0
   1.19.1
   Resolution: Fixed

master: 
[d301839dfe2ed9b1313d23f8307bda76868a0c0a|https://github.com/apache/flink/commit/d301839dfe2ed9b1313d23f8307bda76868a0c0a]
1.19: 
[eb58599b434b6c5fe86f6e487ce88315c98b4ec3|https://github.com/apache/flink/commit/eb58599b434b6c5fe86f6e487ce88315c98b4ec3]
1.18: 
[9150f93b18b8694646092a6ed24a14e3653f613f|https://github.com/apache/flink/commit/9150f93b18b8694646092a6ed24a14e3653f613f]

> PullRequest template doesn't use the correct format to refer to the testing 
> code convention
> ---
>
> Key: FLINK-35000
> URL: https://issues.apache.org/jira/browse/FLINK-35000
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI, Project Website
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> The PR template refers to 
> https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
>  rather than 
> https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35002) GitHub action/upload-artifact@v4 can timeout

2024-04-03 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-35002:
--
Labels: github-actions test-stability  (was: test-stability)

> GitHub action/upload-artifact@v4 can timeout
> 
>
> Key: FLINK-35002
> URL: https://issues.apache.org/jira/browse/FLINK-35002
> Project: Flink
>  Issue Type: Bug
>  Components: Build System
>Reporter: Ryan Skraba
>Priority: Major
>  Labels: github-actions, test-stability
>
> A timeout can occur when uploading a successfully built artifact:
>  * [https://github.com/apache/flink/actions/runs/8516411871/job/23325392650]
> {code:java}
> 2024-04-02T02:20:15.6355368Z With the provided path, there will be 1 file 
> uploaded
> 2024-04-02T02:20:15.6360133Z Artifact name is valid!
> 2024-04-02T02:20:15.6362872Z Root directory input is valid!
> 2024-04-02T02:20:20.6975036Z Attempt 1 of 5 failed with error: Request 
> timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. 
> Retrying request in 3000 ms...
> 2024-04-02T02:20:28.7084937Z Attempt 2 of 5 failed with error: Request 
> timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. 
> Retrying request in 4785 ms...
> 2024-04-02T02:20:38.5015936Z Attempt 3 of 5 failed with error: Request 
> timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. 
> Retrying request in 7375 ms...
> 2024-04-02T02:20:50.8901508Z Attempt 4 of 5 failed with error: Request 
> timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. 
> Retrying request in 14988 ms...
> 2024-04-02T02:21:10.9028438Z ##[error]Failed to CreateArtifact: Failed to 
> make request after 5 attempts: Request timeout: 
> /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact
> 2024-04-02T02:22:59.9893296Z Post job cleanup.
> 2024-04-02T02:22:59.9958844Z Post job cleanup. {code}
> (This is unlikely to be something we can fix, but we can track it.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34999) PR CI stopped operating

2024-04-03 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34999:
--
Description: 
There are no [new PR CI 
runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] 
being picked up anymore. [Recently updated 
PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not picked 
up by the @flinkbot.

In the meantime there was a notification sent from GitHub that the password of 
the [@flinkbot|https://github.com/flinkbot] was reset for security reasons. 
It's quite likely that these two events are related.

  was:
There are no [new PR CI 
runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] 
being picked up anymore. [Recently updated 
PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not picked 
up by the @flinkbot.

In the meantime there was a notification sent from GitHub that the password of 
the @flinkbot was reset for security reasons. It's quite likely that these two 
events are related.


> PR CI stopped operating
> ---
>
> Key: FLINK-34999
> URL: https://issues.apache.org/jira/browse/FLINK-34999
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Blocker
>
> There are no [new PR CI 
> runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] 
> being picked up anymore. [Recently updated 
> PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not 
> picked up by the @flinkbot.
> In the meantime there was a notification sent from GitHub that the password 
> of the [@flinkbot|https://github.com/flinkbot] was reset for security 
> reasons. It's quite likely that these two events are related.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35000) PullRequest template doesn't use the correct format to refer to the testing code convention

2024-04-03 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-35000:
-

 Summary: PullRequest template doesn't use the correct format to 
refer to the testing code convention
 Key: FLINK-35000
 URL: https://issues.apache.org/jira/browse/FLINK-35000
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI, Project Website
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


The PR template refers to 
https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
 rather than 
https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-35000) PullRequest template doesn't use the correct format to refer to the testing code convention

2024-04-03 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-35000:
-

Assignee: Matthias Pohl

> PullRequest template doesn't use the correct format to refer to the testing 
> code convention
> ---
>
> Key: FLINK-35000
> URL: https://issues.apache.org/jira/browse/FLINK-35000
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI, Project Website
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Minor
>
> The PR template refers to 
> https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
>  rather than 
> https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34999) PR CI stopped operating

2024-04-03 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833523#comment-17833523
 ] 

Matthias Pohl commented on FLINK-34999:
---

CC [~uce] [~Weijie Guo] [~fanrui] [~rmetzger]
CC [~jingge] since it might be Ververica infrastructure-related

> PR CI stopped operating
> ---
>
> Key: FLINK-34999
> URL: https://issues.apache.org/jira/browse/FLINK-34999
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Blocker
>
> There are no [new PR CI 
> runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] 
> being picked up anymore. [Recently updated 
> PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not 
> picked up by the @flinkbot.
> In the meantime there was a notification sent from GitHub that the password 
> of the @flinkbot was reset for security reasons. It's quite likely that these 
> two events are related.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34999) PR CI stopped operating

2024-04-03 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34999:
-

 Summary: PR CI stopped operating
 Key: FLINK-34999
 URL: https://issues.apache.org/jira/browse/FLINK-34999
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


There are no [new PR CI 
runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] 
being picked up anymore. [Recently updated 
PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not picked 
up by the @flinkbot.

In the meantime there was a notification sent from GitHub that the password of 
the @flinkbot was reset for security reasons. It's quite likely that these two 
events are related.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34997) PyFlink YARN per-job on Docker test failed on azure

2024-04-03 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833505#comment-17833505
 ] 

Matthias Pohl commented on FLINK-34997:
---

The issue seems to be that the {{docker-compose}} binary is missing in the Azure 
VMs.
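One possible mitigation (a sketch, not necessarily the fix that was applied): newer Docker installations ship Compose as the {{docker compose}} CLI plugin rather than the standalone {{docker-compose}} binary, so the test scripts could resolve whichever variant is available:

```shell
#!/usr/bin/env bash
# Resolve a working Compose command: prefer the standalone docker-compose
# binary and fall back to the newer "docker compose" CLI plugin.
resolve_compose_cmd() {
  if command -v docker-compose > /dev/null 2>&1; then
    echo "docker-compose"
  elif docker compose version > /dev/null 2>&1; then
    echo "docker compose"
  else
    echo "Neither docker-compose nor the docker compose plugin is available" >&2
    return 1
  fi
}

# Illustrative usage in a test script:
#   COMPOSE_CMD="$(resolve_compose_cmd)" || exit 1
#   ${COMPOSE_CMD} up -d
```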

> PyFlink YARN per-job on Docker test failed on azure
> ---
>
> Key: FLINK-34997
> URL: https://issues.apache.org/jira/browse/FLINK-34997
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Blocker
>  Labels: test-stability
>
> {code}
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test'
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202
> Apr 03 03:12:37 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Docker version 24.0.9, build 2936816
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: 
> line 24: docker-compose: command not found
> Apr 03 03:12:38 [FAIL] Test script contains errors.
> Apr 03 03:12:38 Checking of logs skipped.
> Apr 03 03:12:38 
> Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 
> minutes and 1 seconds! Test exited with exit code 1
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709&view=logs&j=f8e16326-dc75-5ba0-3e95-6178dd55bf6c&t=94ccd692-49fc-5c64-8775-d427c6e65440&l=10226



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34997) PyFlink YARN per-job on Docker test failed on azure

2024-04-03 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34997:
--
Labels: test-stability  (was: )

> PyFlink YARN per-job on Docker test failed on azure
> ---
>
> Key: FLINK-34997
> URL: https://issues.apache.org/jira/browse/FLINK-34997
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Major
>  Labels: test-stability
>
> {code}
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test'
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202
> Apr 03 03:12:37 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Docker version 24.0.9, build 2936816
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: 
> line 24: docker-compose: command not found
> Apr 03 03:12:38 [FAIL] Test script contains errors.
> Apr 03 03:12:38 Checking of logs skipped.
> Apr 03 03:12:38 
> Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 
> minutes and 1 seconds! Test exited with exit code 1
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709&view=logs&j=f8e16326-dc75-5ba0-3e95-6178dd55bf6c&t=94ccd692-49fc-5c64-8775-d427c6e65440&l=10226



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34998) Wordcount on Docker test failed on azure

2024-04-03 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833504#comment-17833504
 ] 

Matthias Pohl commented on FLINK-34998:
---

I guess this one is a duplicate of FLINK-34997. In the end, the error happens 
due to the missing {{docker-compose}} binaries in the Azure VMs. WDYT?
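
Side note: since Compose V2, Docker ships Compose as a CLI plugin invoked as 
{{docker compose}}, so a host without the standalone {{docker-compose}} binary 
may still have a working Compose. A minimal, hypothetical sketch of how a test 
script could probe for a usable candidate (the helper name and candidate list 
are illustrative, not taken from the Flink test scripts):

```shell
#!/usr/bin/env sh
# Hypothetical helper, not part of Flink's test scripts: return the first
# candidate whose leading word resolves to an executable on this host.
pick_first_available() {
  for candidate in "$@"; do
    # Check only the first word, so "docker compose" resolves via "docker".
    if command -v "${candidate%% *}" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Intended use: COMPOSE_CMD=$(pick_first_available "docker-compose" "docker compose")
# Deterministic demo below uses "sh", which is always present:
chosen=$(pick_first_available "definitely-missing-xyz" "sh")
echo "$chosen"   # prints: sh
```

A script using this would fail fast with a clear message when neither form is 
available instead of hitting "command not found" mid-test.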

> Wordcount on Docker test failed on azure
> 
>
> Key: FLINK-34998
> URL: https://issues.apache.org/jira/browse/FLINK-34998
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Major
>
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_docker_embedded_job.sh:
>  line 65: docker-compose: command not found
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_docker_embedded_job.sh:
>  line 66: docker-compose: command not found
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_docker_embedded_job.sh:
>  line 67: docker-compose: command not found
> sort: cannot read: 
> '/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-24250435151/out/docker_wc_out*':
>  No such file or directory
> Apr 03 02:08:14 FAIL WordCount: Output hash mismatch.  Got 
> d41d8cd98f00b204e9800998ecf8427e, expected 0e5bd0a3dd7d5a7110aa85ff70adb54b.
> Apr 03 02:08:14 head hexdump of actual:
> head: cannot open 
> '/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-24250435151/out/docker_wc_out*'
>  for reading: No such file or directory
> Apr 03 02:08:14 Stopping job timeout watchdog (with pid=244913)
> Apr 03 02:08:14 [FAIL] Test script contains errors.
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=e9d3d34f-3d15-59f4-0e3e-35067d100dfe=5d91035e-8022-55f2-2d4f-ab121508bf7e=6043



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34997) PyFlink YARN per-job on Docker test failed on azure

2024-04-03 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34997:
--
Description: 
{code}
Apr 03 03:12:37 
==
Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test'
Apr 03 03:12:37 
==
Apr 03 03:12:37 TEST_DATA_DIR: 
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202
Apr 03 03:12:37 Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
Apr 03 03:12:38 Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
Apr 03 03:12:38 Docker version 24.0.9, build 2936816
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: line 
24: docker-compose: command not found
Apr 03 03:12:38 [FAIL] Test script contains errors.
Apr 03 03:12:38 Checking of logs skipped.
Apr 03 03:12:38 
Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 
minutes and 1 seconds! Test exited with exit code 1
{code}

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226

  was:
Apr 03 03:12:37 
==
Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test'
Apr 03 03:12:37 
==
Apr 03 03:12:37 TEST_DATA_DIR: 
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202
Apr 03 03:12:37 Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
Apr 03 03:12:38 Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
Apr 03 03:12:38 Docker version 24.0.9, build 2936816
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: line 
24: docker-compose: command not found
Apr 03 03:12:38 [FAIL] Test script contains errors.
Apr 03 03:12:38 Checking of logs skipped.
Apr 03 03:12:38 
Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 
minutes and 1 seconds! Test exited with exit code 1




https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226


> PyFlink YARN per-job on Docker test failed on azure
> ---
>
> Key: FLINK-34997
> URL: https://issues.apache.org/jira/browse/FLINK-34997
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Major
>
> {code}
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test'
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202
> Apr 03 03:12:37 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Docker version 24.0.9, build 2936816
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: 
> line 24: docker-compose: command not found
> Apr 03 03:12:38 [FAIL] Test script contains errors.
> Apr 03 03:12:38 Checking of logs skipped.
> Apr 03 03:12:38 
> Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 
> minutes and 1 seconds! Test exited with exit code 1
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34997) PyFlink YARN per-job on Docker test failed on azure

2024-04-03 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34997:
--
Priority: Blocker  (was: Major)

> PyFlink YARN per-job on Docker test failed on azure
> ---
>
> Key: FLINK-34997
> URL: https://issues.apache.org/jira/browse/FLINK-34997
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Blocker
>  Labels: test-stability
>
> {code}
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test'
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202
> Apr 03 03:12:37 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Docker version 24.0.9, build 2936816
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: 
> line 24: docker-compose: command not found
> Apr 03 03:12:38 [FAIL] Test script contains errors.
> Apr 03 03:12:38 Checking of logs skipped.
> Apr 03 03:12:38 
> Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 
> minutes and 1 seconds! Test exited with exit code 1
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34643) JobIDLoggingITCase failed

2024-04-03 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833414#comment-17833414
 ] 

Matthias Pohl commented on FLINK-34643:
---

I guess reopening the issue would be fine. But for the sake of not putting too 
much into a single ticket, it wouldn't be wrong to create a new ticket and 
link FLINK-34643 as the cause, either. I personally would go for the latter 
option.

> JobIDLoggingITCase failed
> -
>
> Key: FLINK-34643
> URL: https://issues.apache.org/jira/browse/FLINK-34643
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.20.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=8fd9202e-fd17-5b26-353c-ac1ff76c8f28=ea7cf968-e585-52cb-e0fc-f48de023a7ca=7897
> {code}
> Mar 09 01:24:23 01:24:23.498 [ERROR] Tests run: 1, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 4.209 s <<< FAILURE! -- in 
> org.apache.flink.test.misc.JobIDLoggingITCase
> Mar 09 01:24:23 01:24:23.498 [ERROR] 
> org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(ClusterClient) 
> -- Time elapsed: 1.459 s <<< ERROR!
> Mar 09 01:24:23 java.lang.IllegalStateException: Too few log events recorded 
> for org.apache.flink.runtime.jobmaster.JobMaster (12) - this must be a bug in 
> the test code
> Mar 09 01:24:23   at 
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:215)
> Mar 09 01:24:23   at 
> org.apache.flink.test.misc.JobIDLoggingITCase.assertJobIDPresent(JobIDLoggingITCase.java:148)
> Mar 09 01:24:23   at 
> org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(JobIDLoggingITCase.java:132)
> Mar 09 01:24:23   at java.lang.reflect.Method.invoke(Method.java:498)
> Mar 09 01:24:23   at 
> java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
> Mar 09 01:24:23 
> {code}
> The other test failures of this build were also caused by the same test:
> * 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=2c3cbe13-dee0-5837-cf47-3053da9a8a78=b78d9d30-509a-5cea-1fef-db7abaa325ae=8349
> * 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=a596f69e-60d2-5a4b-7d39-dc69e4cdaed3=712ade8c-ca16-5b76-3acd-14df33bc1cb1=8209



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833154#comment-17833154
 ] 

Matthias Pohl edited comment on FLINK-34989 at 4/2/24 12:18 PM:


This Jira issue is about adding job concurrency support. Ideally, we should 
make it easily configurable and set it to a concurrency level of no more than 
20, as requested by Apache Infra. This affects the nightly builds, which run 
per branch with 5 different test profiles, each test profile occupying 11 
runners (10 stages + a short-running license check) in parallel.

Generally, we should make CI be more selective anyway. Apache Infra constantly 
criticizes projects for running heavy-load CI on changes like simple doc 
changes (see [here|https://infra.apache.org/github-actions-secrets.html]).


was (Author: mapohl):
This Jira issue is about adding job concurrency support. Ideally, we should 
make it easily configurable and set it to a concurrency level of no more than 
20, as requested by Apache Infra. This affects the nightly builds, which run 
per branch with 5 different test profiles, each test profile occupying 11 
runners (10 stages + a short-running license check) in parallel.

Generally, we should make CI be more selective anyway. Apache Infra constantly 
criticizes projects to run heavy-load CI for things like simple doc changes.

> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions=flink=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to misconfigurations in their builds, some projects have been using 
> unsupportable numbers of GitHub Actions. As part of fixing this situation, 
> Infra has added a 'resource use' section to the policy on GitHub Actions. 
> This section of the policy will come into effect on April 20, 2024:
> All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> Projects whose builds consistently cross the maximum use limits will lose 
> their access to GitHub Actions until they fix their build configurations.
> The full policy is at  
> https://infra.apache.org/github-actions-policy.html.
> {quote}
> Currently (last week of March 2024) Flink was ranked at #19 of projects that 
> used the Apache Infra runner resources the most which doesn't seem too bad. 
> This contained not only Apache Flink but also the Kubernetes operator, 
> connectors and other resources. According to [this 
> source|https://infra.apache.org/github-actions-secrets.html] Apache Infra 
> manages 180 runners right now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34989:
--
Description: 
The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights) This project-specific report can only be modified 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)

There was a policy change announced recently:
{quote}
Policy change on use of GitHub Actions

Due to misconfigurations in their builds, some projects have been using 
unsupportable numbers of GitHub Actions. As part of fixing this situation, 
Infra has added a 'resource use' section to the policy on GitHub Actions. 
This section of the policy will come into effect on April 20, 2024:

All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
hours).
The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).
Projects whose builds consistently cross the maximum use limits will lose 
their access to GitHub Actions until they fix their build configurations.
The full policy is at  
https://infra.apache.org/github-actions-policy.html.
{quote}

Currently (last week of March 2024) Flink was ranked at #19 of projects that 
used the Apache Infra runner resources the most which doesn't seem too bad. 
This contained not only Apache Flink but also the Kubernetes operator, 
connectors and other resources. According to [this 
source|https://infra.apache.org/github-actions-secrets.html] Apache Infra 
manages 180 runners right now.

  was:
The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights) This project-specific report can only be modified 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)

There was a policy change announced recently:
{quote}
Policy change on use of GitHub Actions

Due to misconfigurations in their builds, some projects have been using 
unsupportable numbers of GitHub Actions. As part of fixing this situation, 
Infra has added a 'resource use' section to the policy on GitHub Actions. 
This section of the policy will come into effect on April 20, 2024:

All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
hours).
The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).
Projects whose builds consistently cross the maximum use limits will lose 
their access to GitHub Actions until they fix their build configurations.
The full policy is at  
https://infra.apache.org/github-actions-policy.html.
{quote}

Currently (last week of March 2024) Flink was ranked at #19 of projects that 
used the Apache Infra runner resources the most which doesn't seem too bad. 
This contained not only Apache Flink but also the Kubernetes operator, 
connectors and other resources.


> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> 

[jira] [Commented] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833155#comment-17833155
 ] 

Matthias Pohl commented on FLINK-34989:
---

For this issue, we should keep in mind that it only affects the non-ephemeral 
runners. FLINK-34331 works on enabling ephemeral runners for Apache Flink. 
Ephemeral runners would allow project-specific runner donations, i.e. someone 
could donate hardware so that Flink has its own runners and doesn't need to 
worry too much about blocking other projects with CI.

> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions=flink=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to misconfigurations in their builds, some projects have been using 
> unsupportable numbers of GitHub Actions. As part of fixing this situation, 
> Infra has added a 'resource use' section to the policy on GitHub Actions. 
> This section of the policy will come into effect on April 20, 2024:
> All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> Projects whose builds consistently cross the maximum use limits will lose 
> their access to GitHub Actions until they fix their build configurations.
> The full policy is at  
> https://infra.apache.org/github-actions-policy.html.
> {quote}
> Currently (last week of March 2024) Flink was ranked at #19 of projects that 
> used the Apache Infra runner resources the most which doesn't seem too bad. 
> This contained not only Apache Flink but also the Kubernetes operator, 
> connectors and other resources.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34331) Enable Apache INFRA ephemeral runners for nightly builds

2024-04-02 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34331:
--
Summary: Enable Apache INFRA ephemeral runners for nightly builds  (was: 
Enable Apache INFRA runners for nightly builds)

> Enable Apache INFRA ephemeral runners for nightly builds
> 
>
> Key: FLINK-34331
> URL: https://issues.apache.org/jira/browse/FLINK-34331
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> The nightly CI is currently still utilizing the GitHub runners. We want to 
> switch to Apache INFRA's ephemeral runners (see 
> [docs|https://cwiki.apache.org/confluence/display/INFRA/ASF+Infra+provided+self-hosted+runners]).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833154#comment-17833154
 ] 

Matthias Pohl commented on FLINK-34989:
---

This Jira issue is about adding job concurrency support. Ideally, we should 
make it easily configurable and set it to a concurrency level of no more than 
20, as requested by Apache Infra. This affects the nightly builds, which run 
per branch with 5 different test profiles, each test profile occupying 11 
runners (10 stages + a short-running license check) in parallel.

Generally, we should make CI more selective anyway. Apache Infra constantly 
criticizes projects for running heavy-load CI on changes like simple doc 
changes.
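
For illustration, one possible shape of such a cap in a GitHub Actions 
workflow uses {{strategy.max-parallel}}. This is a hypothetical fragment, not 
Flink's actual CI config; the profile names and script path are made up:

```yaml
# Hypothetical workflow fragment: bound the number of matrix jobs that run
# at the same time so the workflow stays within Infra's limit of 20
# (ideally 15) concurrent jobs.
jobs:
  nightly:
    strategy:
      max-parallel: 15          # hard Infra limit is 20; aim lower
      fail-fast: false
      matrix:
        profile: [default, adaptive-scheduler, java11, java17, java21]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./tools/ci/run_profile.sh ${{ matrix.profile }}
```

Note that {{max-parallel}} bounds simultaneous matrix jobs within one workflow 
run; queueing or cancelling separate runs is controlled by the workflow-level 
{{concurrency}} setting instead.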

> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions=flink=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to misconfigurations in their builds, some projects have been using 
> unsupportable numbers of GitHub Actions. As part of fixing this situation, 
> Infra has added a 'resource use' section to the policy on GitHub Actions. 
> This section of the policy will come into effect on April 20, 2024:
> All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> Projects whose builds consistently cross the maximum use limits will lose 
> their access to GitHub Actions until they fix their build configurations.
> The full policy is at  
> https://infra.apache.org/github-actions-policy.html.
> {quote}
> Currently (last week of March 2024) Flink was ranked at #19 of projects that 
> used the Apache Infra runner resources the most which doesn't seem too bad. 
> This contained not only Apache Flink but also the Kubernetes operator, 
> connectors and other resources.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833153#comment-17833153
 ] 

Matthias Pohl commented on FLINK-34989:
---

Here's a summary of the requirements and whether we meet them based on the 
most recent report:

|| Requirement || Flink CI ||
| Job concurrency level of at most 20 (ideally 15) | (n) |
| Do not exceed the equivalent of 25 full-time runners, i.e. 4,200 hours per 
calendar week | (y) |
| Do not exceed the equivalent of 30 full-time runners, i.e. 3,600 hours in 
any consecutive five-day period | (y) |
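
As a side note, the hour figures in the quoted policy can be sanity-checked 
with a quick calculation (the policy text rounds the weekly figure to 250,000 
minutes; the exact product is 252,000):

```python
# Sanity check of the runner budgets quoted in the Infra policy.
MINUTES_PER_DAY = 24 * 60

# 25 full-time runners over a 7-day calendar week.
weekly_minutes = 25 * 7 * MINUTES_PER_DAY
# 30 full-time runners over any consecutive 5-day window.
five_day_minutes = 30 * 5 * MINUTES_PER_DAY

print(weekly_minutes, weekly_minutes // 60)      # 252000 4200
print(five_day_minutes, five_day_minutes // 60)  # 216000 3600
```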


> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions=flink=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to misconfigurations in their builds, some projects have been using 
> unsupportable numbers of GitHub Actions. As part of fixing this situation, 
> Infra has added a 'resource use' section to the policy on GitHub Actions. 
> This section of the policy will come into effect on April 20, 2024:
> All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> Projects whose builds consistently cross the maximum use limits will lose 
> their access to GitHub Actions until they fix their build configurations.
> The full policy is at  
> https://infra.apache.org/github-actions-policy.html.
> {quote}
> Currently (last week of March 2024) Flink was ranked at #19 of projects that 
> used the Apache Infra runner resources the most which doesn't seem too bad. 
> This contained not only Apache Flink but also the Kubernetes operator, 
> connectors and other resources.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34989:
--
Description: 
The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights) This project-specific report can only be modified 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)

There was a policy change announced recently:
{quote}
Policy change on use of GitHub Actions

Due to misconfigurations in their builds, some projects have been using 
unsupportable numbers of GitHub Actions. As part of fixing this situation, 
Infra has added a 'resource use' section to the policy on GitHub Actions. 
This section of the policy will come into effect on April 20, 2024:

All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
hours).
The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).
Projects whose builds consistently cross the maximum use limits will lose 
their access to GitHub Actions until they fix their build configurations.
The full policy is at  
https://infra.apache.org/github-actions-policy.html.
{quote}

As of the last week of March 2024, Flink ranked #19 among the projects using 
Apache Infra's runner resources the most, which doesn't seem too bad. Note that 
this figure covers not only Apache Flink itself but also the Kubernetes 
operator, the connectors, and other repositories.
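The budget limits in the quoted policy translate into simple arithmetic checks. The sketch below is purely illustrative (it is not part of any ASF or Flink tooling; the function name and rule strings are made up) and shows how a project could test its usage numbers against those thresholds:

```python
# Hypothetical helper (not ASF tooling): checks GitHub Actions usage
# against the Apache Infra policy limits quoted above.

WEEKLY_LIMIT_MINUTES = 250_000    # weekly budget: ~25 full-time runners
FIVE_DAY_LIMIT_MINUTES = 216_000  # any consecutive five-day period
MAX_CONCURRENT_JOBS = 20          # hard cap on jobs per workflow
RECOMMENDED_CONCURRENT_JOBS = 15  # soft cap per workflow

def violates_policy(weekly_minutes, five_day_minutes, max_jobs_per_workflow):
    """Return the list of policy rules the given usage numbers break."""
    violations = []
    if weekly_minutes > WEEKLY_LIMIT_MINUTES:
        violations.append("weekly runner-minute budget exceeded")
    if five_day_minutes > FIVE_DAY_LIMIT_MINUTES:
        violations.append("five-day runner-minute budget exceeded")
    if max_jobs_per_workflow > MAX_CONCURRENT_JOBS:
        violations.append("workflow job concurrency above hard limit of 20")
    return violations
```

A usage report that stays under all three limits yields an empty list; anything else names the rules that were crossed.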

  was:
The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights) This project-specific report can only be modified 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)

There was a policy change announced recently:
{quote}
Policy change on use of GitHub Actions

Due to misconfigurations in their builds, some projects have been using 
unsupportable numbers of GitHub Actions. As part of fixing this situation, 
Infra has added a 'resource use' section to the policy on GitHub Actions. 
This section of the policy will come into effect on April 20, 2024:

All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
hours).
The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).
Projects whose builds consistently cross the maximum use limits will lose 
their access to GitHub Actions until they fix their build configurations.
The full policy is at  
https://infra.apache.org/github-actions-policy.html.
{quote}


> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions=flink=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to 

[jira] [Updated] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34989:
--
Description: 
The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights) This project-specific report can only be modified 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)

There was a policy change announced recently:
{quote}
Policy change on use of GitHub Actions

Due to misconfigurations in their builds, some projects have been using 
unsupportable numbers of GitHub Actions. As part of fixing this situation, 
Infra has added a 'resource use' section to the policy on GitHub Actions. 
This section of the policy will come into effect on April 20, 2024:

All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
hours).
The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).
Projects whose builds consistently cross the maximum use limits will lose 
their access to GitHub Actions until they fix their build configurations.
The full policy is at  
https://infra.apache.org/github-actions-policy.html.
{quote}

  was:
The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights) This project-specific report can only be modified 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)


> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions=flink=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to misconfigurations in their builds, some projects have been using 
> unsupportable numbers of GitHub Actions. As part of fixing this situation, 
> Infra has added a 'resource use' section to the policy on GitHub Actions. 
> This section of the policy will come into effect on April 20, 2024:
> All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> Projects whose builds consistently cross the maximum use limits will lose 
> their access to GitHub Actions until they fix their build configurations.
> The full policy is at  
> https://infra.apache.org/github-actions-policy.html.
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34937) Apache Infra GHA policy update

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833149#comment-17833149
 ] 

Matthias Pohl commented on FLINK-34937:
---

I moved the runner usage discussion into FLINK-34989

> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> There is a policy update [announced in the infra 
> ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
> asked Apache projects to limit the number of runners per job. Additionally, 
> the [GHA policy|https://infra.apache.org/github-actions-policy.html] is 
> referenced which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.





[jira] [Created] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34989:
-

 Summary: Apache Infra requests to reduce the runner usage for a 
project
 Key: FLINK-34989
 URL: https://issues.apache.org/jira/browse/FLINK-34989
 Project: Flink
  Issue Type: Sub-task
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights) This project-specific report can only be modified 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)





[jira] [Commented] (FLINK-34427) FineGrainedSlotManagerTest fails fatally (exit code 239)

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833098#comment-17833098
 ] 

Matthias Pohl commented on FLINK-34427:
---

Copied over from FLINK-33416:
* https://github.com/XComp/flink/actions/runs/6472726326/job/17575765131
* 1.19: 
https://github.com/apache/flink/actions/runs/8467681781/job/23199435037#step:10:8909

> FineGrainedSlotManagerTest fails fatally (exit code 239)
> 
>
> Key: FLINK-34427
> URL: https://issues.apache.org/jira/browse/FLINK-34427
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Critical
>  Labels: pull-request-available, test-stability
>
> https://github.com/apache/flink/actions/runs/7866453350/job/21460921911#step:10:8959
> {code}
> Error: 02:28:53 02:28:53.220 [ERROR] Process Exit Code: 239
> Error: 02:28:53 02:28:53.220 [ERROR] Crashed tests:
> Error: 02:28:53 02:28:53.220 [ERROR] 
> org.apache.flink.runtime.resourcemanager.ResourceManagerTaskExecutorTest
> Error: 02:28:53 02:28:53.220 [ERROR] 
> org.apache.maven.surefire.booter.SurefireBooterForkException: 
> ExecutionException The forked VM terminated without properly saying goodbye. 
> VM crash or System.exit called?
> Error: 02:28:53 02:28:53.220 [ERROR] Command was /bin/sh -c cd 
> '/root/flink/flink-runtime' && 
> '/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java' '-XX:+UseG1GC' '-Xms256m' 
> '-XX:+IgnoreUnrecognizedVMOptions' 
> '--add-opens=java.base/java.util=ALL-UNNAMED' 
> '--add-opens=java.base/java.lang=ALL-UNNAMED' 
> '--add-opens=java.base/java.net=ALL-UNNAMED' 
> '--add-opens=java.base/java.io=ALL-UNNAMED' 
> '--add-opens=java.base/java.util.concurrent=ALL-UNNAMED' '-Xmx768m' '-jar' 
> '/root/flink/flink-runtime/target/surefire/surefirebooter-20240212022332296_94.jar'
>  '/root/flink/flink-runtime/target/surefire' 
> '2024-02-12T02-21-39_495-jvmRun3' 'surefire-20240212022332296_88tmp' 
> 'surefire_26-20240212022332296_91tmp'
> Error: 02:28:53 02:28:53.220 [ERROR] Error occurred in starting fork, check 
> output in log
> Error: 02:28:53 02:28:53.220 [ERROR] Process Exit Code: 239
> Error: 02:28:53 02:28:53.220 [ERROR] Crashed tests:
> Error: 02:28:53 02:28:53.221 [ERROR] 
> org.apache.flink.runtime.resourcemanager.ResourceManagerTaskExecutorTest
> Error: 02:28:53 02:28:53.221 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:456)
> [...]
> {code}
> The fatal error is triggered most likely within the 
> {{FineGrainedSlotManagerTest}}:
> {code}
> 02:26:39,362 [   pool-643-thread-1] ERROR 
> org.apache.flink.util.FatalExitExceptionHandler  [] - FATAL: 
> Thread 'pool-643-thread-1' produced an uncaught exception. Stopping the 
> process...
> java.util.concurrent.CompletionException: 
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@4bbc0b10 
> rejected from 
> java.util.concurrent.ScheduledThreadPoolExecutor@7a45cd9a[Shutting down, pool 
> size = 1, active threads = 1, queued tasks = 1, completed tasks = 194]
> at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
>  ~[?:1.8.0_392]
> at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
>  ~[?:1.8.0_392]
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:838) 
> ~[?:1.8.0_392]
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
>  ~[?:1.8.0_392]
> at 
> java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:851)
>  ~[?:1.8.0_392]
> at 
> java.util.concurrent.CompletableFuture.handleAsync(CompletableFuture.java:2178)
>  ~[?:1.8.0_392]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer.allocateSlot(DefaultSlotStatusSyncer.java:138)
>  ~[classes/:?]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.allocateSlotsAccordingTo(FineGrainedSlotManager.java:722)
>  ~[classes/:?]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.checkResourceRequirements(FineGrainedSlotManager.java:645)
>  ~[classes/:?]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.lambda$null$12(FineGrainedSlotManager.java:603)
>  ~[classes/:?]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [?:1.8.0_392]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> [?:1.8.0_392]
> at 
> 

[jira] [Closed] (FLINK-33416) FineGrainedSlotManagerTest failed with fatal error

2024-04-02 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl closed FLINK-33416.
-
Resolution: Duplicate

This issue is addressed in FLINK-34427. I'm closing FLINK-33416 in favor of 
FLINK-34427 because the investigation happened there.

> FineGrainedSlotManagerTest failed with fatal error
> --
>
> Key: FLINK-33416
> URL: https://issues.apache.org/jira/browse/FLINK-33416
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: github-actions, test-stability
>
> In FLINK-33245, we reported an error of the 
> {{ZooKeeperLeaderElectionConnectionHandlingTest}} failure due to a fatal 
> error. The corresponding build is [this 
> one|https://github.com/XComp/flink/actions/runs/6472726326/job/17575765131].
> But the stacktrace indicates that it's actually 
> {{FineGrainedSlotManagerTest}} which ran before the ZK-related test:
> {code}
> Test 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest.testSlotAllocationAccordingToStrategyResult[testSlotAllocationAccordingToStrategyResult()]
>  successfully run.
> 
> 19:30:11,463 [   pool-752-thread-1] ERROR 
> org.apache.flink.util.FatalExitExceptionHandler  [] - FATAL: 
> Thread 'pool-752-thread-1' produced an uncaught exception. Stopping the 
> process...
> java.util.concurrent.CompletionException: 
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1201ef67[Not
>  completed, task = 
> java.util.concurrent.Executors$RunnableAdapter@1ea6ccfa[Wrapped task = 
> java.util.concurrent.CompletableFuture$UniHandle@36f84d94]] rejected from 
> java.util.concurrent.ScheduledThreadPoolExecutor@4642c78d[Shutting down, pool 
> size = 1, active threads = 1, queued tasks = 1, completed tasks = 194]
> at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
>  ~[?:?]
> at 
> java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:951)
>  ~[?:?]
> at 
> java.util.concurrent.CompletableFuture.handleAsync(CompletableFuture.java:2276)
>  ~[?:?]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer.allocateSlot(DefaultSlotStatusSyncer.java:138)
>  ~[classes/:?]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.allocateSlotsAccordingTo(FineGrainedSlotManager.java:722)
>  ~[classes/:?]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.checkResourceRequirements(FineGrainedSlotManager.java:645)
>  ~[classes/:?]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.lambda$checkResourceRequirementsWithDelay$12(FineGrainedSlotManager.java:603)
>  ~[classes/:?]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
> at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>  [?:?]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
> at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1201ef67[Not
>  completed, task = 
> java.util.concurrent.Executors$RunnableAdapter@1ea6ccfa[Wrapped task = 
> java.util.concurrent.CompletableFuture$UniHandle@36f84d94]] rejected from 
> java.util.concurrent.ScheduledThreadPoolExecutor@4642c78d[Shutting down, pool 
> size = 1, active threads = 1, queued tasks = 1, completed tasks = 194]
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
>  ~[?:?]
> at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) 
> ~[?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
>  ~[?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
>  ~[?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:705)
>  ~[?:?]
> at 
> java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:687)
>  ~[?:?]
> at 
> 
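Both stack traces above show the same failure mode: a delayed `checkResourceRequirements` task is handed to a `ScheduledThreadPoolExecutor` that is already shutting down, so the submission is rejected with a `RejectedExecutionException` instead of being silently dropped. A minimal Python analogue (an illustrative sketch, not Flink code) of that reject-on-shutdown behavior:

```python
# Illustrative only: Python's ThreadPoolExecutor, like Java's
# ScheduledThreadPoolExecutor, rejects tasks submitted after shutdown.
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
executor.submit(lambda: None)      # accepted while the pool is running
executor.shutdown(wait=True)       # analogous to the "Shutting down" pool state

try:
    executor.submit(lambda: None)  # rejected, like RejectedExecutionException
    rejected = False
except RuntimeError:               # Python's counterpart to the Java exception
    rejected = True
```

In the test, the fatal error arises because the rejection surfaces on a pool thread with no handler, so `FatalExitExceptionHandler` kills the fork (exit code 239).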

[jira] [Comment Edited] (FLINK-34988) Class loading issues in JDK17 and JDK21

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833095#comment-17833095
 ] 

Matthias Pohl edited comment on FLINK-34988 at 4/2/24 10:07 AM:


It's most likely caused by FLINK-34548 based on the git history between the 
most recent successful nightly run on master 
[20240331.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58645=results]
 (based on {{3841f062}}) and 
[20240402.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=results]
 (based on {{d271495c}}):
{code}
$ git log 3841f062..d271495c --oneline
d271495c5be [hotfix] Fix compile error in DataStreamV2SinkTransformation
28762497bdf [FLINK-34548][API] Supports sink-v2 Sink
056660e0b69 [FLINK-34548][API] Supports FLIP-27 Source
ceafa5a5705 [FLINK-34548][API] Implement datastream
4f71c5b4660 [FLINK-34548][API] Implement process function's underlying operators
e1147ca7e39 [FLINK-34548][API] Introduce ExecutionEnvironment
9fa74a8a706 [FLINK-34548][API] Introduce stream interface and move KeySelector 
to flink-core-api
cedbcce6eff [FLINK-34548][API] Introduce variants of ProcessFunction
13cfaa76b5e [FLINK-34548][API] Introduce ProcessFunction and RuntimeContext 
related interfaces
13790e03207 [FLINK-34548][API] Move Function interface to flink-core-api
59525e460af [FLINK-34548][API] Create flink-core-api module and let flink-core 
depend on it
5b2e923be0a [FLINK-34548][API] Initialize the datastream v2 related modules
{code}


was (Author: mapohl):
It's most likely caused by FLINK-34548 based on the git history between the 
most recent successful nightly run on master 
[20240331.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58645=results]
 (based on {{3841f062}}) and 
[20240402.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=results]
 (based on {{d271495c}}):
{code}
$ git log 3841f062..d271495c5be34f4e4a518207ca7716f4e8907e5f --oneline
d271495c5be [hotfix] Fix compile error in DataStreamV2SinkTransformation
28762497bdf [FLINK-34548][API] Supports sink-v2 Sink
056660e0b69 [FLINK-34548][API] Supports FLIP-27 Source
ceafa5a5705 [FLINK-34548][API] Implement datastream
4f71c5b4660 [FLINK-34548][API] Implement process function's underlying operators
e1147ca7e39 [FLINK-34548][API] Introduce ExecutionEnvironment
9fa74a8a706 [FLINK-34548][API] Introduce stream interface and move KeySelector 
to flink-core-api
cedbcce6eff [FLINK-34548][API] Introduce variants of ProcessFunction
13cfaa76b5e [FLINK-34548][API] Introduce ProcessFunction and RuntimeContext 
related interfaces
13790e03207 [FLINK-34548][API] Move Function interface to flink-core-api
59525e460af [FLINK-34548][API] Create flink-core-api module and let flink-core 
depend on it
5b2e923be0a [FLINK-34548][API] Initialize the datastream v2 related modules
{code}

> Class loading issues in JDK17 and JDK21
> ---
>
> Key: FLINK-34988
> URL: https://issues.apache.org/jira/browse/FLINK-34988
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: test-stability
>
> * JDK 17 (core; NoClassDefFoundError caused by ExceptionInInitializeError): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=675bf62c-8558-587e-2555-dcad13acefb5=5878eed3-cc1e-5b12-1ed0-9e7139ce0992=12942
> * JDK 17 (misc; ExceptionInInitializeError): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d871f0ce-7328-5d00-023b-e7391f5801c8=77cbea27-feb9-5cf5-53f7-3267f9f9c6b6=22548
> * JDK 21 (core; same as above): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d06b80b4-9e88-5d40-12a2-18072cf60528=609ecd5a-3f6e-5d0c-2239-2096b155a4d0=12963
> * JDK 21 (misc; same as above): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=59a2b95a-736b-5c46-b3e0-cee6e587fd86=c301da75-e699-5c06-735f-778207c16f50=22506





[jira] [Commented] (FLINK-34988) Class loading issues in JDK17 and JDK21

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833095#comment-17833095
 ] 

Matthias Pohl commented on FLINK-34988:
---

It's most likely caused by FLINK-34548 based on the git history between the 
most recent successful nightly run on master 
[20240331.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58645=results]
 (based on {{3841f062}}) and 
[20240402.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=results]
 (based on {{d271495c}}):
{code}
$ git log 3841f062..d271495c5be34f4e4a518207ca7716f4e8907e5f --oneline
d271495c5be [hotfix] Fix compile error in DataStreamV2SinkTransformation
28762497bdf [FLINK-34548][API] Supports sink-v2 Sink
056660e0b69 [FLINK-34548][API] Supports FLIP-27 Source
ceafa5a5705 [FLINK-34548][API] Implement datastream
4f71c5b4660 [FLINK-34548][API] Implement process function's underlying operators
e1147ca7e39 [FLINK-34548][API] Introduce ExecutionEnvironment
9fa74a8a706 [FLINK-34548][API] Introduce stream interface and move KeySelector 
to flink-core-api
cedbcce6eff [FLINK-34548][API] Introduce variants of ProcessFunction
13cfaa76b5e [FLINK-34548][API] Introduce ProcessFunction and RuntimeContext 
related interfaces
13790e03207 [FLINK-34548][API] Move Function interface to flink-core-api
59525e460af [FLINK-34548][API] Create flink-core-api module and let flink-core 
depend on it
5b2e923be0a [FLINK-34548][API] Initialize the datastream v2 related modules
{code}

> Class loading issues in JDK17 and JDK21
> ---
>
> Key: FLINK-34988
> URL: https://issues.apache.org/jira/browse/FLINK-34988
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: test-stability
>
> * JDK 17 (core; NoClassDefFoundError caused by ExceptionInInitializeError): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=675bf62c-8558-587e-2555-dcad13acefb5=5878eed3-cc1e-5b12-1ed0-9e7139ce0992=12942
> * JDK 17 (misc; ExceptionInInitializeError): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d871f0ce-7328-5d00-023b-e7391f5801c8=77cbea27-feb9-5cf5-53f7-3267f9f9c6b6=22548
> * JDK 21 (core; same as above): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d06b80b4-9e88-5d40-12a2-18072cf60528=609ecd5a-3f6e-5d0c-2239-2096b155a4d0=12963
> * JDK 21 (misc; same as above): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=59a2b95a-736b-5c46-b3e0-cee6e587fd86=c301da75-e699-5c06-735f-778207c16f50=22506





[jira] [Created] (FLINK-34988) Class loading issues in JDK17 and JDK21

2024-04-02 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34988:
-

 Summary: Class loading issues in JDK17 and JDK21
 Key: FLINK-34988
 URL: https://issues.apache.org/jira/browse/FLINK-34988
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream
Affects Versions: 1.20.0
Reporter: Matthias Pohl


* JDK 17 (core; NoClassDefFoundError caused by ExceptionInInitializeError): 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=675bf62c-8558-587e-2555-dcad13acefb5=5878eed3-cc1e-5b12-1ed0-9e7139ce0992=12942
* JDK 17 (misc; ExceptionInInitializeError): 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d871f0ce-7328-5d00-023b-e7391f5801c8=77cbea27-feb9-5cf5-53f7-3267f9f9c6b6=22548
* JDK 21 (core; same as above): 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d06b80b4-9e88-5d40-12a2-18072cf60528=609ecd5a-3f6e-5d0c-2239-2096b155a4d0=12963
* JDK 21 (misc; same as above): 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=59a2b95a-736b-5c46-b3e0-cee6e587fd86=c301da75-e699-5c06-735f-778207c16f50=22506





[jira] [Updated] (FLINK-33816) SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due to async checkpoint triggering not being completed

2024-04-02 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-33816:
--
Fix Version/s: 1.19.1

> SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due to 
> async checkpoint triggering not being completed 
> -
>
> Key: FLINK-33816
> URL: https://issues.apache.org/jira/browse/FLINK-33816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: Matthias Pohl
>Assignee: jiabao.sun
>Priority: Major
>  Labels: github-actions, pull-request-available, test-stability
> Fix For: 1.20.0, 1.19.1
>
> Attachments: screenshot-1.png
>
>
> [https://github.com/XComp/flink/actions/runs/7182604625/job/19559947894#step:12:9430]
> {code:java}
> Error: 14:39:01 14:39:01.930 [ERROR] Tests run: 16, Failures: 1, Errors: 0, 
> Skipped: 0, Time elapsed: 1.878 s <<< FAILURE! - in 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest
> 9426Error: 14:39:01 14:39:01.930 [ERROR] 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain
>   Time elapsed: 0.034 s  <<< FAILURE!
> 9427Dec 12 14:39:01 org.opentest4j.AssertionFailedError: 
> 9428Dec 12 14:39:01 
> 9429Dec 12 14:39:01 Expecting value to be true but was false
> 9430Dec 12 14:39:01   at 
> java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
> 9431Dec 12 14:39:01   at 
> java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
> 9432Dec 12 14:39:01   at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain(SourceStreamTaskTest.java:710)
> 9433Dec 12 14:39:01   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
> 9434Dec 12 14:39:01   at 
> java.base/java.lang.reflect.Method.invoke(Method.java:580)
> [...] {code}





[jira] [Commented] (FLINK-33816) SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due to async checkpoint triggering not being completed

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833056#comment-17833056
 ] 

Matthias Pohl commented on FLINK-33816:
---

master: 
[5aebb04b3055fbec6a74eaf4226c4a88d3fd2d6e|https://github.com/apache/flink/commit/5aebb04b3055fbec6a74eaf4226c4a88d3fd2d6e]
1.19: 
[ece4faee055b3797b39e9c0b55f3e94a3db2f912|https://github.com/apache/flink/commit/ece4faee055b3797b39e9c0b55f3e94a3db2f912]

> SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due to 
> async checkpoint triggering not being completed 
> -
>
> Key: FLINK-33816
> URL: https://issues.apache.org/jira/browse/FLINK-33816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: Matthias Pohl
>Assignee: jiabao.sun
>Priority: Major
>  Labels: github-actions, pull-request-available, test-stability
> Fix For: 1.20.0
>
> Attachments: screenshot-1.png
>
>
> [https://github.com/XComp/flink/actions/runs/7182604625/job/19559947894#step:12:9430]
> {code:java}
> Error: 14:39:01 14:39:01.930 [ERROR] Tests run: 16, Failures: 1, Errors: 0, 
> Skipped: 0, Time elapsed: 1.878 s <<< FAILURE! - in 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest
> 9426Error: 14:39:01 14:39:01.930 [ERROR] 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain
>   Time elapsed: 0.034 s  <<< FAILURE!
> 9427Dec 12 14:39:01 org.opentest4j.AssertionFailedError: 
> 9428Dec 12 14:39:01 
> 9429Dec 12 14:39:01 Expecting value to be true but was false
> 9430Dec 12 14:39:01   at 
> java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
> 9431Dec 12 14:39:01   at 
> java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
> 9432Dec 12 14:39:01   at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain(SourceStreamTaskTest.java:710)
> 9433Dec 12 14:39:01   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
> 9434Dec 12 14:39:01   at 
> java.base/java.lang.reflect.Method.invoke(Method.java:580)
> [...] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34953) Add github ci for flink-web to auto commit build files

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833036#comment-17833036
 ] 

Matthias Pohl commented on FLINK-34953:
---

Hi [~gongzhongqiang], it sounds like we have already reached consensus on this 
matter. But you could bring it up in the dev ML before going ahead with this 
ticket, to check whether there are any objections against this approach and to 
have proper backing from the community.

> Add github ci for flink-web to auto commit build files
> --
>
> Key: FLINK-34953
> URL: https://issues.apache.org/jira/browse/FLINK-34953
> Project: Flink
>  Issue Type: Improvement
>  Components: Project Website
>Reporter: Zhongqiang Gong
>Priority: Minor
>  Labels: website
>
> Currently, build files for https://github.com/apache/flink-web are committed 
> from local builds. So I want to use GitHub CI to build the docs and commit them.
>  
> Changes:
>  * Add a website build check for PRs
>  * Automatically build and commit the build files after a PR is merged to `asf-site`
>  * Optional: this CI can also be triggered manually
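
The proposed changes could be sketched as a single workflow roughly like the 
following (a hypothetical sketch only; the workflow name, build command, and 
committed directory are assumptions, not the actual flink-web setup):

```yaml
# Hypothetical sketch of an auto-build-and-commit workflow for flink-web.
# The build entry point (./build.sh) and output directory are assumptions.
name: Build website
on:
  pull_request:           # build check for PRs
  push:
    branches: [asf-site]  # auto-build and commit after merge
  workflow_dispatch:      # optional manual trigger
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build docs
        run: ./build.sh   # assumed build entry point
      - name: Commit build output
        if: github.event_name != 'pull_request'
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add content/
          git commit -m "Rebuild website" || echo "Nothing to commit"
          git push
```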





[jira] [Updated] (FLINK-34961) GitHub Actions runner statistics can be monitored per workflow name

2024-03-28 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34961:
--
Labels: starter  (was: )

> GitHub Actions runner statistics can be monitored per workflow name
> --
>
> Key: FLINK-34961
> URL: https://issues.apache.org/jira/browse/FLINK-34961
> Project: Flink
>  Issue Type: Improvement
>  Components: Build System / CI
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: starter
>
> Apache Infra allows the monitoring of runner usage per workflow (see [report 
> for 
> Flink|https://infra-reports.apache.org/#ghactions=flink=168=10];
>   only accessible with Apache committer rights). They accumulate the data by 
> workflow name. The Flink space has multiple repositories that use the generic 
> workflow name {{CI}}. That makes the differentiation in the report harder.
> This Jira issue is about identifying all Flink-related projects with a CI 
> workflow (Kubernetes operator and the JDBC connector were identified, for 
> instance) and adding a more distinct name.





[jira] [Created] (FLINK-34961) GitHub Actions statistics can be monitored per workflow name

2024-03-28 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34961:
-

 Summary: GitHub Actions statistics can be monitored per workflow 
name
 Key: FLINK-34961
 URL: https://issues.apache.org/jira/browse/FLINK-34961
 Project: Flink
  Issue Type: Improvement
  Components: Build System / CI
Reporter: Matthias Pohl


Apache Infra allows the monitoring of runner usage per workflow (see [report 
for 
Flink|https://infra-reports.apache.org/#ghactions=flink=168=10];
  only accessible with Apache committer rights). They accumulate the data by 
workflow name. The Flink space has multiple repositories that use the generic 
workflow name {{CI}}. That makes the differentiation in the report harder.

This Jira issue is about identifying all Flink-related projects with a CI 
workflow (Kubernetes operator and the JDBC connector were identified, for 
instance) and adding a more distinct name.





[jira] [Updated] (FLINK-34961) GitHub Actions runner statistics can be monitored per workflow name

2024-03-28 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34961:
--
Summary: GitHub Actions runner statistics can be monitored per workflow name 
 (was: GitHub Actions statistics can be monitored per workflow name)

> GitHub Actions runner statistics can be monitored per workflow name
> --
>
> Key: FLINK-34961
> URL: https://issues.apache.org/jira/browse/FLINK-34961
> Project: Flink
>  Issue Type: Improvement
>  Components: Build System / CI
>Reporter: Matthias Pohl
>Priority: Major
>
> Apache Infra allows the monitoring of runner usage per workflow (see [report 
> for 
> Flink|https://infra-reports.apache.org/#ghactions=flink=168=10];
>   only accessible with Apache committer rights). They accumulate the data by 
> workflow name. The Flink space has multiple repositories that use the generic 
> workflow name {{CI}}. That makes the differentiation in the report harder.
> This Jira issue is about identifying all Flink-related projects with a CI 
> workflow (Kubernetes operator and the JDBC connector were identified, for 
> instance) and adding a more distinct name.





[jira] [Commented] (FLINK-34937) Apache Infra GHA policy update

2024-03-28 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831844#comment-17831844
 ] 

Matthias Pohl commented on FLINK-34937:
---

Looks like Flink ranks 19th in terms of runner minutes used over the past 7 
days:

[Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights)

[Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)

> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> There is a policy update [announced in the infra 
> ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
> asks Apache projects to limit the number of runners per job. Additionally, it 
> references the [GHA policy|https://infra.apache.org/github-actions-policy.html], 
> which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.





[jira] [Resolved] (FLINK-34933) JobMasterServiceLeadershipRunnerTest#testResultFutureCompletionOfOutdatedLeaderIsIgnored isn't implemented properly

2024-03-28 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-34933.
---
Fix Version/s: 1.18.2
   1.20.0
   1.19.1
   Resolution: Fixed

master: 
[1668a07276929416469392a35a77ba7699aac30b|https://github.com/apache/flink/commit/1668a07276929416469392a35a77ba7699aac30b]
1.19: 
[c11656a2406f07e2ae7cd6f80c46afb14385ee0e|https://github.com/apache/flink/commit/c11656a2406f07e2ae7cd6f80c46afb14385ee0e]
1.18: 
[94d1363c27e26fc8313721e138c7b4de744ca69e|https://github.com/apache/flink/commit/94d1363c27e26fc8313721e138c7b4de744ca69e]

> JobMasterServiceLeadershipRunnerTest#testResultFutureCompletionOfOutdatedLeaderIsIgnored
>  isn't implemented properly
> ---
>
> Key: FLINK-34933
> URL: https://issues.apache.org/jira/browse/FLINK-34933
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.17.2, 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> {{testResultFutureCompletionOfOutdatedLeaderIsIgnored}} doesn't test the 
> desired behavior: The {{TestingJobMasterService#closeAsync()}} callback 
> throws an {{UnsupportedOperationException}} by default which prevents the 
> test from properly finalizing the leadership revocation.
> The test still passes because it implicitly checks for this error. 
> Instead, we should verify that the runner's resultFuture doesn't complete 
> until the runner is closed.





[jira] [Resolved] (FLINK-33376) Extend Curator config option for Zookeeper configuration

2024-03-28 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-33376.
---
Fix Version/s: 1.20.0
 Release Note: Adds support for the following curator parameters: 
high-availability.zookeeper.client.authorization (curator parameter: 
authorization), high-availability.zookeeper.client.max-close-wait (curator 
parameter: maxCloseWaitMs), 
high-availability.zookeeper.client.simulated-session-expiration-percent 
(curator parameter: simulatedSessionExpirationPercent)
   Resolution: Fixed

master: 
[83f82ab0c865a4fa9e119c96e11e0fb3df4a5ecd|https://github.com/apache/flink/commit/83f82ab0c865a4fa9e119c96e11e0fb3df4a5ecd]

> Extend Curator config option for Zookeeper configuration
> 
>
> Key: FLINK-33376
> URL: https://issues.apache.org/jira/browse/FLINK-33376
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Reporter: Oleksandr Nitavskyi
>Assignee: Oleksandr Nitavskyi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.20.0
>
>
> In certain cases ZooKeeper requires additional authentication information, 
> for example a list of valid [names for the 
> ensemble|https://zookeeper.apache.org/doc/r3.8.0/zookeeperAdmin.html#:~:text=for%20secure%20authentication.-,zookeeper.ensembleAuthName,-%3A%20(Java%20system%20property]
>  to prevent accidentally connecting to the wrong ensemble.
> Curator allows adding an additional AuthInfo object for such configuration. 
> Thus it would be useful to add one more Map property that allows passing 
> AuthInfo objects during Curator client creation.
> *Acceptance Criteria:* For Flink users it is possible to configure the auth 
> info list for the Curator framework client.
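
For reference, the keys added by this ticket (as listed in the release note 
above) could be set in the Flink configuration along these lines; the values 
below are placeholders, and the exact value syntax (e.g. for the authorization 
list) is an assumption, so check the Flink configuration reference before use:

```yaml
# Keys as listed in the release note; placeholder values.
high-availability.zookeeper.client.authorization: digest:flink:secret
high-availability.zookeeper.client.max-close-wait: 30 s
high-availability.zookeeper.client.simulated-session-expiration-percent: 100
```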





[jira] [Updated] (FLINK-33376) Extend Curator config option for Zookeeper configuration

2024-03-28 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-33376:
--
Release Note: Adds support for the following curator parameters: 
high-availability.zookeeper.client.authorization (corresponding curator 
parameter: authorization), high-availability.zookeeper.client.max-close-wait 
(corresponding curator parameter: maxCloseWaitMs), 
high-availability.zookeeper.client.simulated-session-expiration-percent 
(corresponding curator parameter: simulatedSessionExpirationPercent).  (was: 
Adds support for the following curator parameters: 
high-availability.zookeeper.client.authorization (curator parameter: 
authorization), high-availability.zookeeper.client.max-close-wait (curator 
parameter: maxCloseWaitMs), 
high-availability.zookeeper.client.simulated-session-expiration-percent 
(curator parameter: simulatedSessionExpirationPercent))

> Extend Curator config option for Zookeeper configuration
> 
>
> Key: FLINK-33376
> URL: https://issues.apache.org/jira/browse/FLINK-33376
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Reporter: Oleksandr Nitavskyi
>Assignee: Oleksandr Nitavskyi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.20.0
>
>
> In certain cases ZooKeeper requires additional authentication information, 
> for example a list of valid [names for the 
> ensemble|https://zookeeper.apache.org/doc/r3.8.0/zookeeperAdmin.html#:~:text=for%20secure%20authentication.-,zookeeper.ensembleAuthName,-%3A%20(Java%20system%20property]
>  to prevent accidentally connecting to the wrong ensemble.
> Curator allows adding an additional AuthInfo object for such configuration. 
> Thus it would be useful to add one more Map property that allows passing 
> AuthInfo objects during Curator client creation.
> *Acceptance Criteria:* For Flink users it is possible to configure the auth 
> info list for the Curator framework client.





[jira] [Reopened] (FLINK-34953) Add github ci for flink-web to auto commit build files

2024-03-28 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reopened FLINK-34953:
---

> Add github ci for flink-web to auto commit build files
> --
>
> Key: FLINK-34953
> URL: https://issues.apache.org/jira/browse/FLINK-34953
> Project: Flink
>  Issue Type: Improvement
>  Components: Project Website
>Reporter: Zhongqiang Gong
>Priority: Minor
>  Labels: website
>
> Currently, build files for https://github.com/apache/flink-web are committed 
> from local builds. So I want to use GitHub CI to build the docs and commit them.
>  
> Changes:
>  * Add a website build check for PRs
>  * Automatically build and commit the build files after a PR is merged to `asf-site`
>  * Optional: this CI can also be triggered manually





[jira] [Comment Edited] (FLINK-34953) Add github ci for flink-web to auto commit build files

2024-03-28 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831665#comment-17831665
 ] 

Matthias Pohl edited comment on FLINK-34953 at 3/28/24 9:52 AM:


I guess we could do it. The [GitHub Actions 
Policy|https://infra.apache.org/github-actions-policy.html] excludes 
non-released artifacts like websites from the restriction:
{quote}Automated services such as GitHub Actions (and Jenkins, BuildBot, etc.) 
MAY work on website content and other non-released data such as documentation 
and convenience binaries. Automated services MUST NOT push data to a repository 
or branch that is subject to official release as a software package by the 
project, unless the project secures specific prior authorization of the 
workflow from Infrastructure.
{quote}
Not sure whether they updated that one recently. Or do you have another source 
which is stricter, [~martijnvisser] ?


was (Author: mapohl):
I guess we could do it. The [GitHub Actions 
Policy|https://infra.apache.org/github-actions-policy.html] excludes 
non-released artifacts like website from the restriction:
{quote}Automated services such as GitHub Actions (and Jenkins, BuildBot, etc.) 
MAY work on website content and other non-released data such as documentation 
and convenience binaries. Automated services MUST NOT push data to a repository 
or branch that is subject to official release as a software package by the 
project, unless the project secures specific prior authorization of the 
workflow from Infrastructure.
{quote}
Not sure whether they updated that one recently. Or do you have another source 
which is stricter, [~martijnvisser] ?

> Add github ci for flink-web to auto commit build files
> --
>
> Key: FLINK-34953
> URL: https://issues.apache.org/jira/browse/FLINK-34953
> Project: Flink
>  Issue Type: Improvement
>  Components: Project Website
>Reporter: Zhongqiang Gong
>Priority: Minor
>  Labels: website
>
> Currently, build files for https://github.com/apache/flink-web are committed 
> from local builds. So I want to use GitHub CI to build the docs and commit them.
>  
> Changes:
>  * Add a website build check for PRs
>  * Automatically build and commit the build files after a PR is merged to `asf-site`
>  * Optional: this CI can also be triggered manually





[jira] [Commented] (FLINK-34953) Add github ci for flink-web to auto commit build files

2024-03-28 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831665#comment-17831665
 ] 

Matthias Pohl commented on FLINK-34953:
---

I guess we could do it. The [GitHub Actions 
Policy|https://infra.apache.org/github-actions-policy.html] excludes 
non-released artifacts like website from the restriction:
{quote}Automated services such as GitHub Actions (and Jenkins, BuildBot, etc.) 
MAY work on website content and other non-released data such as documentation 
and convenience binaries. Automated services MUST NOT push data to a repository 
or branch that is subject to official release as a software package by the 
project, unless the project secures specific prior authorization of the 
workflow from Infrastructure.
{quote}
Not sure whether they updated that one recently. Or do you have another source 
which is stricter, [~martijnvisser] ?

> Add github ci for flink-web to auto commit build files
> --
>
> Key: FLINK-34953
> URL: https://issues.apache.org/jira/browse/FLINK-34953
> Project: Flink
>  Issue Type: Improvement
>  Components: Project Website
>Reporter: Zhongqiang Gong
>Priority: Minor
>  Labels: website
>
> Currently, build files for https://github.com/apache/flink-web are committed 
> from local builds. So I want to use GitHub CI to build the docs and commit them.
>  
> Changes:
>  * Add a website build check for PRs
>  * Automatically build and commit the build files after a PR is merged to `asf-site`
>  * Optional: this CI can also be triggered manually





[jira] [Commented] (FLINK-34937) Apache Infra GHA policy update

2024-03-28 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831659#comment-17831659
 ] 

Matthias Pohl commented on FLINK-34937:
---

Let's check whether our CI can benefit from https://github.com/assignUser/stash 
(which is provided by [~assignuser] from the Apache Arrow project and was 
promoted in Apache Infra's roundtable group).

> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> There is a policy update [announced in the infra 
> ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
> asks Apache projects to limit the number of runners per job. Additionally, it 
> references the [GHA policy|https://infra.apache.org/github-actions-policy.html], 
> which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.





[jira] [Assigned] (FLINK-34551) Align retry mechanisms of FutureUtils

2024-03-28 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-34551:
-

Assignee: Matthias Pohl  (was: Kumar Mallikarjuna)

> Align retry mechanisms of FutureUtils
> -
>
> Key: FLINK-34551
> URL: https://issues.apache.org/jira/browse/FLINK-34551
> Project: Flink
>  Issue Type: Technical Debt
>  Components: API / Core
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> The retry mechanisms of FutureUtils include quite a bit of redundant code 
> which makes it hard to understand and to extend. The logic should be aligned 
> properly.





[jira] [Commented] (FLINK-34551) Align retry mechanisms of FutureUtils

2024-03-28 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831657#comment-17831657
 ] 

Matthias Pohl commented on FLINK-34551:
---

The intention of this ticket came from FLINK-34227, where I wanted to add logic 
for retrying forever. I managed to split {{retrySuccessfulOperationWithDelay}} 
in FLINK-34227 in a way that didn't generate too much additional redundant 
code. I created FLINK-34551 as a follow-up anyway because I noticed that 
{{retrySuccessfulOperationWithDelay}} and {{retryOperation}} share some common 
logic, and that we could improve how these methods decide which executor to run 
the {{operation}} on (scheduledExecutor vs. the calling thread).
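
To illustrate the kind of consolidation meant here, a single shared retry 
helper could look roughly like this. This is a hypothetical sketch, not the 
actual {{FutureUtils}} API; the class name, method names, and signatures are 
assumptions. The executor parameter captures the "scheduled executor vs. 
calling thread" decision mentioned above:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical sketch of a single shared retry helper (not the real FutureUtils API). */
public class RetrySketch {

    public static <T> CompletableFuture<T> retry(
            Supplier<CompletableFuture<T>> operation,
            int retries,
            Predicate<Throwable> retryPredicate,
            Executor executor) {
        CompletableFuture<T> result = new CompletableFuture<>();
        runAttempt(operation, retries, retryPredicate, executor, result);
        return result;
    }

    private static <T> void runAttempt(
            Supplier<CompletableFuture<T>> operation,
            int retriesLeft,
            Predicate<Throwable> retryPredicate,
            Executor executor,
            CompletableFuture<T> result) {
        operation.get().whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (retriesLeft > 0 && retryPredicate.test(error)) {
                // The executor decides where the next attempt runs
                // (e.g. a scheduled executor vs. the calling thread).
                executor.execute(() ->
                        runAttempt(operation, retriesLeft - 1, retryPredicate, executor, result));
            } else {
                result.completeExceptionally(error);
            }
        });
    }

    public static void main(String[] args) {
        // Operation that fails twice, then succeeds on the third attempt.
        int[] attempts = {0};
        Integer value = retry(
                () -> {
                    attempts[0]++;
                    return attempts[0] < 3
                            ? CompletableFuture.<Integer>failedFuture(new RuntimeException("boom"))
                            : CompletableFuture.completedFuture(42);
                },
                5,                 // max retries
                t -> true,         // retry on any error
                Runnable::run)     // run retries on the calling thread
                .join();
        System.out.println(value + " after " + attempts[0] + " attempts");
    }
}
```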

Your current proposal still has redundant code. We would need to iterate on 
the change a bit more and discuss the contract of these methods in more detail. 
But unfortunately, I will be away for quite a while soon, so I would not be 
able to help you. Additionally, it's not a high-priority task right now. I'm 
wondering whether we should unassign the task again. I want to avoid you 
spending time on it and then getting stuck because of missing feedback from my side.

I should have considered it yesterday already. Sorry for that.

> Align retry mechanisms of FutureUtils
> -
>
> Key: FLINK-34551
> URL: https://issues.apache.org/jira/browse/FLINK-34551
> Project: Flink
>  Issue Type: Technical Debt
>  Components: API / Core
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Assignee: Kumar Mallikarjuna
>Priority: Major
>  Labels: pull-request-available
>
> The retry mechanisms of FutureUtils include quite a bit of redundant code 
> which makes it hard to understand and to extend. The logic should be aligned 
> properly.





[jira] [Comment Edited] (FLINK-34937) Apache Infra GHA policy update

2024-03-27 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831422#comment-17831422
 ] 

Matthias Pohl edited comment on FLINK-34937 at 3/27/24 3:45 PM:


We should pin all actions (i.e. use the git SHA rather than a version tag) for 
external actions (anything other than {{actions/\*}}, {{github/\*}} and 
{{apache/\*}} prefixed actions). That's not the case right now.


was (Author: mapohl):
We should pin all actions (i.e. use the git SHA rather than a version tag) for 
external actions (anything other than {{actions/*}}, {{github/*}} and 
{{apache/*}} prefixed actions). That's not the case right now.

> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> There is a policy update [announced in the infra 
> ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
> asks Apache projects to limit the number of runners per job. Additionally, it 
> references the [GHA policy|https://infra.apache.org/github-actions-policy.html], 
> which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.





[jira] [Commented] (FLINK-34937) Apache Infra GHA policy update

2024-03-27 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831422#comment-17831422
 ] 

Matthias Pohl commented on FLINK-34937:
---

We should pin all actions (i.e. use the git SHA rather than a version tag) for 
external actions (anything other than {{actions/*}}, {{github/*}} and 
{{apache/*}} prefixed actions). That's not the case right now.
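
For illustration, pinning an external action means referencing an immutable 
full commit SHA instead of a mutable tag. The action name and SHA below are 
placeholders, not real values:

```yaml
steps:
  # Unpinned: a moving tag that the action owner can re-point at any time.
  - uses: some-org/some-action@v1
  # Pinned: a full commit SHA, with the corresponding tag kept as a comment.
  - uses: some-org/some-action@0123456789abcdef0123456789abcdef01234567  # v1.2.3
```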

> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> There is a policy update [announced in the infra 
> ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
> asks Apache projects to limit the number of runners per job. Additionally, it 
> references the [GHA policy|https://infra.apache.org/github-actions-policy.html], 
> which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.





[jira] [Resolved] (FLINK-34419) flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21

2024-03-27 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-34419.
---
Resolution: Fixed

> flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21
> ---
>
> Key: FLINK-34419
> URL: https://issues.apache.org/jira/browse/FLINK-34419
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System / CI
>Reporter: Matthias Pohl
>Assignee: Muhammet Orazov
>Priority: Major
>  Labels: pull-request-available, starter
>
> [.github/workflows/snapshot.yml|https://github.com/apache/flink-docker/blob/master/.github/workflows/snapshot.yml#L40]
>  needs to be updated: JDK 17 support was added in 1.18 (FLINK-15736). JDK 21 
> support was added in 1.19 (FLINK-33163)





[jira] [Comment Edited] (FLINK-34419) flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21

2024-03-27 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831391#comment-17831391
 ] 

Matthias Pohl edited comment on FLINK-34419 at 3/27/24 2:56 PM:


master: 9e0041a2c9dace4bf3f32815e3e24e24385b179b
dev-master: 1460077743b29e17edd0a2d7efd3897fa097988d
dev-1.19: 67d7c46ed382a665e941f0cf1f1606d10f87dee5
dev-1.18: d93d911b015e535fc2b6f1426c3b36229ff3d02a


was (Author: mapohl):
master: 9e0041a2c9dace4bf3f32815e3e24e24385b179b
dev-master: tba
dev-1.19: tba
dev-1.18: tba

> flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21
> ---
>
> Key: FLINK-34419
> URL: https://issues.apache.org/jira/browse/FLINK-34419
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System / CI
>Reporter: Matthias Pohl
>Assignee: Muhammet Orazov
>Priority: Major
>  Labels: pull-request-available, starter
>
> [.github/workflows/snapshot.yml|https://github.com/apache/flink-docker/blob/master/.github/workflows/snapshot.yml#L40]
>  needs to be updated: JDK 17 support was added in 1.18 (FLINK-15736). JDK 21 
> support was added in 1.19 (FLINK-33163)





[jira] [Commented] (FLINK-34419) flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21

2024-03-27 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831391#comment-17831391
 ] 

Matthias Pohl commented on FLINK-34419:
---

master: 9e0041a2c9dace4bf3f32815e3e24e24385b179b
dev-master: tba
dev-1.19: tba
dev-1.18: tba

> flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21
> ---
>
> Key: FLINK-34419
> URL: https://issues.apache.org/jira/browse/FLINK-34419
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System / CI
>Reporter: Matthias Pohl
>Assignee: Muhammet Orazov
>Priority: Major
>  Labels: pull-request-available, starter
>
> [.github/workflows/snapshot.yml|https://github.com/apache/flink-docker/blob/master/.github/workflows/snapshot.yml#L40]
>  needs to be updated: JDK 17 support was added in 1.18 (FLINK-15736). JDK 21 
> support was added in 1.19 (FLINK-33163)





  1   2   3   4   5   6   7   8   9   10   >