[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719525#comment-17719525 ] Sergio Sainz commented on FLINK-31974: --

Hi [~mapohl] - let me set up a new cluster later on to get the full logs. Below please find the thread dump from the Flink 1.17.0 crash:

{code:java}
2023-04-28 20:50:50,305 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Received resource requirements from job 0a97c80a173b7ebb619c5b030b607520: [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, numberOfRequiredSlots=1}]
...
2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15' produced an uncaught exception. Stopping the process...
java.util.concurrent.CompletionException: org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/env-my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-realtime-server-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
	at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
	at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/env-my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-realtime-server-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
	at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684) ~[flink-dist-1.17.0.jar:1.17.0]
	at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664) ~[flink-dist-1.17.0.jar:1.17.0]
	at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613) ~[flink-dist-1.17.0.jar:1.17.0]
	at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558) ~[flink-dist-1.17.0.jar:1.17.0]
	at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) ~[flink-dist-1.17.0.jar:1.17.0]
	at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308) ~[flink-dist-1.17.0.jar:1.17.0]
	at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644) ~[flink-dist-1.17.0.jar:1.17.0]
	at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83) ~[flink-dist-1.17.0.jar:1.17.0]
	at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61) ~[flink-dist-1.17.0.jar:1.17.0]
	at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163) ~[flink-dist-1.17.0.jar:1.17.0]
	... 4 more
2023-04-28 20:50:50,602 ERROR org.apache.flink.util.FatalExitExceptionHandler [] - Thread dump:
"main" prio=5 Id=1 WAITING on java.util.concurrent.CompletableFuture$Signaller@2897b146
	at java.base@11.0.19/jdk.internal.misc.Unsafe.park(Native Method)
	- waiting on java.util.concurrent.CompletableFuture$Signaller@2897b146
	at java.base@11.0.19/java.util.concurrent.locks.LockSupport.park(Unknown Source)
	at java.base@11.0.19/java.util.concurrent.CompletableFuture$Signaller.block(Unknown Source)
	at java.base@11.0.19/java.util.concurrent.ForkJoinPool.managedBlock(Unknown Source)
	at java.base@11.0.19/java.util.concurrent.CompletableFuture.waitingGet(Unknown Source)
	at
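The 1.16.1 behaviour described in this ticket suggests the failure can be contained by handling it on the returned future rather than letting it escape to the dispatcher thread, where the uncaught-exception handler kills the process. A minimal, self-contained sketch of that idea (the `createTaskManagerPod` stub below is hypothetical and merely stands in for the failing Kubernetes POST; it is not Flink's actual API):

```java
import java.util.concurrent.CompletableFuture;

public class GracefulPodCreation {

    // Hypothetical stand-in for the Kubernetes client call that fails
    // when the namespace resource quota is exceeded.
    static CompletableFuture<String> createTaskManagerPod() {
        return CompletableFuture.supplyAsync(() -> {
            throw new RuntimeException("Forbidden! exceeded quota");
        });
    }

    public static void main(String[] args) {
        // Handling the failure on the future itself keeps it away from the
        // thread's uncaught-exception handler (the graceful 1.16.1 behaviour).
        // The async failure arrives wrapped in a CompletionException, so the
        // original message is on getCause().
        String result = createTaskManagerPod()
                .exceptionally(t -> "request failed: " + t.getCause().getMessage())
                .join();
        System.out.println(result);
    }
}
```

Usage: running the class prints the handled failure message instead of terminating the JVM, which is the difference between the two log excerpts in this ticket.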
[jira] [Created] (FLINK-31978) flink-connector-jdbc v.1.17.0 not published
Sergio Sainz created FLINK-31978: Summary: flink-connector-jdbc v.1.17.0 not published Key: FLINK-31978 URL: https://issues.apache.org/jira/browse/FLINK-31978 Project: Flink Issue Type: Bug Components: Connectors / JDBC Affects Versions: 1.17.0 Reporter: Sergio Sainz The connector flink-connector-jdbc version 1.17.0 is not being published in public maven repositories: [https://mvnrepository.com/artifact/org.apache.flink/flink-connector-jdbc] Maybe related to FLINK-29642? -- This message was sent by Atlassian Jira (v8.20.10#820010)
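For context, FLINK-29642 tracked moving the JDBC connector out of the main Flink repository; externalized connectors are versioned independently of the core release, so the artifact is not expected to appear with a plain 1.17.0 version. A dependency declaration would look roughly like the sketch below (the 3.1.0-1.17 version shown is an assumption about the eventual externalized release, not a confirmed artifact):

```xml
<!-- Hypothetical coordinates for the externalized JDBC connector;
     check Maven Central for the actually published version. -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-jdbc</artifactId>
    <version>3.1.0-1.17</version>
</dependency>
```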
[jira] [Created] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
Sergio Sainz created FLINK-31974: Summary: JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler Key: FLINK-31974 URL: https://issues.apache.org/jira/browse/FLINK-31974 Project: Flink Issue Type: Bug Components: Deployment / Kubernetes Affects Versions: 1.17.0 Reporter: Sergio Sainz

When the resource quota limit is reached, the JobManager throws:

org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.

In {*}1.16.1, this is handled gracefully{*}:

2023-04-28 22:07:24,631 WARN org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Failed requesting worker with resource spec WorkerResourceSpec \{cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes), numSlots=4}, current pending count: 0
java.util.concurrent.CompletionException: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
	at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
	at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684) ~[flink-dist-1.16.1.jar:1.16.1]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664) ~[flink-dist-1.16.1.jar:1.16.1]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613) ~[flink-dist-1.16.1.jar:1.16.1]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558) ~[flink-dist-1.16.1.jar:1.16.1]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) ~[flink-dist-1.16.1.jar:1.16.1]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308) ~[flink-dist-1.16.1.jar:1.16.1]
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644) ~[flink-dist-1.16.1.jar:1.16.1]
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83) ~[flink-dist-1.16.1.jar:1.16.1]
	at io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61) ~[flink-dist-1.16.1.jar:1.16.1]
	at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163) ~[flink-dist-1.16.1.jar:1.16.1]
	... 4 more

But {*}in Flink 1.17.0, the JobManager crashes{*}:

2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15' produced an uncaught exception. Stopping the process...
java.util.concurrent.CompletionException: org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota:
[jira] [Updated] (FLINK-31796) Support service mesh istio with Flink kubernetes (both native and operator) for secure communications
[ https://issues.apache.org/jira/browse/FLINK-31796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Sainz updated FLINK-31796: -

Description:
Currently Flink Native Kubernetes does not support istio + TLS cleanly: Flink assumes that pods will be able to communicate by ip-address, while istio + TLS does not allow routing by a pod's ip-address. This ticket is to track the effort to support Flink with istio. One workaround is to disable the istio sidecar container, but this is not secure [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html]. Akka allows securing the channel manually, but there is no documentation on how to do this in the context of Flink. One potential solution is to document how to configure an akka cluster + TLS in Flink.

For example, when using the native kubernetes deployment mode with high-availability (HA), and a new TaskManager pod is started to process a job, the TaskManager pod will attempt to register itself with the resource manager (JobManager). The TaskManager looks up the resource manager by ip-address (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1). Another affected feature is metric collection.

Please see FLINK-31775 and FLINK-28171. Especially comment "{_}Flink currently just doesn't support Istio.{_}"

was: Currently Flink Native Kubernetes does not support istio + TLS cleanly : Flink assumes that pods will be able to communicate by ip-address, meanwhile istio + TLS does not allow routing by pod's ip-address. This ticket is to track effort to support Flink with istio. Some workaround is to disable the istio sidecar container, but this is not secure [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html]. Akka allows to secure the channel manually, but there is no documentation how to do this in the context of Flink. One potential solution for this is to have the documentation about how to configure akka cluster + TLS in flink.
For example when using native kubernetes deployment mode with high-availability (HA), and when new TaskManager pod is started to process a job, the TaskManager pod will attempt to register itself to the resource manager (JobManager). the TaskManager looks up the resource manager per ip-address (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1). Other affected features is metric collection. Please see FLINK-31775 and FLINK-28171. Especially comment "{_}Flink currently just doesn't support Istio. If anything, it's a new feature/improvement.{_} " > Support service mesh istio with Flink kubernetes (both native and operator) > for secure communications > - > > Key: FLINK-31796 > URL: https://issues.apache.org/jira/browse/FLINK-31796 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes >Affects Versions: 1.17.0 >Reporter: Sergio Sainz >Priority: Major > > Currently Flink Native Kubernetes does not support istio + TLS cleanly : > Flink assumes that pods will be able to communicate by ip-address, meanwhile > istio + TLS does not allow routing by pod's ip-address. > This ticket is to track effort to support Flink with istio. Some workaround > is to disable the istio sidecar container, but this is not secure > [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html]. Akka > allows to secure the channel manually, but there is no documentation how to > do this in the context of Flink. One potential solution for this is to have > the documentation about how to configure akka cluster + TLS in flink. > For example when using native kubernetes deployment mode with > high-availability (HA), and when new TaskManager pod is started to process a > job, the TaskManager pod will attempt to register itself to the resource > manager (JobManager). the TaskManager looks up the resource manager per > ip-address (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1). > Other affected features is metric collection. 
> Please see FLINK-31775 and FLINK-28171. Especially comment "{_}Flink > currently just doesn't support Istio.{_}" -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-31796) Support service mesh istio with Flink kubernetes (both native and operator) for secure communications
Sergio Sainz created FLINK-31796: Summary: Support service mesh istio with Flink kubernetes (both native and operator) for secure communications Key: FLINK-31796 URL: https://issues.apache.org/jira/browse/FLINK-31796 Project: Flink Issue Type: New Feature Components: Deployment / Kubernetes Affects Versions: 1.17.0 Reporter: Sergio Sainz

Currently Flink Native Kubernetes does not support istio + TLS cleanly: Flink assumes that pods will be able to communicate by ip-address, while istio + TLS does not allow routing by a pod's ip-address. This ticket is to track the effort to support Flink with istio. One workaround is to disable the istio sidecar container, but this is not secure [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html]. Akka allows securing the channel manually, but there is no documentation on how to do this in the context of Flink. One potential solution is to document how to configure an akka cluster + TLS in Flink.

For example, when using the native kubernetes deployment mode with high-availability (HA), and a new TaskManager pod is started to process a job, the TaskManager pod will attempt to register itself with the resource manager (JobManager). The TaskManager looks up the resource manager by ip-address (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1). Another affected feature is metric collection.

Please see FLINK-31775 and FLINK-28171. Especially comment "{_}Flink currently just doesn't support Istio. If anything, it's a new feature/improvement.{_}" -- This message was sent by Atlassian Jira (v8.20.10#820010)
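One concrete direction, based on the Istio protocol-selection guidelines referenced in FLINK-28171, is to give the JobManager Service ports explicit protocol hints so the mesh does not have to sniff the non-HTTP akka traffic. The manifest below is an illustrative sketch only; the service name, selector label, and port set are assumptions, not what Flink's native mode actually generates:

```yaml
# Illustrative sketch: explicit protocol hints on a JobManager Service.
apiVersion: v1
kind: Service
metadata:
  name: flink-jobmanager         # assumed name; Flink generates its own
spec:
  clusterIP: None
  selector:
    app: flink                   # assumed label
  ports:
    - name: tcp-jobmanager-rpc   # "tcp-" prefix follows Istio's naming convention
      port: 6123
      protocol: TCP
      appProtocol: tcp           # explicit alternative, Kubernetes >= 1.20
    - name: tcp-blobserver
      port: 6124
      protocol: TCP
      appProtocol: tcp
```

Note this only helps traffic that actually flows through a Service; as discussed on FLINK-31775, the HA registration path uses pod ip-addresses directly.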
[jira] [Comment Edited] (FLINK-31775) High-Availability not supported in kubernetes when istio enabled
[ https://issues.apache.org/jira/browse/FLINK-31775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711440#comment-17711440 ] Sergio Sainz edited comment on FLINK-31775 at 4/12/23 3:18 PM: ---

Another confusion about how the fix for FLINK-28171 can help this case is because *appProtocol* applies to services, while this defect is about ip-addresses.

Could not resolve ResourceManager address akka.tcp://flink@{*}192.168.140.164:6123{*}/user/rpc/resourcemanager_1,

So we are not sure adding appProtocol could solve the ip-address routing (as there is no service involved)

was (Author: sergiosp): Another confusion about how the fix for FLINK-28171 can help this case is becase *appProtocol* affects services. Meanwhile this defect is about ip-addresses. Could not resolve ResourceManager address akka.tcp://flink@{*}192.168.140.164:6123{*}/user/rpc/resourcemanager_1, Then, we are not sure adding appProtocol could solve the ip-address routing (as there is no service involved)

> High-Availability not supported in kubernetes when istio enabled > > > Key: FLINK-31775 > URL: https://issues.apache.org/jira/browse/FLINK-31775 > Project: Flink > Issue Type: Bug > Components: Deployment / Kubernetes >Affects Versions: 1.16.1 >Reporter: Sergio Sainz >Priority: Major > > When using native kubernetes deployment mode with high-availability (HA), and > when new TaskManager pod is started to process a job, the TaskManager pod > will attempt to register itself to the resource manager (JobManager). 
the > TaskManager looks up the resource manager per ip-address > (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1) > > Nevertheless when istio is enabled, the resolution by ip address is blocked, > and hence we see that the job cannot start because task manager cannot > register with the resource manager: > 2023-04-10 23:24:19,752 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not > resolve ResourceManager address > akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in > 1 ms: Could not connect to rpc endpoint under address > akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1. > > Notice that when HA is disabled, the resolution of the resource manager is > made by service name and so the resource manager can be found > > 2023-04-11 00:49:34,162 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful > registration at resource manager > akka.tcp://flink@myenv-dev-flink-cluster.myenv-dev:6123/user/rpc/resourcemanager_* > under registration id 83ad942597f86aa880ee96f1c2b8b923. > > Notice in my case , it is not possible to disable istio as explained here: > [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html] > > Although similar to https://issues.apache.org/jira/browse/FLINK-28171 , > logging as separate defect as I believe the fix of FLINK-28171 won't fix this > case. FLINK-28171 is about Flink Kubernetes Operator and this is about > native kubernetes deployment. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
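Since the pod-to-pod akka traffic described above bypasses any Service, another avenue worth exploring is Istio's port-level peer authentication, which can relax mTLS on just the RPC port instead of disabling the sidecar entirely. The sketch below is unverified against a Flink HA deployment, and the namespace and selector label are hypothetical:

```yaml
# Illustrative sketch: keep mTLS strict overall, but make the Flink
# RPC port permissive so direct pod-ip connections are still accepted.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: flink-rpc-permissive
  namespace: my-namespace        # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: flink                 # hypothetical label on the Flink pods
  mtls:
    mode: STRICT
  portLevelMtls:
    6123:                        # JobManager RPC port from the logs above
      mode: PERMISSIVE
```

The trade-off is that traffic on the permissive port is no longer required to be encrypted by the mesh, so this is a narrowing of the "skip the sidecar" workaround rather than a full solution.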
[jira] [Commented] (FLINK-31775) High-Availability not supported in kubernetes when istio enabled
[ https://issues.apache.org/jira/browse/FLINK-31775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711440#comment-17711440 ] Sergio Sainz commented on FLINK-31775: --

Another confusion about how the fix for FLINK-28171 can help this case is because *appProtocol* affects services, while this defect is about ip-addresses.

Could not resolve ResourceManager address akka.tcp://flink@{*}192.168.140.164:6123{*}/user/rpc/resourcemanager_1,

So we are not sure adding appProtocol could solve the ip-address routing (as there is no service involved)

> High-Availability not supported in kubernetes when istio enabled > > > Key: FLINK-31775 > URL: https://issues.apache.org/jira/browse/FLINK-31775 > Project: Flink > Issue Type: Bug > Components: Deployment / Kubernetes >Affects Versions: 1.16.1 >Reporter: Sergio Sainz >Priority: Major > > When using native kubernetes deployment mode with high-availability (HA), and > when new TaskManager pod is started to process a job, the TaskManager pod > will attempt to register itself to the resource manager (JobManager). the > TaskManager looks up the resource manager per ip-address > (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1) > > Nevertheless when istio is enabled, the resolution by ip address is blocked, > and hence we see that the job cannot start because task manager cannot > register with the resource manager: > 2023-04-10 23:24:19,752 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not > resolve ResourceManager address > akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in > 1 ms: Could not connect to rpc endpoint under address > akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1. 
> > Notice that when HA is disabled, the resolution of the resource manager is > made by service name and so the resource manager can be found > > 2023-04-11 00:49:34,162 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful > registration at resource manager > akka.tcp://flink@myenv-dev-flink-cluster.myenv-dev:6123/user/rpc/resourcemanager_* > under registration id 83ad942597f86aa880ee96f1c2b8b923. > > Notice in my case , it is not possible to disable istio as explained here: > [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html] > > Although similar to https://issues.apache.org/jira/browse/FLINK-28171 , > logging as separate defect as I believe the fix of FLINK-28171 won't fix this > case. FLINK-28171 is about Flink Kubernetes Operator and this is about > native kubernetes deployment. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-31775) High-Availability not supported in kubernetes when istio enabled
[ https://issues.apache.org/jira/browse/FLINK-31775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711413#comment-17711413 ] Sergio Sainz edited comment on FLINK-31775 at 4/12/23 2:32 PM: --- Hi [~huwh] , [~martijnvisser] Thanks for the help triaging ! One question on the solution of adding "appProtocol" into the podTemplate from FLINK-28171. Even after we set the appProtocol, do we need to bypass the istio sidecar? Because we could not bypass istio sidecar in our case ~ was (Author: sergiosp): Hi [~huwh] , [~martijnvisser] Thanks for the help triaging ! One question on the solution of adding "appProtocol" into the podTemplate from FLINK-28171. Even after we set the appProtocol, do we need to bypass the istio sidecar? > High-Availability not supported in kubernetes when istio enabled > > > Key: FLINK-31775 > URL: https://issues.apache.org/jira/browse/FLINK-31775 > Project: Flink > Issue Type: Bug > Components: Deployment / Kubernetes >Affects Versions: 1.16.1 >Reporter: Sergio Sainz >Priority: Major > > When using native kubernetes deployment mode with high-availability (HA), and > when new TaskManager pod is started to process a job, the TaskManager pod > will attempt to register itself to the resource manager (JobManager). the > TaskManager looks up the resource manager per ip-address > (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1) > > Nevertheless when istio is enabled, the resolution by ip address is blocked, > and hence we see that the job cannot start because task manager cannot > register with the resource manager: > 2023-04-10 23:24:19,752 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not > resolve ResourceManager address > akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in > 1 ms: Could not connect to rpc endpoint under address > akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1. 
> > Notice that when HA is disabled, the resolution of the resource manager is > made by service name and so the resource manager can be found > > 2023-04-11 00:49:34,162 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful > registration at resource manager > akka.tcp://flink@myenv-dev-flink-cluster.myenv-dev:6123/user/rpc/resourcemanager_* > under registration id 83ad942597f86aa880ee96f1c2b8b923. > > Notice in my case , it is not possible to disable istio as explained here: > [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html] > > Although similar to https://issues.apache.org/jira/browse/FLINK-28171 , > logging as separate defect as I believe the fix of FLINK-28171 won't fix this > case. FLINK-28171 is about Flink Kubernetes Operator and this is about > native kubernetes deployment. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-31775) High-Availability not supported in kubernetes when istio enabled
[ https://issues.apache.org/jira/browse/FLINK-31775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711413#comment-17711413 ] Sergio Sainz edited comment on FLINK-31775 at 4/12/23 2:13 PM: --- Hi [~huwh] , [~martijnvisser] Thanks for the help triaging ! One question on the solution of adding "appProtocol" into the podTemplate from FLINK-28171. Even after we set the appProtocol, do we need to bypass the istio sidecar? was (Author: sergiosp): Hi [~huwh] , [~martijnvisser] Thanks for the help triaging ! One question on the solution of adding "appProtocol" into the podTemplate from FLINK-28171. Even after we set the appProtocol, do we need to bypass the istio sidecar? > High-Availability not supported in kubernetes when istio enabled > > > Key: FLINK-31775 > URL: https://issues.apache.org/jira/browse/FLINK-31775 > Project: Flink > Issue Type: Bug > Components: Deployment / Kubernetes >Affects Versions: 1.16.1 >Reporter: Sergio Sainz >Priority: Major > > When using native kubernetes deployment mode with high-availability (HA), and > when new TaskManager pod is started to process a job, the TaskManager pod > will attempt to register itself to the resource manager (JobManager). the > TaskManager looks up the resource manager per ip-address > (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1) > > Nevertheless when istio is enabled, the resolution by ip address is blocked, > and hence we see that the job cannot start because task manager cannot > register with the resource manager: > 2023-04-10 23:24:19,752 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not > resolve ResourceManager address > akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in > 1 ms: Could not connect to rpc endpoint under address > akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1. 
> > Notice that when HA is disabled, the resolution of the resource manager is > made by service name and so the resource manager can be found > > 2023-04-11 00:49:34,162 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful > registration at resource manager > akka.tcp://flink@myenv-dev-flink-cluster.myenv-dev:6123/user/rpc/resourcemanager_* > under registration id 83ad942597f86aa880ee96f1c2b8b923. > > Notice in my case , it is not possible to disable istio as explained here: > [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html] > > Although similar to https://issues.apache.org/jira/browse/FLINK-28171 , > logging as separate defect as I believe the fix of FLINK-28171 won't fix this > case. FLINK-28171 is about Flink Kubernetes Operator and this is about > native kubernetes deployment. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31775) High-Availability not supported in kubernetes when istio enabled
[ https://issues.apache.org/jira/browse/FLINK-31775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711413#comment-17711413 ] Sergio Sainz commented on FLINK-31775: -- Hi [~huwh] , [~martijnvisser] Thanks for the help triaging ! One question on the solution of adding "appProtocol" into the podTemplate from FLINK-28171. Even after we set the appProtocol, do we need to bypass the istio sidecar? > High-Availability not supported in kubernetes when istio enabled > > > Key: FLINK-31775 > URL: https://issues.apache.org/jira/browse/FLINK-31775 > Project: Flink > Issue Type: Bug > Components: Deployment / Kubernetes >Affects Versions: 1.16.1 >Reporter: Sergio Sainz >Priority: Major > > When using native kubernetes deployment mode with high-availability (HA), and > when new TaskManager pod is started to process a job, the TaskManager pod > will attempt to register itself to the resource manager (JobManager). the > TaskManager looks up the resource manager per ip-address > (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1) > > Nevertheless when istio is enabled, the resolution by ip address is blocked, > and hence we see that the job cannot start because task manager cannot > register with the resource manager: > 2023-04-10 23:24:19,752 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not > resolve ResourceManager address > akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in > 1 ms: Could not connect to rpc endpoint under address > akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1. 
> > Notice that when HA is disabled, the resolution of the resource manager is > made by service name and so the resource manager can be found > > 2023-04-11 00:49:34,162 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful > registration at resource manager > akka.tcp://flink@myenv-dev-flink-cluster.myenv-dev:6123/user/rpc/resourcemanager_* > under registration id 83ad942597f86aa880ee96f1c2b8b923. > > Notice in my case , it is not possible to disable istio as explained here: > [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html] > > Although similar to https://issues.apache.org/jira/browse/FLINK-28171 , > logging as separate defect as I believe the fix of FLINK-28171 won't fix this > case. FLINK-28171 is about Flink Kubernetes Operator and this is about > native kubernetes deployment. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-28171) Adjust Job and Task manager port definitions to work with Istio+mTLS
[ https://issues.apache.org/jira/browse/FLINK-28171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710375#comment-17710375 ] Sergio Sainz edited comment on FLINK-28171 at 4/11/23 11:46 PM:

Hello [~wangyang0918]! I am also experiencing this issue with a native kubernetes deployment with HA enabled (not in the flink k8s operator). In [https://lists.apache.org/thread/yl40s9069wksz66qlf9t6jhmwsn59zft] you mentioned that "If HA enabled, the internal jobmanager rpc service will not be created. Instead, the TaskManager retrieves the JobManager address via HA services and connects to it via pod ip."

Do you know whether we can change the way the TaskManager connects to HA services: do not use the ip address and instead use the pod name? I think we cannot add the akka workaround of bypassing the istio sidecar. Alternatively, do you know how to add TLS encryption to this channel (TaskManager->HA service) manually?

Thanks for the info ~

was (Author: sergiosp): Hello [~wangyang0918] ! also experiencing this issue with native kubernetes deployment with HA enabled (not in flink k8s operator). In [https://lists.apache.org/thread/yl40s9069wksz66qlf9t6jhmwsn59zft] you mentioned that "If HA enabled, the internal jobmanager rpc service will not be created. Instead, the TaskManager retrieves the JobManager address via HA services and connects to it via pod ip." Do you know whether we can change the way TaskManager connects to HA services: do not use ip address and instead use pod name? I think we cannot add the akka workaround of bypassing the istio sidecar. 
Thanks for the info ~ > Adjust Job and Task manager port definitions to work with Istio+mTLS > > > Key: FLINK-28171 > URL: https://issues.apache.org/jira/browse/FLINK-28171 > Project: Flink > Issue Type: Improvement > Components: Deployment / Kubernetes >Affects Versions: 1.14.4 > Environment: flink-kubernetes-operator 1.0.0 > Flink 1.14-java11 > Kubernetes v1.19.5 > Istio 1.7.6 >Reporter: Moshe Elisha >Assignee: Moshe Elisha >Priority: Major > Labels: pull-request-available > > Hello, > > We are launching Flink deployments using the [Flink Kubernetes > Operator|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-stable/] > on a Kubernetes cluster with Istio and mTLS enabled. > > We found that the TaskManager is unable to communicate with the JobManager on > the jobmanager-rpc port: > > {{2022-06-15 15:25:40,508 WARN akka.remote.ReliableDeliverySupervisor > [] - Association with remote system > [akka.tcp://[flink@amf-events-to-inference-and-central.nwdaf-edge|mailto:flink@amf-events-to-inference-and-central.nwdaf-edge]:6123] > has failed, address is now gated for [50] ms. Reason: [Association failed > with > [akka.tcp://[flink@amf-events-to-inference-and-central.nwdaf-edge|mailto:flink@amf-events-to-inference-and-central.nwdaf-edge]:6123]] > Caused by: [The remote system explicitly disassociated (reason unknown).]}} > > The reason for the issue is that the JobManager service port definitions are > not following the Istio guidelines > [https://istio.io/latest/docs/ops/configuration/traffic-management/protocol-selection/] > (see example below). > > There was also an email discussion around this topic in the users mailing > group under the subject "Flink Kubernetes Operator with K8S + Istio + mTLS - > port definitions". > With the help of the community, we were able to work around the issue but it > was very hard and forced us to skip Istio proxy which is not ideal. 
> > We would like you to consider changing the default port definitions, either > # Rename the ports – I understand it is an Istio-specific guideline but maybe > it is better to at least be aligned with one (popular) vendor guideline > instead of none at all. > # Add the “appProtocol” property[1] that is not specific to any vendor but > requires Kubernetes >= 1.19 where it was introduced as beta and moved to > stable in >= 1.20. The option to add the appProtocol property was added only in > [https://github.com/fabric8io/kubernetes-client/releases/tag/v5.10.0] with > [#3570|https://github.com/fabric8io/kubernetes-client/issues/3570]. > # Or allow a way to override the defaults. > > [https://kubernetes.io/docs/concepts/services-networking/_print/#application-protocol] > > > {{# k get service inference-results-to-analytics-engine -o yaml}} > {{apiVersion: v1}} > {{kind: Service}} > {{...}} > {{spec:}} > {{ clusterIP: None}} > {{ ports:}} > {{ - name: jobmanager-rpc *# should start with “tcp-“ or add "appProtocol" > property*}} > {{ port: 6123}} > {{ protocol: TCP}} > {{ targetPort: 6123}} > {{ - name: blobserver *# should start with "tcp-" or add "appProtocol" > property*}} > {{ port: 6124}} > {{ protocol: TCP}} > {{ targetPort: 6124}} > {{...}}
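The Istio-friendly port definitions proposed above can be sketched programmatically. A minimal Python sketch (the service name is hypothetical; the "tcp-" name prefix follows Istio's protocol-selection guideline, and appProtocol requires Kubernetes >= 1.19) that emits such a JobManager Service manifest:

```python
import json

def jobmanager_service_ports():
    # Both compliant options applied at once: "tcp-" name prefix and appProtocol.
    return [
        {"name": "tcp-jobmanager-rpc", "port": 6123, "protocol": "TCP",
         "targetPort": 6123, "appProtocol": "tcp"},
        {"name": "tcp-blobserver", "port": 6124, "protocol": "TCP",
         "targetPort": 6124, "appProtocol": "tcp"},
    ]

def jobmanager_service(name="flink-jobmanager"):  # name is an assumption
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        # "None" makes this a headless service, as in the example above.
        "spec": {"clusterIP": "None", "ports": jobmanager_service_ports()},
    }

if __name__ == "__main__":
    print(json.dumps(jobmanager_service(), indent=2))
```

Either option alone is sufficient for Istio's protocol selection; applying both keeps the manifest compatible with clusters older than 1.19.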
[jira] [Updated] (FLINK-31775) High-Availability not supported in kubernetes when istio enabled
[ https://issues.apache.org/jira/browse/FLINK-31775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Sainz updated FLINK-31775: - Description: When using the native Kubernetes deployment mode with high-availability (HA), when a new TaskManager pod is started to process a job, the TaskManager pod attempts to register itself with the resource manager (JobManager). The TaskManager looks up the resource manager by IP address (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1). Nevertheless, when Istio is enabled, resolution by IP address is blocked, and hence the job cannot start because the TaskManager cannot register with the resource manager: 2023-04-10 23:24:19,752 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1. Notice that when HA is disabled, the resource manager is resolved by service name, and so it can be found: 2023-04-11 00:49:34,162 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful registration at resource manager akka.tcp://flink@myenv-dev-flink-cluster.myenv-dev:6123/user/rpc/resourcemanager_* under registration id 83ad942597f86aa880ee96f1c2b8b923. Notice that in my case it is not possible to disable Istio, as explained here: [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html] Although this is similar to https://issues.apache.org/jira/browse/FLINK-28171, I am logging it as a separate defect because I believe the fix for FLINK-28171 won't cover this case: FLINK-28171 is about the Flink Kubernetes Operator, while this is about the native Kubernetes deployment. 
was: When using the native Kubernetes deployment mode with high-availability, when a new TaskManager pod is started to process a job, the TaskManager pod attempts to register itself with the resource manager (JobManager). The TaskManager looks up the resource manager by IP address (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1). Nevertheless, when Istio is enabled, resolution by IP address is blocked, and hence the job cannot start because the TaskManager cannot register with the resource manager: 2023-04-10 23:24:19,752 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1. Notice that when HA is disabled, the resource manager is resolved by service name, and so it can be found: 2023-04-11 00:49:34,162 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful registration at resource manager akka.tcp://flink@myenv-dev-flink-cluster.myenv-dev:6123/user/rpc/resourcemanager_* under registration id 83ad942597f86aa880ee96f1c2b8b923. Notice that in my case it is not possible to disable Istio, as explained here: [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html] Although this is similar to https://issues.apache.org/jira/browse/FLINK-28171, I am logging it as a separate defect because I believe the fix for FLINK-28171 won't cover this case: FLINK-28171 is about the Flink Kubernetes Operator, while this is about the native Kubernetes deployment. 
> High-Availability not supported in kubernetes when istio enabled > > > Key: FLINK-31775 > URL: https://issues.apache.org/jira/browse/FLINK-31775 > Project: Flink > Issue Type: Bug > Components: Deployment / Kubernetes >Affects Versions: 1.16.1 >Reporter: Sergio Sainz >Priority: Major > > When using native kubernetes deployment mode with high-availability (HA), and > when new TaskManager pod is started to process a job, the TaskManager pod > will attempt to register itself to the resource manager (JobManager). the > TaskManager looks up the resource manager per ip-address > (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1) > > Nevertheless when istio is enabled, the resolution by ip address is blocked, > and hence we see that the job cannot start because task manager cannot > register with the resource manager: > 2023-04-10 23:24:19,752 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not > resolve ResourceManager address > akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in > 1 ms: Could not connect to rpc endpoint under address > akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1.
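To illustrate the distinction the logs above show, here is a small Python sketch (illustration only, not Flink code) that classifies a ResourceManager address as pod-IP-based (the form used when HA is enabled, which Istio blocks) or service-name-based (the form used when HA is disabled, which resolves fine):

```python
import re

# Matches the akka RPC addresses quoted in the logs above.
ADDRESS = re.compile(
    r"akka\.tcp://flink@(?P<host>[^:]+):\d+/user/rpc/resourcemanager_.*"
)

def address_kind(address: str) -> str:
    host = ADDRESS.match(address).group("host")
    # A dotted-quad host means the TaskManager was handed a raw pod IP.
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host):
        return "pod-ip"        # HA enabled: direct pod IP, rejected by Istio mTLS
    return "service-name"      # HA disabled: cluster-internal service DNS name

if __name__ == "__main__":
    print(address_kind(
        "akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1"))
```

Run against the two addresses from the report, the first yields "pod-ip" and the second "service-name", which is exactly the asymmetry that makes HA fail under Istio.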
[jira] [Updated] (FLINK-31775) High-Availability not supported in kubernetes when istio enabled
[ https://issues.apache.org/jira/browse/FLINK-31775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Sainz updated FLINK-31775: - Description: When using the native Kubernetes deployment mode with high-availability, when a new TaskManager pod is started to process a job, the TaskManager pod attempts to register itself with the resource manager (JobManager). The TaskManager looks up the resource manager by IP address (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1). Nevertheless, when Istio is enabled, resolution by IP address is blocked, and hence the job cannot start because the TaskManager cannot register with the resource manager: 2023-04-10 23:24:19,752 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1. Notice that when HA is disabled, the resource manager is resolved by service name, and so it can be found: 2023-04-11 00:49:34,162 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful registration at resource manager akka.tcp://flink@myenv-dev-flink-cluster.myenv-dev:6123/user/rpc/resourcemanager_* under registration id 83ad942597f86aa880ee96f1c2b8b923. Notice that in my case it is not possible to disable Istio, as explained here: [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html] Although this is similar to https://issues.apache.org/jira/browse/FLINK-28171, I am logging it as a separate defect because I believe the fix for FLINK-28171 won't cover this case: FLINK-28171 is about the Flink Kubernetes Operator, while this is about the native Kubernetes deployment. 
was: When using the native Kubernetes deployment mode, when a new TaskManager pod is started to process a job, the TaskManager pod attempts to register itself with the resource manager (JobManager). The TaskManager looks up the resource manager by IP address (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1). Nevertheless, when Istio is enabled, resolution by IP address is blocked, and hence the job cannot start because the TaskManager cannot register with the resource manager: 2023-04-10 23:24:19,752 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1. Notice that when HA is disabled, the resource manager is resolved by service name, and so it can be found: 2023-04-11 00:49:34,162 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful registration at resource manager akka.tcp://flink@myenv-dev-flink-cluster.myenv-dev:6123/user/rpc/resourcemanager_* under registration id 83ad942597f86aa880ee96f1c2b8b923. Notice that in my case it is not possible to disable Istio, as explained here: [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html] Although this is similar to https://issues.apache.org/jira/browse/FLINK-28171, I am logging it as a separate defect because I believe the fix for FLINK-28171 won't cover this case: FLINK-28171 is about the Flink Kubernetes Operator, while this is about the native Kubernetes deployment. 
[jira] [Updated] (FLINK-31775) High-Availability not supported in kubernetes when istio enabled
[ https://issues.apache.org/jira/browse/FLINK-31775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Sainz updated FLINK-31775: - Description: When using the native Kubernetes deployment mode, when a new TaskManager pod is started to process a job, the TaskManager pod attempts to register itself with the resource manager (JobManager). The TaskManager looks up the resource manager by IP address (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1). Nevertheless, when Istio is enabled, resolution by IP address is blocked, and hence the job cannot start because the TaskManager cannot register with the resource manager: 2023-04-10 23:24:19,752 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1. Notice that when HA is disabled, the resource manager is resolved by service name, and so it can be found: 2023-04-11 00:49:34,162 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful registration at resource manager akka.tcp://flink@myenv-dev-flink-cluster.myenv-dev:6123/user/rpc/resourcemanager_* under registration id 83ad942597f86aa880ee96f1c2b8b923. Notice that in my case it is not possible to disable Istio, as explained here: [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html] Although this is similar to https://issues.apache.org/jira/browse/FLINK-28171, I am logging it as a separate defect because I believe the fix for FLINK-28171 won't cover this case: FLINK-28171 is about the Flink Kubernetes Operator, while this is about the native Kubernetes deployment. 
was: When using the native Kubernetes deployment mode, when a new TaskManager pod is started to process a job, the TaskManager pod attempts to register itself with the resource manager (JobManager). The TaskManager looks up the resource manager by IP address (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1). Nevertheless, when Istio is enabled, resolution by IP address is blocked, and hence the job cannot start because the TaskManager cannot register with the resource manager: 2023-04-10 23:24:19,752 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1. Notice that when HA is disabled, the resource manager is resolved by service name, and so it can be found: 2023-04-11 00:49:34,162 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful registration at resource manager akka.tcp://flink@myenv-dev-flink-cluster.myenv-dev:6123/user/rpc/resourcemanager_* under registration id 83ad942597f86aa880ee96f1c2b8b923. Notice that in my case it is not possible to disable Istio, as explained here: [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html] Although this is similar to https://issues.apache.org/jira/browse/FLINK-28171, I am logging it as a separate defect because I believe the fix for FLINK-28171 won't cover this case: FLINK-28171 is about the Flink Kubernetes Operator. 
[jira] [Updated] (FLINK-31775) High-Availability not supported in kubernetes when istio enabled
[ https://issues.apache.org/jira/browse/FLINK-31775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Sainz updated FLINK-31775: - Description: When using the native Kubernetes deployment mode, when a new TaskManager pod is started to process a job, the TaskManager pod attempts to register itself with the resource manager (JobManager). The TaskManager looks up the resource manager by IP address (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1). Nevertheless, when Istio is enabled, resolution by IP address is blocked, and hence the job cannot start because the TaskManager cannot register with the resource manager: 2023-04-10 23:24:19,752 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1. Notice that when HA is disabled, the resource manager is resolved by service name, and so it can be found: 2023-04-11 00:49:34,162 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful registration at resource manager akka.tcp://flink@myenv-dev-flink-cluster.myenv-dev:6123/user/rpc/resourcemanager_* under registration id 83ad942597f86aa880ee96f1c2b8b923. Notice that in my case it is not possible to disable Istio, as explained here: [https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html] Although this is similar to https://issues.apache.org/jira/browse/FLINK-28171, I am logging it as a separate defect because I believe the fix for FLINK-28171 won't cover this case: FLINK-28171 is about the Flink Kubernetes Operator. was: When using the native Kubernetes deployment mode, when a new TaskManager is started to process a job, the TaskManager attempts to register itself with the resource manager (JobManager). 
The TaskManager looks up the resource manager by IP address (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1). Nevertheless, when Istio is enabled, resolution by IP address is blocked, and hence the job cannot start because the TaskManager cannot register with the resource manager: 2023-04-10 23:24:19,752 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1. Notice that when HA is disabled, the resource manager is resolved by service name, and so it can be found: 2023-04-11 00:49:34,162 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful registration at resource manager akka.tcp://fl...@local-mci-ar32a-dev-flink-cluster.mstr-env-mci-ar32a-dev:6123/user/rpc/resourcemanager_* under registration id 83ad942597f86aa880ee96f1c2b8b923. Notice that it is not possible to disable Istio (as explained here: https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html). Although this is similar to https://issues.apache.org/jira/browse/FLINK-28171, I am logging it as a separate defect because I believe the fix for FLINK-28171 won't cover this case: FLINK-28171 is about the Flink Kubernetes Operator. 
[jira] [Created] (FLINK-31775) High-Availability not supported in kubernetes when istio enabled
Sergio Sainz created FLINK-31775: Summary: High-Availability not supported in kubernetes when istio enabled Key: FLINK-31775 URL: https://issues.apache.org/jira/browse/FLINK-31775 Project: Flink Issue Type: Bug Components: Deployment / Kubernetes Affects Versions: 1.16.1 Reporter: Sergio Sainz When using the native Kubernetes deployment mode, when a new TaskManager is started to process a job, the TaskManager attempts to register itself with the resource manager (JobManager). The TaskManager looks up the resource manager by IP address (akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1). Nevertheless, when Istio is enabled, resolution by IP address is blocked, and hence the job cannot start because the TaskManager cannot register with the resource manager: 2023-04-10 23:24:19,752 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@192.168.140.164:6123/user/rpc/resourcemanager_1. Notice that when HA is disabled, the resource manager is resolved by service name, and so it can be found: 2023-04-11 00:49:34,162 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful registration at resource manager akka.tcp://fl...@local-mci-ar32a-dev-flink-cluster.mstr-env-mci-ar32a-dev:6123/user/rpc/resourcemanager_* under registration id 83ad942597f86aa880ee96f1c2b8b923. Notice that it is not possible to disable Istio (as explained here: https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html). Although this is similar to https://issues.apache.org/jira/browse/FLINK-28171, I am logging it as a separate defect because I believe the fix for FLINK-28171 won't cover this case: FLINK-28171 is about the Flink Kubernetes Operator.
[jira] [Comment Edited] (FLINK-28171) Adjust Job and Task manager port definitions to work with Istio+mTLS
[ https://issues.apache.org/jira/browse/FLINK-28171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710375#comment-17710375 ] Sergio Sainz edited comment on FLINK-28171 at 4/11/23 2:42 AM: --- Hello [~wangyang0918]! I am also experiencing this issue with a native Kubernetes deployment with HA enabled (not with the Flink Kubernetes operator). In [https://lists.apache.org/thread/yl40s9069wksz66qlf9t6jhmwsn59zft] you mentioned that "If HA enabled, the internal jobmanager rpc service will not be created. Instead, the TaskManager retrieves the JobManager address via HA services and connects to it via pod ip." Do you know whether we can change the way the TaskManager connects to HA services, i.e., use the pod name instead of the IP address? I think we cannot apply the Akka workaround of bypassing the Istio sidecar. Thanks for the info ~ was (Author: sergiosp): Hello [~wangyang0918]! I am also experiencing this issue with a native Kubernetes deployment with HA enabled (not with the Flink Kubernetes operator). In [https://lists.apache.org/thread/yl40s9069wksz66qlf9t6jhmwsn59zft] you mentioned that "If HA enabled, the internal jobmanager rpc service will not be created. Instead, the TaskManager retrieves the JobManager address via HA services and connects to it via pod ip." Do you know whether we can change the way the TaskManager connects to HA services (i.e., use the service name instead of the IP address)? I think we cannot apply the Akka workaround of bypassing the Istio sidecar. 
Thanks for the info ~
[jira] [Commented] (FLINK-28171) Adjust Job and Task manager port definitions to work with Istio+mTLS
[ https://issues.apache.org/jira/browse/FLINK-28171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710375#comment-17710375 ] Sergio Sainz commented on FLINK-28171: -- Hello [~wangyang0918] ! also experiencing this issue with native kubernetes deployment with HA enabled (not in flink k8s operator). In [https://lists.apache.org/thread/yl40s9069wksz66qlf9t6jhmwsn59zft] you mentioned that "If HA enabled, the internal jobmanager rpc service will not be created. Instead, the TaskManager retrieves the JobManager address via HA services and connects to it via pod ip." Do you know whether we can change the way TaskManager connects to HA services (do not use ip address and instead use service name? I think we cannot add the akka workaround of bypassing the istio sidecar. Thanks for the info ~ > Adjust Job and Task manager port definitions to work with Istio+mTLS > > > Key: FLINK-28171 > URL: https://issues.apache.org/jira/browse/FLINK-28171 > Project: Flink > Issue Type: Improvement > Components: Deployment / Kubernetes >Affects Versions: 1.14.4 > Environment: flink-kubernetes-operator 1.0.0 > Flink 1.14-java11 > Kubernetes v1.19.5 > Istio 1.7.6 >Reporter: Moshe Elisha >Assignee: Moshe Elisha >Priority: Major > Labels: pull-request-available > > Hello, > > We are launching Flink deployments using the [Flink Kubernetes > Operator|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-stable/] > on a Kubernetes cluster with Istio and mTLS enabled. > > We found that the TaskManager is unable to communicate with the JobManager on > the jobmanager-rpc port: > > {{2022-06-15 15:25:40,508 WARN akka.remote.ReliableDeliverySupervisor > [] - Association with remote system > [akka.tcp://[flink@amf-events-to-inference-and-central.nwdaf-edge|mailto:flink@amf-events-to-inference-and-central.nwdaf-edge]:6123] > has failed, address is now gated for [50] ms. 
Reason: [Association failed > with > [akka.tcp://[flink@amf-events-to-inference-and-central.nwdaf-edge|mailto:flink@amf-events-to-inference-and-central.nwdaf-edge]:6123]] > Caused by: [The remote system explicitly disassociated (reason unknown).]}} > > The reason for the issue is that the JobManager service port definitions are > not following the Istio guidelines > [https://istio.io/latest/docs/ops/configuration/traffic-management/protocol-selection/] > (see example below). > > There was also an email discussion around this topic in the users mailing > group under the subject "Flink Kubernetes Operator with K8S + Istio + mTLS - > port definitions". > With the help of the community, we were able to work around the issue but it > was very hard and forced us to skip Istio proxy which is not ideal. > > We would like you to consider changing the default port definitions, either > # Rename the ports – I understand it is Istio specific guideline but maybe > it is better to at least be aligned with one (popular) vendor guideline > instead of none at all. > # Add the “appProtocol” property[1] that is not specific to any vendor but > requires Kubernetes >= 1.19 where it was introduced as beta and moved to > stable in >= 1.20. The option to add appProtocol property was added only in > [https://github.com/fabric8io/kubernetes-client/releases/tag/v5.10.0] with > [#3570|https://github.com/fabric8io/kubernetes-client/issues/3570]. > # Or allow a way to override the defaults. 
> > [https://kubernetes.io/docs/concepts/services-networking/_print/#application-protocol] > > > {{# k get service inference-results-to-analytics-engine -o yaml}} > {{apiVersion: v1}} > {{kind: Service}} > {{...}} > {{spec:}} > {{ clusterIP: None}} > {{ ports:}} > {{ - name: jobmanager-rpc *# should start with “tcp-“ or add "appProtocol" > property*}} > {{ port: 6123}} > {{ protocol: TCP}} > {{ targetPort: 6123}} > {{ - name: blobserver *# should start with "tcp-" or add "appProtocol" > property*}} > {{ port: 6124}} > {{ protocol: TCP}} > {{ targetPort: 6124}} > {{...}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
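For reference, a Service manifest following the Istio protocol-selection guideline could look like the sketch below, derived from the port list quoted above. The "tcp-" names and the appProtocol values are illustrative (option 1 and option 2 from the issue, shown together for compactness), and appProtocol requires Kubernetes >= 1.19:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: inference-results-to-analytics-engine   # name taken from the example above
spec:
  clusterIP: None
  ports:
    - name: tcp-jobmanager-rpc   # "tcp-" prefix lets Istio classify the protocol
      port: 6123
      protocol: TCP
      targetPort: 6123
      appProtocol: tcp           # vendor-neutral alternative (K8s >= 1.19)
    - name: tcp-blobserver
      port: 6124
      protocol: TCP
      targetPort: 6124
      appProtocol: tcp
```

In practice only one of the two mechanisms is needed per port; Istio honours either the name prefix or the appProtocol hint.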
[jira] [Updated] (FLINK-31685) Checkpoint job folder not deleted after job is cancelled
[ https://issues.apache.org/jira/browse/FLINK-31685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Sainz updated FLINK-31685: - Description: When flink job is being checkpointed, and after the job is cancelled, the checkpoint is indeed deleted (as per {{{}execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION{}}}), but the job-id folder still remains: [sergio@flink-cluster-54f7fc7c6-k6km8 JobCheckpoints]$ ls 01eff17aa2910484b5aeb644bc531172 3a59309ef018541fc0c20856d0d89855 78ff2344dd7ef89f9fbcc9789fc0cd79 a6fd7cec89c0af78c3353d4a46a7d273 dbc957868c08ebeb100d708bbd057593 04ff0abb9e860fc85f0e39d722367c3c 3e09166341615b1b4786efd6745a05d6 79efc000aa29522f0a9598661f485f67 a8c42bfe158abd78ebcb4adb135de61f dc8e04b02c9d8a1bc04b21d2c8f21f74 05f48019475de40230900230c63cfe89 3f9fb467c9af91ef41d527fe92f9b590 7a6ad7407d7120eda635d71cd843916a a8db748c1d329407405387ac82040be4 dfb2df1c25056e920d41c94b659dcdab 09d30bc0ff786994a6a3bb06abd3 455525b76a1c6826b6eaebd5649c5b6b 7b1458424496baaf3d020e9fece525a4 aa2ef9587b2e9c123744e8940a66a287 All folders in the above list, like {{01eff17aa2910484b5aeb644bc531172}} , are empty ~ *Expected behaviour:* The job folder id should also be deleted. 
was: When flink job is being checkpointed, and after the job is cancelled, the checkpoint is indeed deleted (as per {{{}execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION{}}}), but the job-id folder still remains: sergio@flink-cluster-54f7fc7c6-k6km8 JobCheckpoints]$ ls 01eff17aa2910484b5aeb644bc531172 3a59309ef018541fc0c20856d0d89855 78ff2344dd7ef89f9fbcc9789fc0cd79 a6fd7cec89c0af78c3353d4a46a7d273 dbc957868c08ebeb100d708bbd057593 04ff0abb9e860fc85f0e39d722367c3c 3e09166341615b1b4786efd6745a05d6 79efc000aa29522f0a9598661f485f67 a8c42bfe158abd78ebcb4adb135de61f dc8e04b02c9d8a1bc04b21d2c8f21f74 05f48019475de40230900230c63cfe89 3f9fb467c9af91ef41d527fe92f9b590 7a6ad7407d7120eda635d71cd843916a a8db748c1d329407405387ac82040be4 dfb2df1c25056e920d41c94b659dcdab 09d30bc0ff786994a6a3bb06abd3 455525b76a1c6826b6eaebd5649c5b6b 7b1458424496baaf3d020e9fece525a4 aa2ef9587b2e9c123744e8940a66a287 All folders in the above list, like {{01eff17aa2910484b5aeb644bc531172}} , are empty ~ *Expected behaviour:* The job folder id should also be deleted. 
> Checkpoint job folder not deleted after job is cancelled > > > Key: FLINK-31685 > URL: https://issues.apache.org/jira/browse/FLINK-31685 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.16.1 >Reporter: Sergio Sainz >Priority: Major > > When flink job is being checkpointed, and after the job is cancelled, the > checkpoint is indeed deleted (as per > {{{}execution.checkpointing.externalized-checkpoint-retention: > DELETE_ON_CANCELLATION{}}}), but the job-id folder still remains: > > [sergio@flink-cluster-54f7fc7c6-k6km8 JobCheckpoints]$ ls > 01eff17aa2910484b5aeb644bc531172 3a59309ef018541fc0c20856d0d89855 > 78ff2344dd7ef89f9fbcc9789fc0cd79 a6fd7cec89c0af78c3353d4a46a7d273 > dbc957868c08ebeb100d708bbd057593 > 04ff0abb9e860fc85f0e39d722367c3c 3e09166341615b1b4786efd6745a05d6 > 79efc000aa29522f0a9598661f485f67 a8c42bfe158abd78ebcb4adb135de61f > dc8e04b02c9d8a1bc04b21d2c8f21f74 > 05f48019475de40230900230c63cfe89 3f9fb467c9af91ef41d527fe92f9b590 > 7a6ad7407d7120eda635d71cd843916a a8db748c1d329407405387ac82040be4 > dfb2df1c25056e920d41c94b659dcdab > 09d30bc0ff786994a6a3bb06abd3 455525b76a1c6826b6eaebd5649c5b6b > 7b1458424496baaf3d020e9fece525a4 aa2ef9587b2e9c123744e8940a66a287 > All folders in the above list, like {{01eff17aa2910484b5aeb644bc531172}} , > are empty ~ > > *Expected behaviour:* > The job folder id should also be deleted. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-31685) Checkpoint job folder not deleted after job is cancelled
[ https://issues.apache.org/jira/browse/FLINK-31685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Sainz updated FLINK-31685: - Description: When flink job is being checkpointed, and after the job is cancelled, the checkpoint is indeed deleted (as per {{{}execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION{}}}), but the job-id folder still remains: sergio@flink-cluster-54f7fc7c6-k6km8 JobCheckpoints]$ ls 01eff17aa2910484b5aeb644bc531172 3a59309ef018541fc0c20856d0d89855 78ff2344dd7ef89f9fbcc9789fc0cd79 a6fd7cec89c0af78c3353d4a46a7d273 dbc957868c08ebeb100d708bbd057593 04ff0abb9e860fc85f0e39d722367c3c 3e09166341615b1b4786efd6745a05d6 79efc000aa29522f0a9598661f485f67 a8c42bfe158abd78ebcb4adb135de61f dc8e04b02c9d8a1bc04b21d2c8f21f74 05f48019475de40230900230c63cfe89 3f9fb467c9af91ef41d527fe92f9b590 7a6ad7407d7120eda635d71cd843916a a8db748c1d329407405387ac82040be4 dfb2df1c25056e920d41c94b659dcdab 09d30bc0ff786994a6a3bb06abd3 455525b76a1c6826b6eaebd5649c5b6b 7b1458424496baaf3d020e9fece525a4 aa2ef9587b2e9c123744e8940a66a287 All folders in the above list, like {{01eff17aa2910484b5aeb644bc531172}} , are empty ~ *Expected behaviour:* The job folder id should also be deleted. 
was: When flink job is being checkpointed, and after the job is cancelled, the checkpoint is indeed deleted (as per {{{}execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION{}}}), but the job-id folder still remains: [sergio@flink-cluster-54f7fc7c6-k6km8 JobCheckpoints]$ ls 01eff17aa2910484b5aeb644bc531172 3a59309ef018541fc0c20856d0d89855 78ff2344dd7ef89f9fbcc9789fc0cd79 a6fd7cec89c0af78c3353d4a46a7d273 dbc957868c08ebeb100d708bbd057593 04ff0abb9e860fc85f0e39d722367c3c 3e09166341615b1b4786efd6745a05d6 79efc000aa29522f0a9598661f485f67 a8c42bfe158abd78ebcb4adb135de61f dc8e04b02c9d8a1bc04b21d2c8f21f74 05f48019475de40230900230c63cfe89 3f9fb467c9af91ef41d527fe92f9b590 7a6ad7407d7120eda635d71cd843916a a8db748c1d329407405387ac82040be4 dfb2df1c25056e920d41c94b659dcdab 09d30bc0ff786994a6a3bb06abd3 455525b76a1c6826b6eaebd5649c5b6b 7b1458424496baaf3d020e9fece525a4 aa2ef9587b2e9c123744e8940a66a287 All folders in the above list, like {{01eff17aa2910484b5aeb644bc531172}} , are empty ~ *Expected behaviour:* The job-id folder should also be deleted. 
> Checkpoint job folder not deleted after job is cancelled > > > Key: FLINK-31685 > URL: https://issues.apache.org/jira/browse/FLINK-31685 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.16.1 >Reporter: Sergio Sainz >Priority: Major > > When flink job is being checkpointed, and after the job is cancelled, the > checkpoint is indeed deleted (as per > {{{}execution.checkpointing.externalized-checkpoint-retention: > DELETE_ON_CANCELLATION{}}}), but the job-id folder still remains: > > sergio@flink-cluster-54f7fc7c6-k6km8 JobCheckpoints]$ ls > 01eff17aa2910484b5aeb644bc531172 3a59309ef018541fc0c20856d0d89855 > 78ff2344dd7ef89f9fbcc9789fc0cd79 a6fd7cec89c0af78c3353d4a46a7d273 > dbc957868c08ebeb100d708bbd057593 > 04ff0abb9e860fc85f0e39d722367c3c 3e09166341615b1b4786efd6745a05d6 > 79efc000aa29522f0a9598661f485f67 a8c42bfe158abd78ebcb4adb135de61f > dc8e04b02c9d8a1bc04b21d2c8f21f74 > 05f48019475de40230900230c63cfe89 3f9fb467c9af91ef41d527fe92f9b590 > 7a6ad7407d7120eda635d71cd843916a a8db748c1d329407405387ac82040be4 > dfb2df1c25056e920d41c94b659dcdab > 09d30bc0ff786994a6a3bb06abd3 455525b76a1c6826b6eaebd5649c5b6b > 7b1458424496baaf3d020e9fece525a4 aa2ef9587b2e9c123744e8940a66a287 > All folders in the above list, like {{01eff17aa2910484b5aeb644bc531172}} , > are empty ~ > > *Expected behaviour:* > The job folder id should also be deleted. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-31685) Checkpoint job folder not deleted after job is cancelled
Sergio Sainz created FLINK-31685: Summary: Checkpoint job folder not deleted after job is cancelled Key: FLINK-31685 URL: https://issues.apache.org/jira/browse/FLINK-31685 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.16.1 Reporter: Sergio Sainz When flink job is being checkpointed, and after the job is cancelled, the checkpoint is indeed deleted (as per {{{}execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION{}}}), but the job-id folder still remains: [sergio@flink-cluster-54f7fc7c6-k6km8 JobCheckpoints]$ ls 01eff17aa2910484b5aeb644bc531172 3a59309ef018541fc0c20856d0d89855 78ff2344dd7ef89f9fbcc9789fc0cd79 a6fd7cec89c0af78c3353d4a46a7d273 dbc957868c08ebeb100d708bbd057593 04ff0abb9e860fc85f0e39d722367c3c 3e09166341615b1b4786efd6745a05d6 79efc000aa29522f0a9598661f485f67 a8c42bfe158abd78ebcb4adb135de61f dc8e04b02c9d8a1bc04b21d2c8f21f74 05f48019475de40230900230c63cfe89 3f9fb467c9af91ef41d527fe92f9b590 7a6ad7407d7120eda635d71cd843916a a8db748c1d329407405387ac82040be4 dfb2df1c25056e920d41c94b659dcdab 09d30bc0ff786994a6a3bb06abd3 455525b76a1c6826b6eaebd5649c5b6b 7b1458424496baaf3d020e9fece525a4 aa2ef9587b2e9c123744e8940a66a287 All folders in the above list, like {{01eff17aa2910484b5aeb644bc531172}} , are empty ~ *Expected behaviour:* The job-id folder should also be deleted. -- This message was sent by Atlassian Jira (v8.20.10#820010)
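As a stop-gap until this is fixed in Flink itself, the leftover empty job-id folders can be pruned externally. Below is a minimal sketch (not Flink code; the checkpoint root path is assumed to be the JobCheckpoints directory shown above):

```python
from pathlib import Path


def prune_empty_job_dirs(checkpoint_root: str) -> list[str]:
    """Remove immediate subdirectories of the checkpoint root that are
    completely empty (the leftover job-id folders described in this issue).

    Returns the names of the directories that were removed. Non-empty
    directories (e.g. jobs with retained checkpoints) are left untouched.
    """
    removed = []
    for child in sorted(Path(checkpoint_root).iterdir()):
        if child.is_dir() and not any(child.iterdir()):
            child.rmdir()
            removed.append(child.name)
    return removed
```

Running this periodically (or after job cancellation) against the configured checkpoint directory would keep the listing above from accumulating empty folders.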
[jira] [Commented] (FLINK-15736) Support Java 17 (LTS)
[ https://issues.apache.org/jira/browse/FLINK-15736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697544#comment-17697544 ] Sergio Sainz commented on FLINK-15736: -- Hi [~chesnay] , I saw in June last year that there was no timeline for implementing JDK 17 support. Just kindly checking whether the timeline has changed, as there has been some more investigation on this ticket and on the various related bugs found by the JDK 17 upgrade. Thanks! > Support Java 17 (LTS) > - > > Key: FLINK-15736 > URL: https://issues.apache.org/jira/browse/FLINK-15736 > Project: Flink > Issue Type: New Feature > Components: Build System >Reporter: Chesnay Schepler >Assignee: Chesnay Schepler >Priority: Major > Labels: auto-deprioritized-major, pull-request-available, > stale-assigned > > Long-term issue for preparing Flink for Java 17. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-31247) Support for fine-grained resource allocation (slot sharing groups) at the TableAPI level
Sergio Sainz created FLINK-31247: Summary: Support for fine-grained resource allocation (slot sharing groups) at the TableAPI level Key: FLINK-31247 URL: https://issues.apache.org/jira/browse/FLINK-31247 Project: Flink Issue Type: New Feature Components: Runtime / Coordination Affects Versions: 1.16.1 Reporter: Sergio Sainz Currently Flink allows fine-grained resource allocation at the DataStream API level (see [https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/finegrained_resource/]). This ticket is an enhancement request to support the same API at the Table API level: org.apache.flink.table.api.Table.setSlotSharingGroup(String ssg) org.apache.flink.table.api.Table.setSlotSharingGroup(SlotSharingGroup ssg) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-29235) CVE-2022-25857 on flink-shaded
[ https://issues.apache.org/jira/browse/FLINK-29235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629998#comment-17629998 ] Sergio Sainz commented on FLINK-29235: -- Hi, could we evaluate this fix for inclusion in 1.16.1? > CVE-2022-25857 on flink-shaded > -- > > Key: FLINK-29235 > URL: https://issues.apache.org/jira/browse/FLINK-29235 > Project: Flink > Issue Type: Bug > Components: Build System, BuildSystem / Shaded >Affects Versions: 1.15.2 >Reporter: Sergio Sainz >Assignee: Chesnay Schepler >Priority: Major > Fix For: 1.17.0 > > > flink-shaded-version uses snakeyaml v1.29 which is vulnerable to > CVE-2022-25857 > Ref: > https://nvd.nist.gov/vuln/detail/CVE-2022-25857 > https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-jackson/2.12.4-15.0/flink-shaded-jackson-2.12.4-15.0.pom > https://github.com/apache/flink-shaded/blob/master/flink-shaded-jackson-parent/flink-shaded-jackson-2/pom.xml#L73 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-29892) flink-conf.yaml does not accept hash (#) in the env.java.opts property
[ https://issues.apache.org/jira/browse/FLINK-29892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Sainz updated FLINK-29892: - Description: When adding a string with hash (#) character in env.java.opts in flink-conf.yaml , the string will be truncated from the # onwards even when the value is surrounded by single quotes or double quotes. example: (in flink-conf.yaml): env.java.opts: "-Djavax.net.ssl.trustStorePassword=my#pwd" the value shown on the flink taskmanagers or job managers is : env.java.opts: -Djavax.net.ssl.trustStorePassword=my was: When adding a string with hash (#) character in env.java.opts in flink-conf.yaml , the string will be truncated from the # onwards even when the value is surrounded by single quotes or double quotes. example: (in flink-conf.yaml): env.java.opts: -Djavax.net.ssl.trustStorePassword=my#pwd the value shown on the flink taskmanagers or job managers is : env.java.opts: -Djavax.net.ssl.trustStorePassword=my > flink-conf.yaml does not accept hash (#) in the env.java.opts property > -- > > Key: FLINK-29892 > URL: https://issues.apache.org/jira/browse/FLINK-29892 > Project: Flink > Issue Type: Bug > Components: Deployment / Kubernetes >Affects Versions: 1.15.2 >Reporter: Sergio Sainz >Priority: Major > > When adding a string with hash (#) character in env.java.opts in > flink-conf.yaml , the string will be truncated from the # onwards even when > the value is surrounded by single quotes or double quotes. > example: > (in flink-conf.yaml): > env.java.opts: "-Djavax.net.ssl.trustStorePassword=my#pwd" > > the value shown on the flink taskmanagers or job managers is : > env.java.opts: -Djavax.net.ssl.trustStorePassword=my > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29892) flink-conf.yaml does not accept hash (#) in the env.java.opts property
Sergio Sainz created FLINK-29892: Summary: flink-conf.yaml does not accept hash (#) in the env.java.opts property Key: FLINK-29892 URL: https://issues.apache.org/jira/browse/FLINK-29892 Project: Flink Issue Type: Bug Components: Deployment / Kubernetes Affects Versions: 1.15.2 Reporter: Sergio Sainz When adding a string with a hash (#) character to env.java.opts in flink-conf.yaml, the string will be truncated from the # onwards, even when the value is surrounded by single or double quotes. Example (in flink-conf.yaml): env.java.opts: -Djavax.net.ssl.trustStorePassword=my#pwd The value shown on the flink taskmanagers or job managers is: env.java.opts: -Djavax.net.ssl.trustStorePassword=my -- This message was sent by Atlassian Jira (v8.20.10#820010)
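The symptom is consistent with a comment-stripping pass that does not honour quoting. The sketch below models that assumed behaviour (it is not Flink's actual parser) to show why quoting the value does not protect the part after the hash:

```python
def parse_flink_conf_line(line: str) -> tuple[str, str]:
    """Assumed model of a legacy flink-conf.yaml line parser:
    everything after '#' is dropped as a comment, regardless of quotes,
    then the remainder is split on the first ':' into key and value."""
    line = line.split("#", 1)[0]           # comment stripping ignores quoting
    key, _, value = line.partition(":")
    return key.strip(), value.strip().strip('"')


key, value = parse_flink_conf_line(
    'env.java.opts: "-Djavax.net.ssl.trustStorePassword=my#pwd"'
)
# value ends up as -Djavax.net.ssl.trustStorePassword=my -- the #pwd tail is lost,
# matching the truncated value observed on the task and job managers.
```

Under this model the quotes are simply part of the (already truncated) value, which is why neither single nor double quoting helps.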
[jira] [Comment Edited] (FLINK-29235) CVE-2022-25857 on flink-shaded
[ https://issues.apache.org/jira/browse/FLINK-29235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17623369#comment-17623369 ] Sergio Sainz edited comment on FLINK-29235 at 10/24/22 7:45 PM: Hello [~chesnay] Noticed the flink-shaded-jackson v2.13.4-16.0 already has the fix (it uses jackson's own snakeyaml version which is 1.31). Could we upgrade flink-shaded version in flink version 1.16.0 to use 2.13.4-16.0? [https://github.com/apache/flink/blob/release-1.16.0-rc2/pom.xml#L125] {{16.0 }}{{2.13.4}} Ref: [https://repo1.maven.org/maven2/com/fasterxml/jackson/dataformat/jackson-dataformat-yaml/2.13.4/jackson-dataformat-yaml-2.13.4.pom] [https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-jackson/2.13.4-16.0/flink-shaded-jackson-2.13.4-16.0.pom] was (Author: sergiosp): Hello [~chesnay] Noticed the flink-shaded-jackson v2.13.4-16.0 already has the fix (it uses jackson's own snakeyaml version which is 1.31). Could we upgrade flink-shaded version in flink version 1.16.0 to use 2.13.4-16.0? 
[https://github.com/apache/flink/blob/release-1.16.0-rc2/pom.xml#L125] ``` 16.0 2.13.4 ``` ref: [https://repo1.maven.org/maven2/com/fasterxml/jackson/dataformat/jackson-dataformat-yaml/2.13.4/jackson-dataformat-yaml-2.13.4.pom] [https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-jackson/2.13.4-16.0/flink-shaded-jackson-2.13.4-16.0.pom] > CVE-2022-25857 on flink-shaded > -- > > Key: FLINK-29235 > URL: https://issues.apache.org/jira/browse/FLINK-29235 > Project: Flink > Issue Type: Bug > Components: Build System, BuildSystem / Shaded >Affects Versions: 1.15.2 >Reporter: Sergio Sainz >Assignee: Chesnay Schepler >Priority: Major > > flink-shaded-version uses snakeyaml v1.29 which is vulnerable to > CVE-2022-25857 > Ref: > https://nvd.nist.gov/vuln/detail/CVE-2022-25857 > https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-jackson/2.12.4-15.0/flink-shaded-jackson-2.12.4-15.0.pom > https://github.com/apache/flink-shaded/blob/master/flink-shaded-jackson-parent/flink-shaded-jackson-2/pom.xml#L73 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-29235) CVE-2022-25857 on flink-shaded
[ https://issues.apache.org/jira/browse/FLINK-29235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17623369#comment-17623369 ] Sergio Sainz edited comment on FLINK-29235 at 10/24/22 7:45 PM: Hello [~chesnay] Noticed the flink-shaded-jackson v2.13.4-16.0 already has the fix (it uses jackson's own snakeyaml version which is 1.31). Could we upgrade flink-shaded version in flink version 1.16.0 to use 2.13.4-16.0? [https://github.com/apache/flink/blob/release-1.16.0-rc2/pom.xml#L125] {{16.0 2.13.4}} Ref: [https://repo1.maven.org/maven2/com/fasterxml/jackson/dataformat/jackson-dataformat-yaml/2.13.4/jackson-dataformat-yaml-2.13.4.pom] [https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-jackson/2.13.4-16.0/flink-shaded-jackson-2.13.4-16.0.pom] was (Author: sergiosp): Hello [~chesnay] Noticed the flink-shaded-jackson v2.13.4-16.0 already has the fix (it uses jackson's own snakeyaml version which is 1.31). Could we upgrade flink-shaded version in flink version 1.16.0 to use 2.13.4-16.0? 
[https://github.com/apache/flink/blob/release-1.16.0-rc2/pom.xml#L125] {{16.0 }}{{2.13.4}} Ref: [https://repo1.maven.org/maven2/com/fasterxml/jackson/dataformat/jackson-dataformat-yaml/2.13.4/jackson-dataformat-yaml-2.13.4.pom] [https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-jackson/2.13.4-16.0/flink-shaded-jackson-2.13.4-16.0.pom] > CVE-2022-25857 on flink-shaded > -- > > Key: FLINK-29235 > URL: https://issues.apache.org/jira/browse/FLINK-29235 > Project: Flink > Issue Type: Bug > Components: Build System, BuildSystem / Shaded >Affects Versions: 1.15.2 >Reporter: Sergio Sainz >Assignee: Chesnay Schepler >Priority: Major > > flink-shaded-version uses snakeyaml v1.29 which is vulnerable to > CVE-2022-25857 > Ref: > https://nvd.nist.gov/vuln/detail/CVE-2022-25857 > https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-jackson/2.12.4-15.0/flink-shaded-jackson-2.12.4-15.0.pom > https://github.com/apache/flink-shaded/blob/master/flink-shaded-jackson-parent/flink-shaded-jackson-2/pom.xml#L73 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-29235) CVE-2022-25857 on flink-shaded
[ https://issues.apache.org/jira/browse/FLINK-29235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17623369#comment-17623369 ] Sergio Sainz commented on FLINK-29235: -- Hello [~chesnay] Noticed the flink-shaded-jackson v2.13.4-16.0 already has the fix (it uses jackson's own snakeyaml version which is 1.31). Could we upgrade flink-shaded version in flink version 1.16.0 to use 2.13.4-16.0? [https://github.com/apache/flink/blob/release-1.16.0-rc2/pom.xml#L125] ``` 16.0 2.13.4 ``` ref: [https://repo1.maven.org/maven2/com/fasterxml/jackson/dataformat/jackson-dataformat-yaml/2.13.4/jackson-dataformat-yaml-2.13.4.pom] [https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-jackson/2.13.4-16.0/flink-shaded-jackson-2.13.4-16.0.pom] > CVE-2022-25857 on flink-shaded > -- > > Key: FLINK-29235 > URL: https://issues.apache.org/jira/browse/FLINK-29235 > Project: Flink > Issue Type: Bug > Components: Build System, BuildSystem / Shaded >Affects Versions: 1.15.2 >Reporter: Sergio Sainz >Assignee: Chesnay Schepler >Priority: Major > > flink-shaded-version uses snakeyaml v1.29 which is vulnerable to > CVE-2022-25857 > Ref: > https://nvd.nist.gov/vuln/detail/CVE-2022-25857 > https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-jackson/2.12.4-15.0/flink-shaded-jackson-2.12.4-15.0.pom > https://github.com/apache/flink-shaded/blob/master/flink-shaded-jackson-parent/flink-shaded-jackson-2/pom.xml#L73 -- This message was sent by Atlassian Jira (v8.20.10#820010)
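For clarity, the two values quoted above ({{16.0}} and {{2.13.4}}) are the pom.xml version properties being proposed for the bump. In the Flink root pom they presumably correspond to something like the fragment below (property names are assumed; verify against the linked pom.xml):

```xml
<properties>
  <!-- bump flink-shaded so the snakeyaml 1.31 fix is pulled in transitively -->
  <flink.shaded.version>16.0</flink.shaded.version>
  <flink.shaded.jackson.version>2.13.4</flink.shaded.jackson.version>
</properties>
```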
[jira] [Created] (FLINK-29631) [CVE-2022-42003] flink-shaded-jackson
Sergio Sainz created FLINK-29631: Summary: [CVE-2022-42003] flink-shaded-jackson Key: FLINK-29631 URL: https://issues.apache.org/jira/browse/FLINK-29631 Project: Flink Issue Type: Bug Components: BuildSystem / Shaded Affects Versions: shaded-16.0 Reporter: Sergio Sainz flink-shaded-jackson vulnerable to 7.5 (high) [https://nvd.nist.gov/vuln/detail/CVE-2022-42003] Ref: [https://nvd.nist.gov/vuln/detail/CVE-2022-42003] [https://repo1.maven.org/maven2/org/apache/flink/flink-shaded/16.0/flink-shaded-16.0.pom] [https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-jackson-parent/2.13.4-16.0] [https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-jackson/2.13.4-16.0/flink-shaded-jackson-2.13.4-16.0.pom] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29235) CVE-2022-25857 on flink-shaded
Sergio Sainz created FLINK-29235: Summary: CVE-2022-25857 on flink-shaded Key: FLINK-29235 URL: https://issues.apache.org/jira/browse/FLINK-29235 Project: Flink Issue Type: Bug Components: BuildSystem / Shaded Affects Versions: 1.15.2 Reporter: Sergio Sainz flink-shaded-version uses snakeyaml v1.29 which is vulnerable to CVE-2022-25857 Ref: https://nvd.nist.gov/vuln/detail/CVE-2022-25857 https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-jackson/2.12.4-15.0/flink-shaded-jackson-2.12.4-15.0.pom https://github.com/apache/flink-shaded/blob/master/flink-shaded-jackson-parent/flink-shaded-jackson-2/pom.xml#L73 -- This message was sent by Atlassian Jira (v8.20.10#820010)