Re: Frequent Flink JM restarts due to Kube API server errors.
This might be related to FLINK-28481, which is a bug in the fabric8io Kubernetes client.

[1] https://issues.apache.org/jira/browse/FLINK-28481

Best,
Yang
Re: Frequent Flink JM restarts due to Kube API server errors.
Hi Matthias,

I was wondering if there are any timeout or heartbeat configurations available for Kubernetes HA.

Thanks.
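For reference, Flink does expose a few Kubernetes HA timing and retry settings that can be tuned in flink-conf.yaml. The snippet below shows the relevant option names with their documented defaults; treat it as a starting point and check the configuration reference for your Flink version (1.14 here) before relying on any of them:

```yaml
# Kubernetes HA leader-election timing (documented defaults shown).
high-availability.kubernetes.leader-election.lease-duration: 15s
high-availability.kubernetes.leader-election.renew-deadline: 15s
high-availability.kubernetes.leader-election.retry-period: 5s

# How many times a failing ConfigMap transactional update is retried
# before the operation is given up as failed.
kubernetes.transactional-operation.max-retries: 5
```

Raising the lease duration and renew deadline makes leadership more tolerant of a briefly unresponsive API server, at the cost of slower failover when the leader genuinely dies.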
Re: Frequent Flink JM restarts due to Kube API server errors.
That's stated in the Jira issue. I didn't have the time to investigate it further.
Re: Frequent Flink JM restarts due to Kube API server errors.
Hi Matthias,

Thanks for the suggestion. Do we know which part of the code caused this issue and how it was fixed?

Thanks!
Re: Frequent Flink JM restarts due to Kube API server errors.
Hi Lavkesh,

FLINK-33998 [1] sounds quite similar to what you describe.

The solution there was to upgrade to Flink 1.14.6. I didn't have the capacity to look into the details, considering that Flink 1.14 is no longer officially supported by the community and a fix seems to have been provided in a newer version.

Matthias

[1] https://issues.apache.org/jira/browse/FLINK-33998
Re: Frequent Flink JM restarts due to Kube API server errors.
Hi, a few more details: we are running GKE version 1.27.7-gke.1121002 and Flink version 1.14.3.

Thanks!
Frequent Flink JM restarts due to Kube API server errors.
Hi All,

We run a Flink operator on GKE, deploying one Flink job per JobManager. We use org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory for high availability. The JobManager uses ConfigMaps for checkpointing and leader election. If, at any point, the Kube API server returns an error (5xx or 4xx), the JM pod is restarted. This occurs sporadically, every 1-2 days, for some of the 400 jobs running in the same cluster, each with its own JobManager pod.

What might be causing these errors from the Kube API server? One possibility is that when the JM writes a ConfigMap and attempts to retrieve it immediately afterwards, the read returns a 404.

Are there any configurations to increase heartbeats or timeouts that might help ride out temporary disconnections from the Kube API server?

Thank you!
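As an aside, transient 404/5xx responses from an API server can often be absorbed by retrying with backoff instead of failing the process outright. The sketch below is a generic, self-contained illustration of that pattern in Java; the failing supplier merely simulates a ConfigMap read that is briefly not visible, and Flink's actual retry behavior is governed by its own configuration, not by user code like this:

```java
import java.util.concurrent.Callable;

/** Minimal retry-with-exponential-backoff sketch for transient API errors. */
public class RetryExample {

    /** Retries {@code call} up to {@code maxAttempts} times, doubling the delay after each failure. */
    static <T> T retryWithBackoff(Callable<T> call, int maxAttempts, long initialDelayMs)
            throws Exception {
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // a real client would retry only transient (404/5xx) errors
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulate a read that fails twice with a transient error, then succeeds.
        final int[] calls = {0};
        String result = retryWithBackoff(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("404: ConfigMap not yet visible");
            }
            return "config-map-contents";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The key design point is distinguishing transient errors (worth retrying) from permanent ones (fail fast); the simulation above retries everything for brevity.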