Re: Frequent Flink JM restarts due to Kube API server errors.

2024-02-05 Thread Yang Wang
This might be related to FLINK-28481 [1], which is a bug in the fabric8io
Kubernetes client.

[1] https://issues.apache.org/jira/browse/FLINK-28481

Best,
Yang


Re: Frequent Flink JM restarts due to Kube API server errors.

2024-02-05 Thread Lavkesh Lahngir
Hi Matthias, I was wondering whether there are any timeout or heartbeat
configurations available for the Kubernetes HA services.

Thanks.
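
For reference, the Flink configuration documentation lists leader-election
timing options for the Kubernetes HA services and a retry limit for
transactional ConfigMap operations. The sketch below renders them as
flink-conf.yaml lines. The option keys are taken from the Flink docs as best
I recall them and should be checked against the documentation for the Flink
version in use (1.14.x here). The values are illustrative assumptions, not
tuning recommendations, and `render_flink_conf` is a hypothetical helper, not
part of Flink.

```python
# Option keys as listed in the Flink configuration docs (verify for your version).
# Values are illustrative assumptions only, not recommendations.
HA_TIMING_OPTIONS = {
    "high-availability.kubernetes.leader-election.lease-duration": "60 s",
    "high-availability.kubernetes.leader-election.renew-deadline": "60 s",
    "high-availability.kubernetes.leader-election.retry-period": "10 s",
    "kubernetes.transactional-operation.max-retries": "15",
}


def render_flink_conf(options):
    """Render a dict of options as 'key: value' lines for flink-conf.yaml."""
    return "\n".join(f"{key}: {value}" for key, value in options.items())


if __name__ == "__main__":
    print(render_flink_conf(HA_TIMING_OPTIONS))
```

Longer lease and renew windows make the leader more tolerant of a briefly
unreachable API server, at the cost of slower failover when a JobManager is
really gone.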


Re: Frequent Flink JM restarts due to Kube API server errors.

2024-02-05 Thread Matthias Pohl
That's stated in the Jira issue. I didn't have the time to investigate it
further.


Re: Frequent Flink JM restarts due to Kube API server errors.

2024-02-05 Thread Lavkesh Lahngir
Hi Matthias,
Thanks for the suggestion. Do we know which part of the code caused this
issue and how it was fixed?

Thanks!


Re: Frequent Flink JM restarts due to Kube API server errors.

2024-02-05 Thread Matthias Pohl
Hi Lavkesh,
FLINK-33998 [1] sounds quite similar to what you describe.

The solution was to upgrade to Flink version 1.14.6. I didn't have the
capacity to look into the details considering that the mentioned Flink
version 1.14 is not officially supported by the community anymore and a fix
seems to have been provided with a newer version.

Matthias

[1] https://issues.apache.org/jira/browse/FLINK-33998


Re: Frequent Flink JM restarts due to Kube API server errors.

2024-02-04 Thread Lavkesh Lahngir
Hi, a few more details:
We are running GKE version 1.27.7-gke.1121002 and Flink version 1.14.3.

Thanks!


Frequent Flink JM restarts due to Kube API server errors.

2024-02-04 Thread Lavkesh Lahngir
Hi all,

We run a Flink operator on GKE, deploying one Flink job per JobManager. We
use
org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
for high availability. The JobManager uses ConfigMaps for checkpointing
and leader election. If, at any point, the Kube API server returns an error
(5xx or 4xx), the JM pod is restarted. This happens sporadically, every
1-2 days, for some of the 400 jobs running in the same cluster, each with
its own JobManager pod.

What might be causing these errors from the Kube API server? One possibility
is that when the JM writes a ConfigMap and attempts to retrieve it
immediately afterwards, the read could result in a 404 error.
Are there any heartbeat or timeout settings we could increase, in case they
are what is causing the temporary disconnections from the Kube API server?

Thank you!
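
The failure mode described above, where a single 4xx/5xx response leads to a
JM restart, contrasts with the common client-side pattern of retrying
transient API errors with backoff. The following is a minimal sketch of that
pattern in Python. It is not Flink's or the fabric8 client's code: `ApiError`,
`is_retryable` and `call_with_retries` are hypothetical names, and treating a
404 directly after a write as retryable is an assumption that simply mirrors
the hypothesis in the message.

```python
import time


class ApiError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed API call."""

    def __init__(self, status):
        super().__init__(f"API server returned HTTP {status}")
        self.status = status


def is_retryable(status, retry_not_found=False):
    """Treat 5xx and 429 as transient; optionally also a 404 right after a write."""
    if status >= 500 or status == 429:
        return True
    return retry_not_found and status == 404


def call_with_retries(operation, max_attempts=5, initial_delay=0.5,
                      retry_not_found=False, sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ApiError as err:
            # Give up on the last attempt or on errors that retrying cannot fix.
            if attempt == max_attempts or not is_retryable(err.status, retry_not_found):
                raise
            sleep(delay)
            delay *= 2


if __name__ == "__main__":
    attempts = {"count": 0}

    def read_config_map():
        # Simulated call: fails twice with 503, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ApiError(503)
        return {"leader": "jobmanager-0"}

    print(call_with_retries(read_config_map, initial_delay=0.01))
```

Errors such as 403 are raised immediately, since repeating the call would not
change the outcome. Only the transient classes are retried.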