Hi,

sorry, I got sidetracked this week. Yes, I was thinking about the
possibility of changing the logic so that the names are stable, the way a
stateful set would name its pods but without needing an actual stateful
set. When I look at metrics, particularly those exposed not by Flink but by
other systems, stable names make it easier to track behavior, for example
memory utilization across a pod restart. It might even make such restarts
easier to detect, since some Task Manager metrics would be missing for a
short period, and that could help narrow down the timestamps of related
logs so that potential OOM kills can be corroborated with the resource
manager's logs. I think this could be more helpful than exposing the
attempt ID as part of the pod name. What do you think?
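
To make the idea concrete, here is a rough sketch. The "current" format
string mirrors what I see in TASK_MANAGER_POD_FORMAT; the "stable" variant
(simply dropping the attempt id) is only a hypothetical illustration, not a
concrete proposal for the final naming:

```java
// Sketch of the naming change I have in mind. The "current" format mirrors
// TASK_MANAGER_POD_FORMAT from the snippet below; the "stable" variant is
// hypothetical and just drops the attempt id.
public class PodNaming {

    // Current scheme: <clusterId>-taskmanager-<attemptId>-<podId>.
    // The attempt id changes on every ResourceManager rebuild, so pod
    // names (and metric labels derived from them) are not stable.
    static final String CURRENT_FORMAT = "%s-taskmanager-%d-%d";

    // Hypothetical stable scheme: <clusterId>-taskmanager-<podId>, so a
    // replacement pod reuses the name, like a StatefulSet ordinal would.
    static final String STABLE_FORMAT = "%s-taskmanager-%d";

    static String currentName(String clusterId, long attemptId, long podId) {
        return String.format(CURRENT_FORMAT, clusterId, attemptId, podId);
    }

    static String stableName(String clusterId, long podId) {
        return String.format(STABLE_FORMAT, clusterId, podId);
    }

    public static void main(String[] args) {
        // After an RM rebuild the attempt id bumps and the name changes...
        System.out.println(currentName("my-cluster", 1, 3));
        System.out.println(currentName("my-cluster", 2, 3));
        // ...whereas the stable name would survive the restart.
        System.out.println(stableName("my-cluster", 3));
    }
}
```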

Regards,
Alexis.

Am Mo., 3. Juni 2024 um 08:35 Uhr schrieb spoon_lz <spoon...@126.com>:

> Hi Alexis,
>
> Are you describing the logic of adjusting the following part?
>
>     final String podName = String.format(TASK_MANAGER_POD_FORMAT, clusterId,
>         currentMaxAttemptId, ++currentMaxPodId);
>
> `currentMaxAttemptId` relates to the number of rebuilds of the
> ResourceManager, and `currentMaxPodId` describes the pod index.
>
>
>
> Regards,
> Zhuo.
> ---- Replied Message ----
> From: Alexis Sarda-Espinosa <sarda.espin...@gmail.com>
> Date: 06/3/2024 14:19
> To: <dev@flink.apache.org>
> Subject: Re: Native Kubernetes Task Managers
> Ah no, I meant that I wouldn't use a stateful set, rather just adjust the
> names of the pods that are created/managed directly by the job manager.
>
> Regards,
> Alexis.
>
> Am Mo., 3. Juni 2024 um 07:31 Uhr schrieb Xintong Song <
> tonysong...@gmail.com>:
>
> I may not have understood what you mean by the naming scheme. I think the
> limitation "pods in a StatefulSet are always terminated in the reverse
> order as they are created" comes from Kubernetes and has nothing to do with
> the naming scheme.
>
> Best,
>
> Xintong
>
>
>
> On Mon, Jun 3, 2024 at 1:13 PM Alexis Sarda-Espinosa <
> sarda.espin...@gmail.com> wrote:
>
> Hi Xintong,
>
> After experimenting a bit, I came to roughly the same conclusion: cleanup
> is what's more or less incompatible if Kubernetes manages the pods. Then it
> might be better to just allow using a more stable pod naming scheme that
> doesn't depend on the attempt number and thus produces more stable task
> manager metrics. I'll explore that.
>
> Regards,
> Alexis.
>
> On Mon, 3 Jun 2024, 03:35 Xintong Song, <tonysong...@gmail.com> wrote:
>
> I think the reason we didn't choose StatefulSet when introducing the Native
> K8s Deployment is that, IIRC, we want Flink's ResourceManager to have full
> control of the individual pod lifecycles.
>
> E.g.,
> - Pods in a StatefulSet are always terminated in the reverse order as they
> are created. This prevents us from releasing a specific idle TM that is not
> necessarily the one created last.
> - If a pod is unexpectedly terminated, Flink's ResourceManager should
> decide whether to restart it or not according to the job status.
> (Technically, the same issue as above, in that we may want pods to be
> terminated / deleted in a different order.)
>
> There might be some other reasons. I just cannot recall all the details.
>
> As for determining whether a pod is OOM killed, I think Flink does print
> diagnostics for terminated pods in JM logs, i.e. the `exitCode`, `reason`
> and `message` of the `Terminated` container state. In our production, it
> shows "(exitCode=137, reason=OOMKilled, message=null)". However, since the
> diagnostics come from K8s, I'm not 100% sure whether this behavior is the
> same for all K8s versions.
>
> Best,
>
> Xintong
>
>
>
> On Sun, Jun 2, 2024 at 7:35 PM Alexis Sarda-Espinosa <
> sarda.espin...@gmail.com> wrote:
>
> Hi devs,
>
> Some time ago I asked about the way Task Manager pods are handled by the
> native Kubernetes driver [1]. I have now looked a bit through the source
> code and I think it could be possible to deploy TMs with a stateful set,
> which could allow tracking OOM kills as I mentioned in my original email,
> and could also make it easier to track metrics and create alerts, since
> the labels wouldn't change as much.
>
> One challenge is probably the new elastic scaling features [2], since the
> driver would have to differentiate between new pod requests due to a TM
> terminating, and a request due to scaling. I'm also not sure where
> downscaling requests are currently handled.
>
> I would be interested in taking a look at this and seeing if I can get
> something working. I think it would be possible to make it configurable in
> a way that maintains backwards compatibility. Would it be ok if I enter a
> Jira ticket and try it out?
>
> Regards,
> Alexis.
>
> [1] https://lists.apache.org/thread/jysgdldv8swgf4fhqwqochgf6hq0qs52
> [2] https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/deployment/elastic_scaling/