[jira] [Commented] (FLINK-17976) Test native K8s integration

Robert Metzger (Jira) Tue, 09 Jun 2020 06:03:18 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129249#comment-17129249
 ]


Robert Metzger commented on FLINK-17976:
----------------------------------------

Thanks a lot for your comments.


> If you mean the jobmanager web dashboard are public accessible when service 
> exposed type is LoadBalancer, then it is true. Just as you say, maybe we need 
> to add a warning for users. However, providing a separate LB for each Flink 
> cluster is too expensive, users usually use customized ingress.

I will add a warning to the docs.

> Currently, if the TaskManager is launched successfully, it could be released 
> after idle timeout. However, it seems that your cluster does not have enough 
> resource, then all the pods are pending. It is an expected behavior just like 
> YARN.
> If you kill/delete an active pod, it will be terminated and a new one will be 
> allocated. So the pending pods increase. Once it is launched and register to 
> Flink ResourceManager, the pending pods will decrease.

The scenario was the following:
Running TaskManagers: 4
Pending/Requested TaskManagers: 30

.. then I killed one of the running TaskManagers ...

Running TaskManagers: 4
Pending/Requested TaskManagers: 56

Expected behavior: The number of "Pending/Requested TaskManagers" should at 
least not increase. Ideally, the {{KubernetesResourceManager}} cancels 
"Pending TMs" after a timeout of, say, 10 minutes.
Actual behavior: The number of pending TaskManagers goes up even though there 
are plenty of unfulfilled requests pending.

Why is this bad? My Kubernetes cluster was basically polluted / spammed with 
pending pods.
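
To make the expected behavior concrete, here is a minimal sketch in Python of the bookkeeping described above. It is a toy model, not Flink's actual {{KubernetesResourceManager}} code: the class name, method names, and the 600-second timeout are all hypothetical and chosen only for illustration. The idea is that a replacement is requested only when running plus pending workers no longer cover the requirement, and that requests pending for too long are cancelled.

```python
from dataclasses import dataclass, field


@dataclass
class PendingWorkerTracker:
    """Toy model of worker-request bookkeeping (hypothetical, not Flink code)."""

    required: int                      # total workers the job needs
    pending_timeout_s: float = 600.0   # assumed timeout of 10 minutes
    running: int = 0
    pending_since: list = field(default_factory=list)  # request timestamps

    def request_missing(self, now):
        """Request only what running + pending workers do not already cover."""
        missing = max(0, self.required - self.running - len(self.pending_since))
        self.pending_since.extend([now] * missing)
        return missing

    def on_worker_started(self):
        """The oldest pending request was fulfilled and is now running."""
        if self.pending_since:
            self.pending_since.pop(0)
            self.running += 1

    def on_worker_lost(self, now):
        """A running worker died; top up requests only if there is a gap."""
        self.running = max(0, self.running - 1)
        return self.request_missing(now)

    def expire_pending(self, now):
        """Cancel requests that have been pending for the timeout or longer."""
        kept = [t for t in self.pending_since
                if now - t < self.pending_timeout_s]
        cancelled = len(self.pending_since) - len(kept)
        self.pending_since = kept
        return cancelled


# Scenario from this comment: 4 running, 30 pending, then one worker is killed.
tracker = PendingWorkerTracker(required=34, running=4)
tracker.request_missing(now=0.0)     # creates the 30 pending requests
tracker.on_worker_lost(now=60.0)     # exactly one replacement is requested
tracker.on_worker_started()          # the freed capacity starts one pending worker
print(tracker.running, len(tracker.pending_since))
```

In this model the counts settle back at 4 running and 30 pending after the kill, instead of the pending count growing, and a later call to expire_pending() would drop the requests that have exceeded the timeout.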


> Test native K8s integration
> ---------------------------
>
>                 Key: FLINK-17976
>                 URL: https://issues.apache.org/jira/browse/FLINK-17976
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.0
>            Reporter: Till Rohrmann
>            Assignee: Robert Metzger
>            Priority: Critical
>              Labels: release-testing
>             Fix For: 1.11.0
>
>         Attachments: enough_tm_wait_5min.txt
>
>
> Test Flink's native K8s integration:
> * session mode
> * application mode
> * custom Flink image
> * custom configuration and log properties



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
