[jira] [Created] (FLINK-17709) Active Kubernetes integration phase 3 - Advanced Features

2020-05-14 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17709:


 Summary: Active Kubernetes integration phase 3 - Advanced Features
 Key: FLINK-17709
 URL: https://issues.apache.org/jira/browse/FLINK-17709
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes, Runtime / Coordination
Reporter: Canbin Zheng
 Fix For: 1.12.0


This is the umbrella issue to track all the advanced features for phase 3 of 
active Kubernetes integration in Flink 1.12.0. Some of the features are:

# Support multiple JobManagers in ZooKeeper based HA setups.
# Support user-specified pod templates.
# Support FileSystem based high availability.
# Support running PyFlink.
# Support accessing secured services via K8s secrets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17707) Support configuring replica of Deployment in ZooKeeper based HA setups

2020-05-14 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17707:


 Summary: Support configuring replica of Deployment in ZooKeeper 
based HA setups
 Key: FLINK-17707
 URL: https://issues.apache.org/jira/browse/FLINK-17707
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes, Runtime / Coordination
Reporter: Canbin Zheng
 Fix For: 1.12.0


At the moment, in the native K8s setups, we hard-code the replica count of the
Deployment to 1. However, when users enable the ZooKeeper
HighAvailabilityServices, they may also want to configure the replica count of
the Deployment for faster failover.

This ticket proposes to make the *replica* count of the Deployment configurable
in the ZooKeeper based HA setups.
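The proposed validation could look roughly like the sketch below; the option
name {{kubernetes.jobmanager.replicas}} and the helper class are hypothetical
illustrations of the idea, not Flink's actual API:

```java
import java.util.Map;

/** Hypothetical sketch: validate a user-supplied JobManager replica count. */
public class ReplicaPrecheck {

    /**
     * Returns the replica count to use for the JobManager Deployment.
     * More than one replica only makes sense when ZooKeeper HA is enabled,
     * since standby JobManagers need leader election.
     */
    static int resolveReplicas(Map<String, String> config) {
        int replicas = Integer.parseInt(
                config.getOrDefault("kubernetes.jobmanager.replicas", "1"));
        boolean zkHaEnabled =
                "zookeeper".equals(config.get("high-availability"));
        if (replicas < 1) {
            throw new IllegalArgumentException(
                    "kubernetes.jobmanager.replicas must be >= 1, got " + replicas);
        }
        if (replicas > 1 && !zkHaEnabled) {
            throw new IllegalArgumentException(
                    "More than one JobManager replica requires ZooKeeper HA.");
        }
        return replicas;
    }

    public static void main(String[] args) {
        System.out.println(resolveReplicas(
                Map.of("high-availability", "zookeeper",
                       "kubernetes.jobmanager.replicas", "2")));
    }
}
```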





[jira] [Created] (FLINK-17598) Implement FileSystemHAServices for native K8s setups

2020-05-09 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17598:


 Summary: Implement FileSystemHAServices for native K8s setups
 Key: FLINK-17598
 URL: https://issues.apache.org/jira/browse/FLINK-17598
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes, Runtime / Coordination
Reporter: Canbin Zheng


At the moment we use ZooKeeper as the distributed coordinator for implementing
JobManager high availability services. In cloud-native environments, however,
there is a trend of more and more users preferring *Kubernetes* as the
underlying scheduler backend and *object storage* as the storage medium;
neither of these services requires a ZooKeeper deployment.

As a result, in K8s setups people have to deploy and maintain an additional
ZooKeeper cluster just to solve the JobManager SPOF. This ticket proposes a
simplified FileSystem HA implementation with leader election removed, which
saves the effort of ZooKeeper deployment and maintenance.

To achieve this, we plan to:
 # Introduce a {{FileSystemHaServices}} implementation of
{{HighAvailabilityServices}}.
 # Replace the Deployment with a StatefulSet to guarantee *at-most-one*
semantics and avoid potential concurrent access to the underlying FileSystem.
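The core idea of an election-free HA service can be sketched as follows: with
at-most-one JobManager guaranteed by the StatefulSet, the "leader" address can
simply be published to a well-known file that TaskManagers and clients read
back. The class and file names below are hypothetical illustrations, not the
actual {{FileSystemHaServices}} design:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Hypothetical sketch of FileSystem-based HA without leader election:
 * the single JobManager publishes its address to a well-known file.
 */
public class FileSystemLeaderStore {
    private final Path leaderFile;

    FileSystemLeaderStore(Path haStorageDir) throws IOException {
        Files.createDirectories(haStorageDir);
        this.leaderFile = haStorageDir.resolve("jobmanager-leader");
    }

    /** Called by the single JobManager on startup; no election involved. */
    void publishLeader(String restAddress) throws IOException {
        // Write via a temp file + atomic move so readers never see partial content.
        Path tmp = leaderFile.resolveSibling("jobmanager-leader.tmp");
        Files.writeString(tmp, restAddress);
        Files.move(tmp, leaderFile, StandardCopyOption.REPLACE_EXISTING,
                StandardCopyOption.ATOMIC_MOVE);
    }

    /** Called by TaskManagers/clients to discover the current JobManager. */
    String retrieveLeader() throws IOException {
        return Files.readString(leaderFile);
    }

    public static void main(String[] args) throws IOException {
        FileSystemLeaderStore store = new FileSystemLeaderStore(
                Files.createTempDirectory("flink-ha"));
        store.publishLeader("flink-jm-0.flink:8081");
        System.out.println(store.retrieveLeader());
    }
}
```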






[jira] [Created] (FLINK-17566) Fix potential K8s resources leak after JobManager finishes in Application mode

2020-05-08 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17566:


 Summary: Fix potential K8s resources leak after JobManager 
finishes in Application mode
 Key: FLINK-17566
 URL: https://issues.apache.org/jira/browse/FLINK-17566
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng


FLINK-10934 introduced Application mode support in the native K8s setups, but 
as discussed in [https://github.com/apache/flink/pull/12003], there is a high 
probability that all the K8s resources leak after the JobManager finishes; only 
the replica count of the Deployment is scaled down to 0. We need to find out 
the root cause and fix it.

This may be related to the way the fabric8 SDK deletes a Deployment. It splits 
the procedure into three steps:
 # Scale down the replica count to 0.
 # Wait until the scale-down succeeds.
 # Delete the ReplicaSet.




[jira] [Created] (FLINK-17565) Bump fabric8 version from 4.5.2 to 4.10.1

2020-05-07 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17565:


 Summary: Bump fabric8 version from 4.5.2 to 4.10.1
 Key: FLINK-17565
 URL: https://issues.apache.org/jira/browse/FLINK-17565
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng


Currently we are on version 4.5.2; it would be better to upgrade to 4.10.1 for 
features such as K8s 1.17 support and LeaderElection support.

For more details, please refer to the [fabric8 
releases|https://github.com/fabric8io/kubernetes-client/releases].





[jira] [Created] (FLINK-17549) Support running Stateful Functions on native Kubernetes setup

2020-05-06 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17549:


 Summary: Support running Stateful Functions on native Kubernetes 
setup
 Key: FLINK-17549
 URL: https://issues.apache.org/jira/browse/FLINK-17549
 Project: Flink
  Issue Type: New Feature
  Components: Build System / Stateful Functions, Deployment / Kubernetes
Reporter: Canbin Zheng


This is the umbrella issue for running Stateful Functions on Kubernetes in 
native mode.





[jira] [Created] (FLINK-17480) Support PyFlink on native Kubernetes setup

2020-04-30 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17480:


 Summary: Support PyFlink on native Kubernetes setup
 Key: FLINK-17480
 URL: https://issues.apache.org/jira/browse/FLINK-17480
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng


This is the umbrella issue for all PyFlink-related tasks relating to Flink on 
Kubernetes.





[jira] [Created] (FLINK-17332) Fix restart policy not equal to Never for native task manager pods

2020-04-22 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17332:


 Summary: Fix restart policy not equal to Never for native task 
manager pods
 Key: FLINK-17332
 URL: https://issues.apache.org/jira/browse/FLINK-17332
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0, 1.10.1
Reporter: Canbin Zheng
 Fix For: 1.11.0


Currently, we do not explicitly set the {{RestartPolicy}} for the TaskManager 
Pod in native K8s setups, so it defaults to {{Always}}. The task manager pod 
itself should not restart a failed container; that decision should always be 
made by the job manager.

Therefore, this ticket proposes to set the {{RestartPolicy}} to {{Never}} for 
the task manager pods.





[jira] [Created] (FLINK-17232) Rethink the implicit behavior to use the Service externalIP as the address of the Endpoint

2020-04-18 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17232:


 Summary: Rethink the implicit behavior to use the Service 
externalIP as the address of the Endpoint
 Key: FLINK-17232
 URL: https://issues.apache.org/jira/browse/FLINK-17232
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0, 1.10.1
Reporter: Canbin Zheng
 Fix For: 1.11.0


Currently, for a LB/NodePort type Service, if we find that the 
{{LoadBalancer}} field in the {{Service}} is null, we use the externalIPs 
configured on the external Service as the address of the Endpoint. Again, this 
is an implicit toleration and may confuse users.

This ticket proposes to rethink this implicit toleration behaviour.





[jira] [Created] (FLINK-17231) Deprecate possible implicit fallback from LB to NodePort When retrieving Endpoint of LB Service

2020-04-18 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17231:


 Summary: Deprecate possible implicit fallback from LB to NodePort 
When retrieving Endpoint of LB Service
 Key: FLINK-17231
 URL: https://issues.apache.org/jira/browse/FLINK-17231
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0, 1.10.1
Reporter: Canbin Zheng
 Fix For: 1.11.0


Currently, if people use a LB Service, when retrieving the Endpoint for the 
external Service we immediately fall back to returning the NodePort address 
and port if we fail to get the LB address in a single try. This kind of 
toleration can confuse users, since the NodePort address/port may be 
inaccessible for reasons such as network security policy, and users may not 
know what really happens behind the scenes.

This ticket proposes to always return the LB address/port, or otherwise throw 
an exception indicating that the LB is not ready or is abnormal.
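The proposed behaviour can be sketched as below; the class and method names
are hypothetical illustrations, not the actual
{{Fabric8FlinkKubeClient}} implementation:

```java
import java.util.Optional;

/** Hypothetical sketch of the proposal: no silent NodePort fallback. */
public class LoadBalancerEndpointRetriever {

    /**
     * @param lbIngressAddress the ingress address reported by the LoadBalancer,
     *                         empty while the LB is still being provisioned
     * @param port the rest port exposed by the Service
     */
    static String getRestEndpoint(Optional<String> lbIngressAddress, int port) {
        // Surface the real state instead of silently falling back to a
        // NodePort address that may be unreachable.
        return lbIngressAddress
                .map(address -> address + ":" + port)
                .orElseThrow(() -> new IllegalStateException(
                        "The LoadBalancer is not ready yet; retry later or "
                                + "check whether it was provisioned correctly."));
    }

    public static void main(String[] args) {
        System.out.println(getRestEndpoint(Optional.of("192.0.2.10"), 8081));
    }
}
```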





[jira] [Created] (FLINK-17230) Fix incorrect returned address of Endpoint for the ClusterIP Service

2020-04-18 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17230:


 Summary: Fix incorrect returned address of Endpoint for the 
ClusterIP Service
 Key: FLINK-17230
 URL: https://issues.apache.org/jira/browse/FLINK-17230
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0, 1.10.1
Reporter: Canbin Zheng
 Fix For: 1.11.0


At the moment, when the type of the external Service is set to {{ClusterIP}}, 
we return an incorrect address 
{{KubernetesUtils.getInternalServiceName(clusterId) + "." + nameSpace}} for the 
Endpoint.

This ticket aims to fix this bug by returning 
{{KubernetesUtils.getRestServiceName(clusterId) + "." + nameSpace}} instead.
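The intended fix can be sketched as follows. The helper mirrors the naming
used elsewhere in these tickets (the external/rest Service is named
{{${clusterId}-rest}}); the class itself is a hypothetical illustration, not
the actual {{KubernetesUtils}}:

```java
/** Hypothetical sketch of the corrected Endpoint address for ClusterIP Services. */
public class RestEndpointAddress {

    // Mirrors KubernetesUtils.getRestServiceName(clusterId) in spirit:
    // the external/rest Service is named "<clusterId>-rest".
    static String getRestServiceName(String clusterId) {
        return clusterId + "-rest";
    }

    /**
     * The fix: build the in-cluster DNS name from the *rest* Service, not
     * the internal one, so requests actually reach the REST endpoint.
     */
    static String restEndpointAddress(String clusterId, String namespace) {
        return getRestServiceName(clusterId) + "." + namespace;
    }

    public static void main(String[] args) {
        System.out.println(restEndpointAddress("my-cluster", "default"));
    }
}
```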





[jira] [Created] (FLINK-17196) Rework the implementation of Fabric8FlinkKubeClient#getRestEndpoint

2020-04-16 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17196:


 Summary: Rework the implementation of 
Fabric8FlinkKubeClient#getRestEndpoint
 Key: FLINK-17196
 URL: https://issues.apache.org/jira/browse/FLINK-17196
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0, 1.10.1
Reporter: Canbin Zheng
 Fix For: 1.11.0


This is the umbrella issue for the rework of the implementation of 
{{Fabric8FlinkKubeClient#getRestEndpoint}}.





[jira] [Created] (FLINK-17177) Handle Pod ERROR event correctly in KubernetesResourceManager#onError

2020-04-16 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17177:


 Summary: Handle Pod ERROR event correctly in 
KubernetesResourceManager#onError
 Key: FLINK-17177
 URL: https://issues.apache.org/jira/browse/FLINK-17177
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0, 1.10.1
Reporter: Canbin Zheng
 Fix For: 1.11.0


Currently, once we receive an *ERROR* event sent from the K8s API server via 
the K8s {{Watcher}}, {{KubernetesResourceManager#onError}} handles it by 
calling {{KubernetesResourceManager#removePodIfTerminated}}. This is 
incorrect, since the *ERROR* event indicates an exception in the HTTP layer 
caused by the K8s server, which means the previously created {{Watcher}} is 
no longer available and we should create a new {{Watcher}} immediately.





[jira] [Created] (FLINK-17176) Slow down Pod recreation in KubernetesResourceManager#PodCallbackHandler

2020-04-16 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17176:


 Summary: Slow down Pod recreation in 
KubernetesResourceManager#PodCallbackHandler
 Key: FLINK-17176
 URL: https://issues.apache.org/jira/browse/FLINK-17176
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0


In the native K8s setups, there are cases in which the {{PodCallbackHandler}} 
implementation of {{KubernetesResourceManager}} does not control the speed of 
pod re-creation, which poses a potential risk of flooding the K8s API Server.

Here are the steps to reproduce this kind of problem:
 # Mount {{/opt/flink/log}} in the TaskManager Container to a path on the K8s 
nodes via HostPath; make sure the path exists but the TaskManager process has 
no write permission. We can achieve this via user-specified pod template 
support, or just hardcode it for testing.
 # Launch a session cluster.
 # Submit a new job to the session cluster. As expected, we can observe that 
the Pod constantly fails quickly while launching the main Container, and 
{{KubernetesResourceManager#onModified}} is then invoked to re-create a new 
Pod immediately, without any speed control.

To sum up, once the {{KubernetesResourceManager}} receives a Pod *ADD* event 
and that Pod terminates before successfully registering with the 
{{KubernetesResourceManager}}, the {{KubernetesResourceManager}} sends another 
creation request to the K8s API Server immediately.
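One possible form of speed control is an exponential backoff with a cap, so a
crash-looping pod does not flood the API Server. The class below is a
hypothetical sketch of that idea, not Flink's actual implementation:

```java
/**
 * Hypothetical sketch of speed control for pod re-creation: exponential
 * backoff with a cap, reset once a pod registers successfully.
 */
public class PodRecreationBackoff {
    private final long initialDelayMs;
    private final long maxDelayMs;
    private int consecutiveFailures;

    PodRecreationBackoff(long initialDelayMs, long maxDelayMs) {
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Called when a pod terminates before registering; returns the delay to wait. */
    long recordFailureAndGetDelayMs() {
        // Double the delay per consecutive failure (shift capped to avoid overflow).
        long delay = initialDelayMs << Math.min(consecutiveFailures, 20);
        consecutiveFailures++;
        return Math.min(delay, maxDelayMs);
    }

    /** Called once a pod registers successfully. */
    void reset() {
        consecutiveFailures = 0;
    }

    public static void main(String[] args) {
        PodRecreationBackoff backoff = new PodRecreationBackoff(100, 10_000);
        System.out.println(backoff.recordFailureAndGetDelayMs());
        System.out.println(backoff.recordFailureAndGetDelayMs());
    }
}
```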





[jira] [Created] (FLINK-17104) Support registering custom JobStatusListeners from config

2020-04-12 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17104:


 Summary: Support registering custom JobStatusListeners from config
 Key: FLINK-17104
 URL: https://issues.apache.org/jira/browse/FLINK-17104
 Project: Flink
  Issue Type: New Feature
  Components: API / Core
Reporter: Canbin Zheng


Currently, a variety of users are asking for support for registering custom 
{{JobStatusListener}}s, in order to get timely feedback on the status 
transitions of their jobs. This could be an important feature for effective 
Flink cluster monitoring systems.
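The shape of such a feature can be sketched as below; the interface, enum,
and registry are hypothetical illustrations (listeners might be discovered
via a config key listing class names), not Flink's actual API:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a pluggable job-status listener registry. */
public class JobStatusNotifier {

    enum JobStatus { CREATED, RUNNING, FAILED, FINISHED }

    /** User-implemented callback, e.g. loaded from a config-listed class name. */
    interface JobStatusListener {
        void onStatusChange(String jobId, JobStatus oldStatus, JobStatus newStatus);
    }

    private final List<JobStatusListener> listeners = new ArrayList<>();

    void register(JobStatusListener listener) {
        listeners.add(listener);
    }

    /** Invoked by the scheduler whenever a job transitions between states. */
    void notifyTransition(String jobId, JobStatus from, JobStatus to) {
        for (JobStatusListener l : listeners) {
            l.onStatusChange(jobId, from, to);
        }
    }

    public static void main(String[] args) {
        JobStatusNotifier notifier = new JobStatusNotifier();
        notifier.register((id, from, to) ->
                System.out.println(id + ": " + from + " -> " + to));
        notifier.notifyTransition("job-1", JobStatus.CREATED, JobStatus.RUNNING);
    }
}
```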





[jira] [Created] (FLINK-17102) Add -Dkubernetes.container.image= for the start-flink-session section

2020-04-12 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17102:


 Summary: Add -Dkubernetes.container.image= for the 
start-flink-session section
 Key: FLINK-17102
 URL: https://issues.apache.org/jira/browse/FLINK-17102
 Project: Flink
  Issue Type: Sub-task
Reporter: Canbin Zheng


Add {{-Dkubernetes.container.image=}} as a guide for new users in the existing 
command:
{code:java}
./bin/kubernetes-session.sh \
  -Dkubernetes.cluster-id= \
  -Dtaskmanager.memory.process.size=4096m \
  -Dkubernetes.taskmanager.cpu=2 \
  -Dtaskmanager.numberOfTaskSlots=4 \
  -Dresourcemanager.taskmanager-timeout=360
{code}
For details, refer to 
[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/native_kubernetes.html#start-flink-session]





[jira] [Created] (FLINK-17100) Document Native Kubernetes Improvements

2020-04-12 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17100:


 Summary: Document Native Kubernetes Improvements
 Key: FLINK-17100
 URL: https://issues.apache.org/jira/browse/FLINK-17100
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes, Documentation
Reporter: Canbin Zheng


This is the umbrella issue for the native Kubernetes documentation improvements.





[jira] [Created] (FLINK-17090) Harden precheck for the KubernetesConfigOptions.JOB_MANAGER_CPU

2020-04-10 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17090:


 Summary: Harden precheck for the 
KubernetesConfigOptions.JOB_MANAGER_CPU
 Key: FLINK-17090
 URL: https://issues.apache.org/jira/browse/FLINK-17090
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0


If people specify a negative value for the config option 
{{KubernetesConfigOptions#JOB_MANAGER_CPU}}, as in the following command,
{code:java}
./bin/kubernetes-session.sh -Dkubernetes.jobmanager.cpu=-3.0 
-Dkubernetes.cluster-id=...{code}
then it will throw an exception as follows:
{quote}org.apache.flink.client.deployment.ClusterDeploymentException: Could not 
create Kubernetes cluster "felix1".
 at 
org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployClusterInternal(KubernetesClusterDescriptor.java:192)
 at 
org.apache.flink.kubernetes.KubernetesClusterDescriptor.deploySessionCluster(KubernetesClusterDescriptor.java:129)
 at 
org.apache.flink.kubernetes.cli.KubernetesSessionCli.run(KubernetesSessionCli.java:108)
 at 
org.apache.flink.kubernetes.cli.KubernetesSessionCli.lambda$main$0(KubernetesSessionCli.java:185)
 at 
org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
 at 
org.apache.flink.kubernetes.cli.KubernetesSessionCli.main(KubernetesSessionCli.java:185)

Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure 
executing: POST at: 
[https://cls-cf5wqdwy.ccs.tencent-cloud.com/apis/apps/v1/namespaces/default/deployments].
 Message: Deployment.apps "felix1" is invalid: 
[spec.template.spec.containers[0].resources.limits[cpu]: Invalid value: "-3": 
must be greater than or equal to 0, 
spec.template.spec.containers[0].resources.requests[cpu]: Invalid value: "-3": 
must be greater than or equal to 0]. Received status: Status(apiVersion=v1, 
code=422, 
details=StatusDetails(causes=[StatusCause(field=spec.template.spec.containers[0].resources.limits[cpu],
 message=Invalid value: "-3": must be greater than or equal to 0, 
reason=FieldValueInvalid, additionalProperties={}), 
StatusCause(field=spec.template.spec.containers[0].resources.requests[cpu], 
message=Invalid value: "-3": must be greater than or equal to 0, 
reason=FieldValueInvalid, additionalProperties={})], group=apps, 
kind=Deployment, name=felix1, retryAfterSeconds=null, uid=null, 
additionalProperties={}), kind=Status, message=Deployment.apps "felix1" is 
invalid: [spec.template.spec.containers[0].resources.limits[cpu]: Invalid 
value: "-3": must be greater than or equal to 0, 
spec.template.spec.containers[0].resources.requests[cpu]: Invalid value: "-3": 
must be greater than or equal to 0], metadata=ListMeta(_continue=null, 
resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, 
status=Failure, additionalProperties={}).
 at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:510)
 at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:449)
 at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:413)
 at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:372)
 at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:241)
 at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:798)
 at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:328)
 at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:324)
 at 
org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.createJobManagerComponent(Fabric8FlinkKubeClient.java:83)
 at 
org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployClusterInternal(KubernetesClusterDescriptor.java:182)
{quote}
Since there is a gap in the configuration model between the Flink side and the 
K8s side, this ticket proposes to harden the precheck in the Flink K8s 
parameter-parsing tool and throw a more user-friendly exception message, such 
as "the value of {{kubernetes.jobmanager.cpu}} must be greater than or equal 
to 0".
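The proposed client-side precheck can be sketched as below; the class and
method names are hypothetical illustrations, not the actual parsing tool:

```java
/** Hypothetical sketch of the proposed client-side precheck for JobManager CPU. */
public class CpuPrecheck {

    static double checkJobManagerCpu(double cpu) {
        // Fail fast on the client with an actionable message, instead of
        // surfacing a raw 422 response from the K8s API Server.
        if (cpu < 0) {
            throw new IllegalArgumentException(
                    "The value of kubernetes.jobmanager.cpu must be greater "
                            + "than or equal to 0, but was " + cpu + ".");
        }
        return cpu;
    }

    public static void main(String[] args) {
        System.out.println(checkJobManagerCpu(2.0));
    }
}
```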





[jira] [Created] (FLINK-17087) Use constant port for rest.port when it's set as 0 on Kubernetes

2020-04-10 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17087:


 Summary: Use constant port for rest.port when it's set as 0 on 
Kubernetes
 Key: FLINK-17087
 URL: https://issues.apache.org/jira/browse/FLINK-17087
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng
 Fix For: 1.11.0


If people set {{rest.port}} to 0 when deploying a native K8s session cluster, 
as the following command does,
{code:java}
./bin/kubernetes-session.sh -Dkubernetes.cluster-id=felix1 -Drest.port=0 ...
{code}
the submission client will throw an exception as follows:
{quote}org.apache.flink.client.deployment.ClusterDeploymentException: Could not 
create Kubernetes cluster felix1
 at 
org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployClusterInternal(KubernetesClusterDescriptor.java:189)
 at 
org.apache.flink.kubernetes.KubernetesClusterDescriptor.deploySessionCluster(KubernetesClusterDescriptor.java:129)
 at 
org.apache.flink.kubernetes.cli.KubernetesSessionCli.run(KubernetesSessionCli.java:108)
 at 
org.apache.flink.kubernetes.cli.KubernetesSessionCli.lambda$main$0(KubernetesSessionCli.java:185)
 at 
org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
 at 
org.apache.flink.kubernetes.cli.KubernetesSessionCli.main(KubernetesSessionCli.java:185)

Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure 
executing: POST at: https://xxx/apis/apps/v1/namespaces/default/deployments. 
Message: Deployment.apps "felix1" is invalid: 
spec.template.spec.containers[0].ports[0].containerPort: Required value. 
Received status: Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=spec.template.spec.containers[0].ports[0].containerPort,
 message=Required value, reason=FieldValueRequired, additionalProperties={})], 
group=apps, kind=Deployment, name=felix1, retryAfterSeconds=null, uid=null, 
additionalProperties={}), kind=Status, message=Deployment.apps "felix1" is 
invalid: spec.template.spec.containers[0].ports[0].containerPort: Required 
value, metadata=ListMeta(_continue=null, resourceVersion=null, selfLink=null, 
additionalProperties={}), reason=Invalid, status=Failure, 
additionalProperties={}).
 at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:510)
 at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:449)
 at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:413)
 at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:372)
 at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:241)
 at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:798)
 at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:328)
 at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:324)
 at 
org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.createJobManagerComponent(Fabric8FlinkKubeClient.java:83)
 at 
org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployClusterInternal(KubernetesClusterDescriptor.java:184)
 ... 5 more
{quote}
 

As we can see, the exception message is unintuitive and may confuse a variety 
of users.

Therefore, this ticket proposes to use a fixed port instead when people set it 
to 0, as we already do for {{blob.server.port}} and {{taskmanager.rpc.port}}.
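The normalization can be sketched as below; the class name is a hypothetical
illustration, and 8081 is assumed as the default rest port:

```java
/** Hypothetical sketch: map rest.port=0 to a fixed default before building the pod spec. */
public class RestPortNormalizer {

    static final int DEFAULT_REST_PORT = 8081;

    /**
     * A K8s containerPort must be a concrete value, so "0 = pick any free
     * port" cannot be honoured; fall back to the well-known default instead.
     */
    static int normalizeRestPort(int configuredPort) {
        return configuredPort == 0 ? DEFAULT_REST_PORT : configuredPort;
    }

    public static void main(String[] args) {
        System.out.println(normalizeRestPort(0));
        System.out.println(normalizeRestPort(9091));
    }
}
```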





[jira] [Created] (FLINK-17078) Logging output is misleading when executing bin/flink -e kubernetes-session without specifying cluster-id

2020-04-09 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17078:


 Summary: Logging output is misleading when executing bin/flink -e 
kubernetes-session without specifying cluster-id
 Key: FLINK-17078
 URL: https://issues.apache.org/jira/browse/FLINK-17078
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng
 Fix For: 1.11.0


When executing the following command:

 
{code:java}
./bin/flink run -d -e kubernetes-session 
examples/streaming/SocketWindowWordCount.jar --hostname 172.16.0.6 --port 12345
{code}
The exception stack would be:
{quote}org.apache.flink.client.program.ProgramInvocationException: The main 
method caused an error: 
org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the 
rest endpoint of flink-cluster-6fa5f5dc-4b50-48c3-b57f-d91e686fa474
 at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
 at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
 at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:148)
 at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:662)
 at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:210)
 at 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:893)
 at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:966)
 at 
org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
 at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:966)
Caused by: java.lang.RuntimeException: 
org.apache.flink.client.deployment.ClusterRetrieveException: Could not get the 
rest endpoint of flink-cluster-6fa5f5dc-4b50-48c3-b57f-d91e686fa474
 at 
org.apache.flink.kubernetes.KubernetesClusterDescriptor.lambda$createClusterClientProvider$0(KubernetesClusterDescriptor.java:94)
 at 
org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:118)
 at 
org.apache.flink.kubernetes.KubernetesClusterDescriptor.retrieve(KubernetesClusterDescriptor.java:59)
 at 
org.apache.flink.client.deployment.executors.AbstractSessionClusterExecutor.execute(AbstractSessionClusterExecutor.java:63)
 at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1756)
 at 
org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:106)
 at 
org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:72)
 at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1643)
 at 
org.apache.flink.streaming.examples.socket.SocketWindowWordCount.main(SocketWindowWordCount.java:92)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321)
 ... 8 more
Caused by: org.apache.flink.client.deployment.ClusterRetrieveException: Could 
not get the rest endpoint of flink-cluster-6fa5f5dc-4b50-48c3-b57f-d91e686fa474
 ... 22 more
{quote}
The logging output is misleading; we had better throw an exception indicating 
that people should explicitly specify the value of {{kubernetes.cluster-id}}.




[jira] [Created] (FLINK-17055) List jobs via bin/flink throws FlinkException indicating no cluster id was specified

2020-04-08 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17055:


 Summary: List jobs via bin/flink throws FlinkException indicating 
no cluster id was specified
 Key: FLINK-17055
 URL: https://issues.apache.org/jira/browse/FLINK-17055
 Project: Flink
  Issue Type: Bug
  Components: Command Line Client
Affects Versions: 1.10.0
Reporter: Canbin Zheng


The first command works fine.
{code:java}
./bin/flink list -m yarn-cluster -yid  application_1580534782944_0015
{code}
The second command throws an Exception.
{code:java}
./bin/flink list -e yarn-per-job -yid application_1580534782944_0015
{code}
And the exception stack is:
{quote}The program finished with the following exception:

org.apache.flink.util.FlinkException: No cluster id was specified. Please 
specify a cluster to which you would like to connect.
 at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:836)
 at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:334)
 at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:896)
 at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:966)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
 at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
 at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:966)
{quote}





[jira] [Created] (FLINK-17034) Execute the container CMD under TINI for better hygiene

2020-04-07 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17034:


 Summary:  Execute the container CMD under TINI for better hygiene
 Key: FLINK-17034
 URL: https://issues.apache.org/jira/browse/FLINK-17034
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Docker, Deployment / Kubernetes, Dockerfiles
Reporter: Canbin Zheng


The container of the JM or the TMs runs its command in shell form, so it 
cannot receive the *TERM* signal when the pod is deleted. Some of the 
resulting problems are:
 * The JM and the TMs get no chance to clean up; I previously created 
FLINK-15843 to track this problem.
 * The pod can take a long time (up to 40 seconds) to be deleted after the K8s 
API Server receives the deletion request.

For the discussion, refer to 
[http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-111-Docker-image-unification-td38444i20.html]





[jira] [Created] (FLINK-17033) Upgrade OpenJDK docker image for Kubernetes

2020-04-07 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17033:


 Summary: Upgrade OpenJDK docker image for Kubernetes
 Key: FLINK-17033
 URL: https://issues.apache.org/jira/browse/FLINK-17033
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng
 Fix For: 1.11.0


The current docker image {{openjdk:8-jre-alpine}} used by Kubernetes has many 
problems; here are some of them:
 # There is no official support for this image since the commit 
[https://github.com/docker-library/openjdk/commit/3eb0351b208d739fac35345c85e3c6237c2114ec#diff-f95ffa3d134732c33f7b8368e099]
 # 
[DNS-lookup-5s-delay|https://k8s.imroc.io/troubleshooting/cases/dns-lookup-5s-delay]
 # [DNS resolver does not read 
/etc/hosts|https://github.com/golang/go/issues/22846]

Therefore, this ticket proposes to investigate an alternative official JDK 
docker image; I think {{openjdk:8-jre-slim}} (184MB) is a good choice instead, 
for the following reasons:
 # It has official support from openjdk: 
[https://github.com/docker-library/docs/blob/master/openjdk/README.md#supported-tags-and-respective-dockerfile-links]
 # It is based on Debian and does not have problems such as the DNS lookup 
delay.
 # It is much smaller in size than other official docker images.





[jira] [Created] (FLINK-17032) Naming convention unification for all the Kubernetes Resources

2020-04-07 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17032:


 Summary: Naming convention unification for all the Kubernetes 
Resources
 Key: FLINK-17032
 URL: https://issues.apache.org/jira/browse/FLINK-17032
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0


Currently, the naming rules differ among the Kubernetes resources we create:
 # The Deployment: ${clusterId}
 # The internal Service: ${clusterId}
 # The external Service: ${clusterId}-rest
 # The Flink Configuration ConfigMap: flink-config-${clusterId}
 # The Hadoop Configuration ConfigMap: hadoop-config-${clusterId}
 # The JobManager Pod: ${clusterId}-${random string}-${random string}
 # The TaskManager Pod: 
${clusterId}-taskmanager-${currentMaxAttemptId}-${currentMaxPodId}

In the future, we would add other Kubernetes resources, and it would be better 
to have a unified naming convention for all of them.

This ticket proposes the following naming convention:
 * The Deployment: ${clusterId}
 * The internal Service: ${clusterId}-inter-svc
 * The external Service: ${clusterId}-rest-svc
 * The Flink Configuration ConfigMap: ${clusterId}-flink-config
 * The Hadoop Configuration ConfigMap: ${clusterId}-hadoop-config
 * The JobManager Pod: ${clusterId}-${random string}-${random string}
 * The TaskManager Pod: 
${clusterId}-taskmanager-${currentMaxAttemptId}-${currentMaxPodId}
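The proposed convention can be sketched as a small helper in which every name starts with the cluster id followed by a resource-specific suffix. The class and method names below are hypothetical illustrations, not actual Flink APIs:

```java
// Hypothetical helper sketching the proposed unified naming convention:
// every resource name is derived from the cluster id plus a suffix.
public class KubernetesResourceNames {

    public static String deployment(String clusterId) {
        return clusterId;
    }

    public static String internalService(String clusterId) {
        return clusterId + "-inter-svc";
    }

    public static String restService(String clusterId) {
        return clusterId + "-rest-svc";
    }

    public static String flinkConfigMap(String clusterId) {
        return clusterId + "-flink-config";
    }

    public static String hadoopConfigMap(String clusterId) {
        return clusterId + "-hadoop-config";
    }

    public static String taskManagerPod(String clusterId, int attemptId, long podId) {
        return clusterId + "-taskmanager-" + attemptId + "-" + podId;
    }
}
```

Centralizing the suffixes in one place like this would also make it trivial to keep future resources consistent with the convention.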



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17008) Handle null value for HADOOP_CONF_DIR/HADOOP_HOME env in AbstractKubernetesParameters#getLocalHadoopConfigurationDirectory

2020-04-06 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-17008:


 Summary: Handle null value for HADOOP_CONF_DIR/HADOOP_HOME env in 
AbstractKubernetesParameters#getLocalHadoopConfigurationDirectory
 Key: FLINK-17008
 URL: https://issues.apache.org/jira/browse/FLINK-17008
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng
 Fix For: 1.11.0


{{System.getenv(Constants.ENV_HADOOP_CONF_DIR)}} or 
{{System.getenv(Constants.HADOOP_HOME)}} can return null if these variables are 
not set in the system environment; it is a minor improvement to take these 
situations into account.
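A minimal sketch of the null-safe handling the ticket asks for, with the environment passed in as a map for testability; the class name, method name, and the HADOOP_HOME/etc/hadoop fallback layout are assumptions for illustration:

```java
import java.util.Map;
import java.util.Optional;

// Null-safe resolution of the local Hadoop configuration directory:
// prefer HADOOP_CONF_DIR, fall back to HADOOP_HOME/etc/hadoop, and
// return empty when neither variable is set (instead of an NPE).
public class HadoopConfDirResolver {

    public static Optional<String> resolve(Map<String, String> env) {
        String confDir = env.get("HADOOP_CONF_DIR");
        if (confDir != null && !confDir.isEmpty()) {
            return Optional.of(confDir);
        }
        String hadoopHome = env.get("HADOOP_HOME");
        if (hadoopHome != null && !hadoopHome.isEmpty()) {
            // Hadoop 2.x layout keeps the configuration under etc/hadoop.
            return Optional.of(hadoopHome + "/etc/hadoop");
        }
        // Neither variable is set: no NullPointerException, just empty.
        return Optional.empty();
    }
}
```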



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16913) ReadableConfigToConfigurationAdapter#getEnum throws UnsupportedOperationException

2020-04-01 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16913:


 Summary: ReadableConfigToConfigurationAdapter#getEnum throws 
UnsupportedOperationException
 Key: FLINK-16913
 URL: https://issues.apache.org/jira/browse/FLINK-16913
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.10.1, 1.11.0
 Attachments: image-2020-04-01-16-46-13-122.png

Steps to reproduce the issue:
 # Set flink-conf.yaml
 ** state.backend: rocksdb
 ** state.checkpoints.dir: hdfs:///flink-checkpoints
 ** state.savepoints.dir: hdfs:///flink-checkpoints
 # Start a Kubernetes session cluster
 # Submit a job to the session cluster, which fails with an UnsupportedOperationException

{code:java}
 The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: The adapter does not support this method
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
at 
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:143)
at 
org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:659)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:210)
at 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:890)
at 
org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:963)
at 
org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:963)
Caused by: java.lang.UnsupportedOperationException: The adapter does not 
support this method
at 
org.apache.flink.configuration.ReadableConfigToConfigurationAdapter.getEnum(ReadableConfigToConfigurationAdapter.java:258)
at 
org.apache.flink.contrib.streaming.state.RocksDBStateBackend.(RocksDBStateBackend.java:336)
at 
org.apache.flink.contrib.streaming.state.RocksDBStateBackend.configure(RocksDBStateBackend.java:394)
at 
org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory.createFromConfig(RocksDBStateBackendFactory.java:47)
at 
org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory.createFromConfig(RocksDBStateBackendFactory.java:32)
at 
org.apache.flink.runtime.state.StateBackendLoader.loadStateBackendFromConfig(StateBackendLoader.java:154)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.loadStateBackend(StreamExecutionEnvironment.java:792)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.configure(StreamExecutionEnvironment.java:761)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.(StreamExecutionEnvironment.java:217)
at 
org.apache.flink.client.program.StreamContextEnvironment.(StreamContextEnvironment.java:53)
at 
org.apache.flink.client.program.StreamContextEnvironment.lambda$setAsContext$2(StreamContextEnvironment.java:103)
at java.util.Optional.map(Optional.java:215)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.getExecutionEnvironment(StreamExecutionEnvironment.java:1882)
at 
org.apache.flink.streaming.examples.socket.SocketWindowWordCount.main(SocketWindowWordCount.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321)
... 8 more
{code}
The relevant call stack is shown above and in the attached screenshot.

I wonder why we introduced {{ReadableConfigToConfigurationAdapter}} to wrap the 
{{Configuration}} but left many of its methods throwing 
UnsupportedOperationException, which causes problems. Why don't we use the 
{{Configuration}} object directly?
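The failure mode can be illustrated with a toy adapter (these are not Flink's actual classes, just a minimal reproduction of the pattern): an adapter that implements only part of an interface fails at runtime the first time a caller, such as a state backend, touches an unimplemented method.

```java
// Toy interface standing in for the readable-config abstraction.
interface ReadableSettings {
    String getString(String key);
    <E extends Enum<E>> E getEnum(Class<E> clazz, String key);
}

// An adapter that only partially implements the interface: the
// unsupported method compiles fine but blows up at runtime.
public class PartialAdapter implements ReadableSettings {
    private final java.util.Map<String, String> backing;

    public PartialAdapter(java.util.Map<String, String> backing) {
        this.backing = backing;
    }

    @Override
    public String getString(String key) {
        return backing.get(key);
    }

    @Override
    public <E extends Enum<E>> E getEnum(Class<E> clazz, String key) {
        // This is the pattern the ticket complains about.
        throw new UnsupportedOperationException("The adapter does not support this method");
    }

    /** Demonstrates that the partial method fails at runtime, not compile time. */
    public static boolean enumLookupThrows(PartialAdapter adapter) {
        try {
            adapter.getEnum(java.util.concurrent.TimeUnit.class, "anything");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }
}
```

This is why delegating every method to the wrapped object (or using the wrapped object directly) is safer than a partial adapter: the compiler cannot catch the gap.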



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16855) Update K8s template for the standalone setup

2020-03-29 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16855:


 Summary: Update K8s template for the standalone setup
 Key: FLINK-16855
 URL: https://issues.apache.org/jira/browse/FLINK-16855
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng
 Fix For: 1.11.0


Following FLINK-16602 and discussion in 
[https://github.com/apache/flink/pull/11456]

We would like to update the K8s template for the standalone setup as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16737) Remove KubernetesUtils#getContentFromFile

2020-03-23 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16737:


 Summary: Remove KubernetesUtils#getContentFromFile
 Key: FLINK-16737
 URL: https://issues.apache.org/jira/browse/FLINK-16737
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng
 Fix For: 1.11.0


Since {{org.apache.flink.util.FileUtils}} has already provided some utilities 
such as {{readFile}} or {{readFileUtf8}} for reading file contents, we can 
remove the {{KubernetesUtils#getContentFromFile}} and use the {{FileUtils}} 
tool instead.
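The direction the ticket suggests can be sketched with the standard library (Flink's {{FileUtils#readFileUtf8}} is a thin wrapper around the same idea; the class below is illustrative, not Flink code):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Reading a whole file as UTF-8 text: the shared-utility approach the
// ticket prefers over a Kubernetes-specific getContentFromFile helper.
public class FileContentExample {

    public static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    /** Round-trip check: write a small config snippet and read it back. */
    public static boolean selfTest() {
        try {
            Path tmp = Files.createTempFile("flink-conf", ".yaml");
            Files.write(tmp, "taskmanager.numberOfTaskSlots: 2\n".getBytes(StandardCharsets.UTF_8));
            String content = readUtf8(tmp);
            Files.deleteIfExists(tmp);
            return content.equals("taskmanager.numberOfTaskSlots: 2\n");
        } catch (IOException e) {
            return false;
        }
    }
}
```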



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16715) Always use the configuration argument in YarnClusterDescriptor#startAppMaster to make it more self-contained

2020-03-23 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16715:


 Summary: Always use the configuration argument in 
YarnClusterDescriptor#startAppMaster to make it more self-contained
 Key: FLINK-16715
 URL: https://issues.apache.org/jira/browse/FLINK-16715
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / YARN
Reporter: Canbin Zheng
 Fix For: 1.11.0


In {{YarnClusterDescriptor#startAppMaster()}} we sometimes use the 
configuration argument passed to the method to get/set config options, and 
sometimes the flinkConfiguration class member. This ticket proposes to always 
use the configuration argument to make the method more self-contained.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16699) Support accessing secured services via K8s secrets

2020-03-20 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16699:


 Summary: Support accessing secured services via K8s secrets
 Key: FLINK-16699
 URL: https://issues.apache.org/jira/browse/FLINK-16699
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng
 Fix For: 1.11.0


Kubernetes [Secrets|https://kubernetes.io/docs/concepts/configuration/secret/] 
can be used to provide credentials for a Flink application to access secured 
services. This ticket proposes to:
 # Support mounting user-specified K8s Secrets into the JobManager/TaskManager 
Container.
 # Support using a user-specified K8s Secret through an environment variable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16625) Extract BootstrapTools#getEnvironmentVariables to a general utility in ConfigurationUtil

2020-03-16 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16625:


 Summary: Extract BootstrapTools#getEnvironmentVariables to a 
general utility in ConfigurationUtil
 Key: FLINK-16625
 URL: https://issues.apache.org/jira/browse/FLINK-16625
 Project: Flink
  Issue Type: Improvement
  Components: API / Core
Reporter: Canbin Zheng
 Fix For: 1.11.0


{{BootstrapTools#getEnvironmentVariables}} is in fact a general utility that 
extracts key-value pairs with a specified prefix trimmed from the Flink 
Configuration object. It can be used not only to extract customized environment 
variables in the YARN setup but also for customized 
annotations/labels/node-selectors in the Kubernetes setup.

This ticket proposes to rename it to 
{{ConfigurationUtil#getPrefixedKeyValuePairs}} and move it to the 
{{flink-core}} module as a more general utility to share for the 
YARN/Kubernetes setup.
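The core of the utility can be sketched as follows. The class follows the proposed {{ConfigurationUtil#getPrefixedKeyValuePairs}} naming, but the map-based signature is a simplification for illustration (the real method operates on a Flink {{Configuration}}):

```java
import java.util.HashMap;
import java.util.Map;

// General prefix extraction: collect all entries whose key starts with
// a given prefix and strip that prefix, so the same helper serves YARN
// environment variables and Kubernetes annotations/labels alike.
public class ConfigurationUtilSketch {

    public static Map<String, String> getPrefixedKeyValuePairs(
            String prefix, Map<String, String> config) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> entry : config.entrySet()) {
            String key = entry.getKey();
            // Require a non-empty remainder after stripping the prefix.
            if (key.startsWith(prefix) && key.length() > prefix.length()) {
                result.put(key.substring(prefix.length()), entry.getValue());
            }
        }
        return result;
    }
}
```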



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16624) Support to customize annotations for the rest Service on Kubernetes

2020-03-16 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16624:


 Summary: Support to customize annotations for the rest Service on 
Kubernetes
 Key: FLINK-16624
 URL: https://issues.apache.org/jira/browse/FLINK-16624
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng
 Fix For: 1.11.0


There are some scenarios in which people would like to customize annotations 
for the rest Service, for example:
 # Specify the LB type to decide whether it should have an external IP.
 # Specify the Security Policy for the LB.

It's a common need, especially in the Cloud Environment.  This ticket proposes 
to add support for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16602) Rework the Service design for Kubernetes deployment

2020-03-15 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16602:


 Summary: Rework the Service design for Kubernetes deployment
 Key: FLINK-16602
 URL: https://issues.apache.org/jira/browse/FLINK-16602
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0


At the moment we usually create two Services for a Flink application: the 
internal Service and the so-called rest Service. The former forwards requests 
from the TMs to the JM, while the rest Service mainly serves as an external 
service for the Flink application. Here is a summary of the issues:

 
 # The functionality boundary of the two Services is not clear enough, since 
the internal Service can also act as the rest Service when its exposed type is 
ClusterIP.
 # In the high availability scenario, we create a useless internal Service that 
does not help forward internal requests, since the TMs communicate directly 
with the JM via the IP or hostname of the JM Pod.
 # A headless Service is enough to forward the internal requests from the TMs 
to the JM. A Service of ClusterIP type adds corresponding rules to the 
iptables; too many rules in the iptables lower kube-proxy's efficiency in 
refreshing iptables when notified of change events, which can cause severe 
stability problems in a Kubernetes cluster.

 

Therefore, we propose some improvements to the current design:

 # Clarify the functionality boundary for the two Services: the internal 
Service only serves the internal communication from TMs to JM, while the rest 
Service makes the Flink cluster accessible from outside. The internal Service 
only exposes the RPC and BLOB ports while the external one exposes the REST 
port.
 # Do not create the internal Service in the high availability case.
 # Use HEADLESS type for the internal Service.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16601) Correct the way to get Endpoint address for NodePort rest Service

2020-03-15 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16601:


 Summary: Correct the way to get Endpoint address for NodePort rest 
Service
 Key: FLINK-16601
 URL: https://issues.apache.org/jira/browse/FLINK-16601
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0


Currently, if one sets the type of the rest Service to {{NodePort}}, the 
Endpoint address is obtained by calling 
{{KubernetesClient.getMasterUrl().getHost()}}. This solution works fine for 
non-managed Kubernetes clusters but not for managed ones.

For the managed Kubernetes cluster setups, the Kubernetes masters are deployed 
in a pool different from the Kubernetes nodes and do not expose NodePort for a 
NodePort Service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16600) Respect the rest.bind-port for the Kubernetes setup

2020-03-14 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16600:


 Summary: Respect the rest.bind-port for the Kubernetes setup
 Key: FLINK-16600
 URL: https://issues.apache.org/jira/browse/FLINK-16600
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0


Our current logic only takes care of {{RestOptions.PORT}} but not 
{{RestOptions.BIND_PORT}}, which is a bug.


For example, when one sets the {{RestOptions.BIND_PORT}} to a value other than 
{{RestOptions.PORT}}, jobs could not be submitted to the existing session 
cluster deployed via the kubernetes-session.sh.
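The intended precedence can be sketched as follows; this is an assumption based on the ticket's description (bind to rest.bind-port when set, otherwise fall back to rest.port), not Flink's actual resolution code:

```java
// Precedence sketch for the REST server's bind port: an explicitly
// configured rest.bind-port wins; otherwise rest.port is used.
public class RestPortResolver {

    /** bindPort may be null when rest.bind-port is not configured. */
    public static int effectiveBindPort(Integer bindPort, int port) {
        return bindPort != null ? bindPort : port;
    }
}
```

The bug described above corresponds to the Kubernetes code path ignoring the first argument entirely.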



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16598) Respect the rest port exposed by Service in Fabric8FlinkKubeClient#getRestEndpoint

2020-03-14 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16598:


 Summary: Respect the rest port exposed by Service in 
Fabric8FlinkKubeClient#getRestEndpoint
 Key: FLINK-16598
 URL: https://issues.apache.org/jira/browse/FLINK-16598
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0


Steps to reproduce the problem:
 # Start a Kubernetes session cluster, by default, the rest port exposed by the 
rest Service is 8081.
 # Submit a job to the session cluster, but specify a different rest port via 
-Drest.port=8082

As we can see, the job could never be submitted to the existing session cluster 
since we retrieve the Endpoint from the Flink configuration object instead of 
the created Service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16547) Correct the order to write temporary files in YarnClusterDescriptor#startAppMaster

2020-03-11 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16547:


 Summary: Correct the order to write temporary files in 
YarnClusterDescriptor#startAppMaster
 Key: FLINK-16547
 URL: https://issues.apache.org/jira/browse/FLINK-16547
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / YARN
Reporter: Canbin Zheng
 Fix For: 1.11.0


Currently, in {{YarnClusterDescriptor#startAppMaster}}, we first write out and 
upload the Flink Configuration file, then write out the JobGraph file and set 
its name into the Flink Configuration object; this later setting is not written 
into the Flink Configuration file, so it does not take effect on the cluster 
side.

Since the client names the JobGraph file with the default value of the 
FileJobGraphRetriever.JOB_GRAPH_FILE_PATH option, the cluster side currently 
still succeeds in retrieving that file.

This ticket proposes to write out the JobGraph file before the Configuration 
file to ensure that the setting of FileJobGraphRetriever.JOB_GRAPH_FILE_PATH is 
delivered to the cluster side.
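The ordering problem can be modeled with a toy snapshot of the configuration (the config key and method names are illustrative, not the real YARN descriptor code): settings added to the in-memory configuration after it has been serialized never reach the cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model: "serializing" the configuration is taking a snapshot of
// the map; anything put into the map afterwards is lost.
public class WriteOrderSketch {

    /** Buggy order: serialize first, then record the JobGraph path. */
    public static Map<String, String> serializeThenSet(Map<String, String> conf, String jobGraphPath) {
        Map<String, String> uploaded = new LinkedHashMap<>(conf); // snapshot written out
        conf.put("internal.jobgraph-path", jobGraphPath);          // too late
        return uploaded;
    }

    /** Proposed order: record the path before serializing. */
    public static Map<String, String> setThenSerialize(Map<String, String> conf, String jobGraphPath) {
        conf.put("internal.jobgraph-path", jobGraphPath);
        return new LinkedHashMap<>(conf);
    }
}
```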



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16546) Fix logging bug in YarnClusterDescriptor#startAppMaster

2020-03-11 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16546:


 Summary: Fix logging bug in YarnClusterDescriptor#startAppMaster
 Key: FLINK-16546
 URL: https://issues.apache.org/jira/browse/FLINK-16546
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / YARN
Reporter: Canbin Zheng
 Fix For: 1.11.0


It's a minor fixup of the warning in case of tmpJobGraphFile deletion failure.

From
{code:java}
LOG.warn("Fail to delete temporary file {}.", tmpConfigurationFile.toPath());
{code}
to
{code:java}
LOG.warn("Fail to delete temporary file {}.", tmpJobGraphFile.toPath());
{code}

[https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/YarnClusterDescriptor.java#L854]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16521) Remove unused FileUtils#isClassFile

2020-03-10 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16521:


 Summary: Remove unused FileUtils#isClassFile
 Key: FLINK-16521
 URL: https://issues.apache.org/jira/browse/FLINK-16521
 Project: Flink
  Issue Type: Improvement
  Components: API / Core
Reporter: Canbin Zheng
 Fix For: 1.10.1, 1.11.0


Remove the unused public method of {{FileUtils#isClassFile}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16508) Name the ports exposed by the main Container in Pod

2020-03-09 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16508:


 Summary: Name the ports exposed by the main Container in Pod
 Key: FLINK-16508
 URL: https://issues.apache.org/jira/browse/FLINK-16508
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.10.1, 1.11.0


Currently, we expose some ports via the main Container of the JobManager and 
the TaskManager, but we forget to name those ports so that people could be 
confused because there is no description of the port usage. This ticket 
proposes to explicitly name the ports in the Container to help people 
understand the usage of those ports.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16494) Use enum type instead of string type for KubernetesConfigOptions.CONTAINER_IMAGE_PULL_POLICY

2020-03-08 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16494:


 Summary: Use enum type instead of string type for 
KubernetesConfigOptions.CONTAINER_IMAGE_PULL_POLICY
 Key: FLINK-16494
 URL: https://issues.apache.org/jira/browse/FLINK-16494
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng
 Fix For: 1.11.0


At the moment the value of the 
{{KubernetesConfigOptions.CONTAINER_IMAGE_PULL_POLICY}} option can only be one 
of _Always_, _Never_, and _IfNotPresent_; this ticket proposes to use an enum 
type for the option instead of a string type.
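A sketch of such an enum is below. Kubernetes defines exactly these three pull policies; the enum and parsing-helper names are illustrative, not the actual option class Flink ended up with:

```java
// Enum replacing the free-form string: invalid values fail fast at
// parse time instead of being passed through to the Kubernetes API.
public enum ImagePullPolicySketch {
    ALWAYS("Always"),
    NEVER("Never"),
    IF_NOT_PRESENT("IfNotPresent");

    private final String k8sName;

    ImagePullPolicySketch(String k8sName) {
        this.k8sName = k8sName;
    }

    /** The exact casing the Kubernetes API expects. */
    public String toK8sName() {
        return k8sName;
    }

    /** Case-insensitive parsing of user-supplied configuration values. */
    public static ImagePullPolicySketch fromString(String value) {
        for (ImagePullPolicySketch p : values()) {
            if (p.k8sName.equalsIgnoreCase(value)) {
                return p;
            }
        }
        throw new IllegalArgumentException("Unknown image pull policy: " + value);
    }
}
```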



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16493) Use enum type instead of string type for KubernetesConfigOptions.REST_SERVICE_EXPOSED_TYPE

2020-03-08 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16493:


 Summary: Use enum type instead of string type for 
KubernetesConfigOptions.REST_SERVICE_EXPOSED_TYPE
 Key: FLINK-16493
 URL: https://issues.apache.org/jira/browse/FLINK-16493
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng
 Fix For: 1.11.0


At the moment the value of the 
{{KubernetesConfigOptions.REST_SERVICE_EXPOSED_TYPE}} option can only be one of 
_ClusterIP_, _NodePort_, and _LoadBalancer_; this ticket proposes to use an 
enum type for the option instead of a string type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16253) Switch to Log4j 2 by default for flink-kubernetes submodule

2020-02-23 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16253:


 Summary: Switch to Log4j 2 by default for flink-kubernetes 
submodule
 Key: FLINK-16253
 URL: https://issues.apache.org/jira/browse/FLINK-16253
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng
 Fix For: 1.11.0


Switch to Log4j 2 by default for flink-kubernetes submodule, including the 
script and the container startup command or parameters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16240) Port KubernetesUtilsTest to the right package

2020-02-22 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16240:


 Summary: Port KubernetesUtilsTest to the right package
 Key: FLINK-16240
 URL: https://issues.apache.org/jira/browse/FLINK-16240
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0


Port {{KubernetesUtilsTest}} from {{org.apache.flink.kubernetes}} to 
{{org.apache.flink.kubernetes.utils}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16239) Port KubernetesSessionCliTest to the right package

2020-02-22 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16239:


 Summary: Port KubernetesSessionCliTest to the right package
 Key: FLINK-16239
 URL: https://issues.apache.org/jira/browse/FLINK-16239
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0


Port {{KubernetesSessionCliTest}} from {{org.apache.flink.kubernetes}} to 
{{org.apache.flink.kubernetes.cli}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16238) Rename class name of Fabric8ClientTest to Fabric8FlinkKubeClient

2020-02-22 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16238:


 Summary: Rename class name of Fabric8ClientTest to 
Fabric8FlinkKubeClient
 Key: FLINK-16238
 URL: https://issues.apache.org/jira/browse/FLINK-16238
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0


It's a minor change to align the test class name with 
{{org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16194) Refactor the Kubernetes resources construction architecture

2020-02-20 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16194:


 Summary: Refactor the Kubernetes resources construction architecture
 Key: FLINK-16194
 URL: https://issues.apache.org/jira/browse/FLINK-16194
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0


So far, Flink has made efforts toward the native integration of Kubernetes. 
However, it is always essential to evaluate the existing design and consider 
alternatives that are better designed and easier to maintain in the long run. 
We have suffered from some problems while developing new features based on the 
current code. Here are some of them:
 # We don’t have a unified monadic-step based orchestrator architecture to 
construct all the Kubernetes resources.
 ** There are inconsistencies between the orchestrator architecture that client 
uses to create the Kubernetes resources, and the orchestrator architecture that 
the master uses to create Pods; this confuses new contributors, as there is a 
cognitive burden to understand two architectural philosophies instead of one; 
for another, maintenance and new feature development become quite challenging.
 ** Pod construction is done in one step. With the introduction of new features 
for the Pod, the construction process could become far more complicated, and 
the functionality of a single class could explode, which hurts code 
readability, writability, and testability. At the moment, we have encountered 
such challenges and realized that it is not an easy thing to develop new 
features related to the Pod.
 ** The implementations of a specific feature are usually scattered across 
multiple decoration classes. For example, the current design uses a decoration 
chain that contains five decorator classes just to mount a configuration file 
into the Pod. If people would like to introduce support for other configuration 
files, such as Hadoop configuration or keytab files, they have no choice but to 
repeat the same tedious and scattered process.
 # We don’t have dedicated objects or tools for centrally parsing, verifying, 
and managing the Kubernetes parameters, which has raised some maintenance and 
inconsistency issues.
 ** There are many duplicated parsing and validating code, including settings 
of Image, ImagePullPolicy, ClusterID, ConfDir, Labels, etc. It not only harms 
readability and testability but also is prone to mistakes. Refer to issue 
FLINK-16025 for inconsistent parsing of the same parameter.
 ** The parameters are scattered so that some of the method signatures have to 
declare many unnecessary input parameters, such as 
FlinkMasterDeploymentDecorator#createJobManagerContainer.

 

For solving these issues, we propose to 
 # Introduce a unified monadic-step based orchestrator architecture that has a 
better, cleaner and consistent abstraction for the Kubernetes resources 
construction process. 
 # Add some dedicated tools for centrally parsing, verifying, and managing the 
Kubernetes parameters.

 

Refer to the design doc for the details, any feedback is welcome.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16025) Service could expose different blob server port mismatched with JM Container

2020-02-12 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-16025:


 Summary: Service could expose different blob server port 
mismatched with JM Container
 Key: FLINK-16025
 URL: https://issues.apache.org/jira/browse/FLINK-16025
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.10.1, 1.11.0


The Service always exposes port 6124 if the blob server port should be exposed, 
and while building the ServicePort we do not explicitly specify a target port, 
so the target port is always 6124 too.
{code:java}
// From ServiceDecorator.java

servicePorts.add(getServicePort(
    getPortName(BlobServerOptions.PORT.key()),
    Constants.BLOB_SERVER_PORT));

private ServicePort getServicePort(String name, int port) {
    return new ServicePortBuilder()
        .withName(name)
        .withPort(port)
        .build();
}


{code}
Meanwhile, the Container of the JM exposes the blob server port that is 
configured in the Flink Configuration:
{code:java}
// From FlinkMasterDeploymentDecorator.java

final int blobServerPort = KubernetesUtils.parsePort(flinkConfig, 
BlobServerOptions.PORT);

...

final Container container = createJobManagerContainer(flinkConfig, mainClass, 
hasLogback, hasLog4j, blobServerPort);
{code}
So there is a risk that the TM cannot execute Tasks due to dependency-fetching 
failures: when one configures the blob server port with a value other than 
6124, the Service exposes a blob server port different from the one the JM 
Container exposes.
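The mismatch can be modeled without the fabric8 types (which are replaced here with a tiny illustrative class): when no target port is specified, Kubernetes defaults the target port to the service port, so a hard-coded 6124 silently diverges from a JM container configured differently.

```java
// Toy model of the Service/Container port mismatch described above.
public class ServicePortSketch {
    public static final int HARD_CODED_BLOB_PORT = 6124;

    /** Kubernetes default: targetPort falls back to port when unset. */
    public static int effectiveTargetPort(int servicePort, Integer explicitTargetPort) {
        return explicitTargetPort != null ? explicitTargetPort : servicePort;
    }

    /** True when the Service forwards to a port the container does not expose. */
    public static boolean mismatch(int configuredBlobPort, Integer explicitTargetPort) {
        return effectiveTargetPort(HARD_CODED_BLOB_PORT, explicitTargetPort) != configuredBlobPort;
    }
}
```

Setting the target port explicitly from the configured blob server port would remove the divergence.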



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15857) Support --help for the kubernetes-session script

2020-02-02 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15857:


 Summary: Support --help for the kubernetes-session script
 Key: FLINK-15857
 URL: https://issues.apache.org/jira/browse/FLINK-15857
 Project: Flink
  Issue Type: Sub-task
  Components: Command Line Client, Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0


Add --help support for the kubernetes-session script to show usage of the 
Kubernetes session options for users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15856) Various Kubernetes command client improvements

2020-02-02 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15856:


 Summary: Various Kubernetes command client improvements
 Key: FLINK-15856
 URL: https://issues.apache.org/jira/browse/FLINK-15856
 Project: Flink
  Issue Type: Improvement
  Components: Command Line Client, Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0


This is the umbrella issue for the Kubernetes related command client 
improvements.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15852) Job is submitted to the wrong session cluster

2020-02-02 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15852:


 Summary: Job is submitted to the wrong session cluster
 Key: FLINK-15852
 URL: https://issues.apache.org/jira/browse/FLINK-15852
 Project: Flink
  Issue Type: Bug
  Components: Command Line Client
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0, 1.10.1


Steps to reproduce the problem:
 # Deploy a YARN session cluster by command {{./bin/yarn-session.sh -d}}
 # Deploy a Kubernetes session cluster by command 
{{./bin/kubernetes-session.sh -Dkubernetes.cluster-id=test ...}}
 # Try to submit a Job to the Kubernetes session cluster by command 
{{./bin/flink run -d -e kubernetes-session -Dkubernetes.cluster-id=test 
examples/streaming/WordCount.jar}}

It's expected that the Job will be submitted to the Kubernetes session cluster 
whose cluster-id is *test*, however, the job was submitted to the YARN session 
cluster.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15843) Do not violently kill TaskManagers

2020-02-02 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15843:


 Summary: Do not violently kill TaskManagers
 Key: FLINK-15843
 URL: https://issues.apache.org/jira/browse/FLINK-15843
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0


The current solution for stopping a TaskManager instance when the JobManager 
sends a deletion request is to directly call 
{{KubernetesClient.pods().withName().delete}}, so that instance is 
violently killed with a _KILL_ signal and has no chance to clean up. This 
can cause problems, because we expect the process to terminate gracefully when 
it is no longer needed.

Referring to the guide on [Termination of 
Pods|https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods],
 we know that on Kubernetes a _TERM_ signal is first sent to the main 
process in each container, possibly followed by a force _KILL_ signal once 
the grace period has expired; the Unix signal is delivered to the process which 
has PID 1 ([Docker 
Kill|https://docs.docker.com/engine/reference/commandline/kill/]). However, the 
TaskManagerRunner process is spawned by 
/opt/flink/bin/kubernetes-entry.sh and never has 
PID 1, so it never receives the Unix signal.

 

One workaround could be that the JobManager first sends a *KILL_WORKER* message 
to the TaskManager, the TaskManager then gracefully terminates itself so that 
the clean-up fully completes, and finally the JobManager deletes 
the Pod after a configurable graceful shut-down period.
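The workaround hinges on the TaskManager shutting itself down rather than being KILLed. As a minimal sketch in plain Java (hypothetical names, not Flink's actual shutdown path): a JVM shutdown hook runs on a _TERM_ signal or a normal exit, but never on _KILL_, which is exactly why the hard Pod deletion skips all clean-up.

```java
/**
 * Minimal sketch (hypothetical names, not Flink's actual shutdown path):
 * a JVM shutdown hook runs when the process receives SIGTERM or exits
 * normally, but never on SIGKILL -- which is why force-deleting the Pod
 * skips clean-up entirely.
 */
public class GracefulWorker {

    private volatile boolean cleanedUp = false;

    /** Release resources: close connections, flush state, deregister. */
    public void cleanUp() {
        cleanedUp = true;
    }

    public boolean isCleanedUp() {
        return cleanedUp;
    }

    /** Register the hook; the JVM runs it on normal exit or SIGTERM only. */
    public Thread registerShutdownHook() {
        Thread hook = new Thread(this::cleanUp, "graceful-shutdown");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        GracefulWorker worker = new GracefulWorker();
        worker.registerShutdownHook();
        // ... do work; on `kill -TERM <pid>` the hook runs before exit,
        // on `kill -KILL <pid>` it does not ...
    }
}
```

Note that this only helps if the signal actually reaches the JVM, which is the PID 1 problem described above.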



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15822) Rethink the necessity of the internal Service

2020-01-30 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15822:


 Summary: Rethink the necessity of the internal Service
 Key: FLINK-15822
 URL: https://issues.apache.org/jira/browse/FLINK-15822
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0, 1.10.1


The current design introduces two Kubernetes Services when deploying a new 
session cluster. The rest Service serves external communication, while the 
internal Service mainly serves two purposes:
 # As a discovery service, it directs communication from TaskManagers to the 
JobManager Pod whose labels match the internal Service's selector in the non-HA 
mode, so that the TM Pods can re-register with the new JM Pod after a JM Pod 
failover. In the HA mode there can be one active and multiple standby JM Pods, 
so we use the Pod IP of the active one for internal communication instead of 
the internal Service.
 # It acts as the OwnerReference of all other Kubernetes resources, including 
the rest Service, the ConfigMap and the JobManager Deployment.

Is it possible to create one single Service instead of two? I think things 
could work quite well with only the rest Service, and the design and code 
would be more succinct.

This ticket proposes to remove the internal Service. The core changes are:
 # In the non-HA mode, use the rest Service as the JobManager Pod discovery 
service.
 # Set the JobManager Deployment as the OwnerReference of all other 
Kubernetes resources, including the rest Service and the ConfigMap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15817) Kubernetes Resource leak while deployment exception happens

2020-01-30 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15817:


 Summary: Kubernetes Resource leak while deployment exception 
happens
 Key: FLINK-15817
 URL: https://issues.apache.org/jira/browse/FLINK-15817
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0, 1.10.1


When we deploy a new session cluster on a Kubernetes cluster, there are usually 
four steps that create the Kubernetes components, in the following order: 
internal Service -> rest Service -> ConfigMap -> JobManager Deployment.

Once the internal Service is created, any exception that fails the cluster 
deployment causes a Kubernetes resource leak. For example:
 # If creating the rest Service fails due to the Service name 
constraint ([FLINK-15816|https://issues.apache.org/jira/browse/FLINK-15816]), 
the internal Service is not cleaned up when the deployment 
terminates.
 # If creating the JobManager Deployment fails (one case is that 
_jobmanager.heap.size_ is too small, such as 512M, which is less than the 
default value of 'containerized.heap-cutoff-min'), the internal 
Service, the rest Service and the ConfigMap are all leaked.

This ticket proposes to clean up the residual Services and 
ConfigMap on the client side if the cluster deployment terminates 
unexpectedly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15816) Limit the maximum length of the value of kubernetes.cluster-id configuration option

2020-01-30 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15816:


 Summary: Limit the maximum length of the value of 
kubernetes.cluster-id configuration option
 Key: FLINK-15816
 URL: https://issues.apache.org/jira/browse/FLINK-15816
 Project: Flink
  Issue Type: Sub-task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Canbin Zheng
 Fix For: 1.11.0, 1.10.1


Two Kubernetes Services are created when a session cluster is deployed: the 
internal Service and the rest Service. We set the internal Service name to the 
value of the _kubernetes.cluster-id_ configuration option, and the rest Service 
name to _${kubernetes.cluster-id}_ with the suffix *-rest* appended. For 
example, if we set _kubernetes.cluster-id_ to *session-test*, the internal 
Service name will be *session-test* and the rest Service name 
*session-test-rest*. Kubernetes constrains a Service name to no more than 63 
characters, so under the current naming convention the value of 
_kubernetes.cluster-id_ must not exceed 58 characters; otherwise there are 
scenarios where the internal Service is created successfully and then a 
ClusterDeploymentException is thrown when trying to create the rest Service.
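The 58-character bound follows from the 63-character Service-name limit minus the 5 characters of the *-rest* suffix. A small validation sketch (class and method names are hypothetical, not existing Flink code):

```java
/** Sketch of the proposed check; class and method names are hypothetical. */
public final class ClusterIdValidator {

    // Kubernetes limits a Service name (a DNS label) to 63 characters.
    private static final int MAX_SERVICE_NAME_LENGTH = 63;
    private static final String REST_SERVICE_SUFFIX = "-rest";

    /** Longest cluster-id whose derived rest Service name still fits: 63 - 5 = 58. */
    public static final int MAX_CLUSTER_ID_LENGTH =
        MAX_SERVICE_NAME_LENGTH - REST_SERVICE_SUFFIX.length();

    /** Fail fast on the client side instead of mid-deployment. */
    public static void validate(String clusterId) {
        if (clusterId.length() > MAX_CLUSTER_ID_LENGTH) {
            throw new IllegalArgumentException(
                "kubernetes.cluster-id must be at most " + MAX_CLUSTER_ID_LENGTH
                    + " characters, but was " + clusterId.length());
        }
    }

    private ClusterIdValidator() {}
}
```

Checking this up front avoids the half-created state where the internal Service exists but the rest Service cannot be created.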



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15667) Hadoop Configurations Mount Support

2020-01-19 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15667:


 Summary: Hadoop Configurations Mount Support
 Key: FLINK-15667
 URL: https://issues.apache.org/jira/browse/FLINK-15667
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng


Mount the Hadoop configuration into the Pod, either from a pre-defined 
ConfigMap or from a local configuration directory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15666) GPU scheduling support in Kubernetes mode

2020-01-19 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15666:


 Summary: GPU scheduling support in Kubernetes mode
 Key: FLINK-15666
 URL: https://issues.apache.org/jira/browse/FLINK-15666
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng


This is an umbrella ticket for work on GPU scheduling in Kubernetes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15665) External shuffle service support in Kubernetes mode

2020-01-19 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15665:


 Summary: External shuffle service support in Kubernetes mode
 Key: FLINK-15665
 URL: https://issues.apache.org/jira/browse/FLINK-15665
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng


This is an umbrella ticket for work on Kubernetes-specific external shuffle 
service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15663) Kerberos Support in Kubernetes Deploy Mode

2020-01-19 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15663:


 Summary: Kerberos Support in Kubernetes Deploy Mode
 Key: FLINK-15663
 URL: https://issues.apache.org/jira/browse/FLINK-15663
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng


This is the umbrella issue for all Kerberos-related tasks for Flink on 
Kubernetes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15662) Make K8s client timeouts configurable

2020-01-19 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15662:


 Summary: Make K8s client timeouts configurable
 Key: FLINK-15662
 URL: https://issues.apache.org/jira/browse/FLINK-15662
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng


Kubernetes clients used in the client-side submission and requesting worker 
pods should have configurable read and connect timeouts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15659) Introduce a client-side StatusWatcher for JM Pods

2020-01-18 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15659:


 Summary: Introduce a client-side StatusWatcher for JM Pods
 Key: FLINK-15659
 URL: https://issues.apache.org/jira/browse/FLINK-15659
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng


It's quite useful to introduce a client-side StatusWatcher that tracks and logs 
the status of the JobManager Pods, so that users who deploy applications on K8s 
via the CLI can conveniently follow the deployment progress.

The status could be logged on every state change, as well as at a regular interval.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15656) Support user-specified pod templates

2020-01-18 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15656:


 Summary: Support user-specified pod templates
 Key: FLINK-15656
 URL: https://issues.apache.org/jira/browse/FLINK-15656
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng


The current approach of introducing a new configuration option for each aspect 
of the pod specification a user might wish to customize is becoming unwieldy: 
we have to maintain more and more Flink-side Kubernetes configuration options, 
and users have to bridge the gap between the declarative model used by 
Kubernetes and the configuration model used by Flink. It would be a great 
improvement to allow users to specify pod templates as a central place for all 
customization needs of the jobmanager and taskmanager pods.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15654) Expose podIP for Containers by Environment Variables

2020-01-18 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15654:


 Summary: Expose podIP for Containers by Environment Variables
 Key: FLINK-15654
 URL: https://issues.apache.org/jira/browse/FLINK-15654
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng


Expose IP information of a Pod for its containers to use.

 
{code:java}
new ContainerBuilder()
    .addNewEnv()
        .withName(ENV_JOBMANAGER_BIND_ADDRESS)
        .withValueFrom(new EnvVarSourceBuilder()
            .withNewFieldRef("v1", "status.podIP")
            .build())
    .endEnv()
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15653) Support to configure request or limit for MEM requirement

2020-01-18 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15653:


 Summary: Support to configure request or limit for MEM requirement
 Key: FLINK-15653
 URL: https://issues.apache.org/jira/browse/FLINK-15653
 Project: Flink
  Issue Type: New Feature
Reporter: Canbin Zheng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15652) Support for imagePullSecrets k8s option

2020-01-18 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15652:


 Summary: Support for imagePullSecrets k8s option
 Key: FLINK-15652
 URL: https://issues.apache.org/jira/browse/FLINK-15652
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng


It is likely that one would pull images from a private image registry. 
Credentials can be passed in the Pod specification through the 
`imagePullSecrets` parameter, which refers to a Kubernetes Secret by name. 
Implementation-wise, we expose a new configuration option to users and pass 
it along to Kubernetes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15649) Support mounting volumes

2020-01-18 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15649:


 Summary: Support mounting volumes 
 Key: FLINK-15649
 URL: https://issues.apache.org/jira/browse/FLINK-15649
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15648) Support to configure request or limit for CPU requirement

2020-01-18 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15648:


 Summary: Support to configure request or limit for CPU requirement
 Key: FLINK-15648
 URL: https://issues.apache.org/jira/browse/FLINK-15648
 Project: Flink
  Issue Type: New Feature
Reporter: Canbin Zheng


The current branch uses kubernetes.xx.cpu to configure both the request and the 
limit resource requirement of a Container. It is an important improvement to 
separate these two settings: we can use kubernetes.xx.request.cpu and 
kubernetes.xx.limit.cpu to specify the request and the limit resource 
requirements respectively.
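A sketch of how the split options could be resolved, with a fallback to the legacy combined key; all key names mirror the proposal above and are assumptions, not shipped configuration options:

```java
import java.util.Map;

/**
 * Sketch of resolving separate request/limit CPU keys with a fallback to the
 * legacy combined key. Key names mirror the proposal and are assumptions,
 * not shipped Flink options.
 */
public final class CpuConfig {

    /** Returns {requestCpu, limitCpu} for the given component
     *  (e.g. "jobmanager" or "taskmanager"). */
    public static double[] resolveCpu(Map<String, String> conf, String component) {
        String legacy = conf.get("kubernetes." + component + ".cpu");
        String request = conf.get("kubernetes." + component + ".request.cpu");
        String limit = conf.get("kubernetes." + component + ".limit.cpu");
        // Prefer the specific key, fall back to the legacy combined key.
        double requestCpu = parse(request != null ? request : legacy, 1.0);
        double limitCpu = parse(limit != null ? limit : legacy, requestCpu);
        return new double[] {requestCpu, limitCpu};
    }

    private static double parse(String value, double fallback) {
        return value == null ? fallback : Double.parseDouble(value);
    }

    private CpuConfig() {}
}
```

Falling back to the legacy key keeps existing deployments working unchanged while letting new ones set request and limit independently.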



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15647) Support to set annotations for JM/TM Pods

2020-01-18 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15647:


 Summary: Support to set annotations for JM/TM Pods
 Key: FLINK-15647
 URL: https://issues.apache.org/jira/browse/FLINK-15647
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng


One can use either labels or annotations to attach metadata to Kubernetes 
objects. Labels can be used to select objects and to find collections of 
objects that satisfy certain conditions. In contrast, annotations are not used 
to identify and select objects. The metadata in an annotation can be small or 
large, structured or unstructured, and can include characters not permitted by 
labels.

 

[Kubernetes 
Annotation|https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15646) Configurable K8s context support

2020-01-18 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-15646:


 Summary: Configurable K8s context support
 Key: FLINK-15646
 URL: https://issues.apache.org/jira/browse/FLINK-15646
 Project: Flink
  Issue Type: New Feature
  Components: Deployment / Kubernetes
Reporter: Canbin Zheng


One is forced to first run {{kubectl config use-context }} to switch 
to the desired context when working with multiple K8s clusters or with multiple 
K8s "users" for a given cluster, so it is an important 
improvement to add an option (kubernetes.context) for configuring an arbitrary 
context when deploying a Flink cluster. If that option is not specified, 
the current context in the config file is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14432) Add 'SHOW SYSTEM FUNCTIONS' support in SQL CLI

2019-10-17 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-14432:


 Summary: Add 'SHOW SYSTEM FUNCTIONS' support in SQL CLI
 Key: FLINK-14432
 URL: https://issues.apache.org/jira/browse/FLINK-14432
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Client
Reporter: Canbin Zheng


Add 'SHOW SYSTEM FUNCTIONS' as an end-to-end functionality for the SQL CLI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14179) Wrong description of SqlCommand.SHOW_FUNCTIONS

2019-09-23 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-14179:


 Summary: Wrong description of SqlCommand.SHOW_FUNCTIONS
 Key: FLINK-14179
 URL: https://issues.apache.org/jira/browse/FLINK-14179
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Client
Affects Versions: 1.9.0
Reporter: Canbin Zheng
 Fix For: 1.10.0
 Attachments: image-2019-09-24-10-59-26-286.png

Currently '*SHOW FUNCTIONS*' lists not only user-defined functions but also 
system-defined ones, so the description *'Shows all registered 
user-defined functions.'* does not correctly describe this functionality. I 
think we can change the description to '*Shows all system-defined and 
user-defined functions.*'

 

!image-2019-09-24-10-59-26-286.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14171) Add 'SHOW USER FUNCTIONS' support in SQL CLI

2019-09-23 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-14171:


 Summary: Add 'SHOW USER FUNCTIONS' support in SQL CLI
 Key: FLINK-14171
 URL: https://issues.apache.org/jira/browse/FLINK-14171
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / Client
Affects Versions: 1.9.0
Reporter: Canbin Zheng
 Fix For: 1.10.0


Currently *listUserDefinedFunctions* is already supported in the Executor; I 
think we can add 'SHOW USER FUNCTIONS' support to make it an end-to-end 
functionality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14102) Introduce DB2Dialect

2019-09-17 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-14102:


 Summary: Introduce DB2Dialect
 Key: FLINK-14102
 URL: https://issues.apache.org/jira/browse/FLINK-14102
 Project: Flink
  Issue Type: Sub-task
Reporter: Canbin Zheng






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-14101) Introduce SqlServerDialect

2019-09-17 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-14101:


 Summary: Introduce SqlServerDialect
 Key: FLINK-14101
 URL: https://issues.apache.org/jira/browse/FLINK-14101
 Project: Flink
  Issue Type: Sub-task
Reporter: Canbin Zheng






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-14100) Introduce OracleDialect

2019-09-17 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-14100:


 Summary: Introduce OracleDialect
 Key: FLINK-14100
 URL: https://issues.apache.org/jira/browse/FLINK-14100
 Project: Flink
  Issue Type: Sub-task
Reporter: Canbin Zheng






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-14098) Support multiple sql statements splitting by semicolon for TableEnvironment

2019-09-17 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-14098:


 Summary: Support multiple sql statements splitting by semicolon 
for TableEnvironment
 Key: FLINK-14098
 URL: https://issues.apache.org/jira/browse/FLINK-14098
 Project: Flink
  Issue Type: New Feature
Reporter: Canbin Zheng


Currently TableEnvironment.sqlUpdate supports parsing a single SQL statement by 
invoking SqlParser.parseStmt.

After the work of 
[CALCITE-2453|https://issues.apache.org/jira/browse/CALCITE-2453], Calcite's 
SqlParser is able to parse multiple SQL statements separated by semicolons. 
IMO, it would be useful to refactor TableEnvironment.sqlUpdate to support 
multiple SQL statements as well, by invoking SqlParser.parseStmtList instead.

I am not sure whether this is a duplicate ticket; if it is, let me know, 
thanks.
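For illustration only, the intended end result looks roughly like the naive splitter below; the actual change would delegate to Calcite's SqlParser.parseStmtList, since a plain split mis-handles semicolons inside string literals:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Naive illustration of splitting a script into statements by semicolon.
 * The real implementation would use Calcite's SqlParser.parseStmtList;
 * this sketch does NOT handle ';' inside string literals or comments.
 */
public final class StatementSplitter {

    /** Splits on every semicolon and drops empty fragments. */
    public static List<String> split(String sql) {
        List<String> statements = new ArrayList<>();
        for (String part : sql.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                statements.add(trimmed);
            }
        }
        return statements;
    }

    private StatementSplitter() {}
}
```

Each returned statement would then be handed to the existing single-statement path, which is why a proper parser-level split (rather than string splitting) is the right long-term approach.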



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-14078) Introduce more JDBCDialect implementations

2019-09-15 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-14078:


 Summary: Introduce more JDBCDialect implementations
 Key: FLINK-14078
 URL: https://issues.apache.org/jira/browse/FLINK-14078
 Project: Flink
  Issue Type: New Feature
  Components: Connectors / JDBC
Reporter: Canbin Zheng


 MySQL, Derby and Postgres JDBCDialect implementations are available now; maybe 
we can introduce more JDBCDialect implementations, such as OracleDialect, 
SqlServerDialect, DB2Dialect, etc.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-14075) Remove unnecessary resource close in JDBCOutputFormatTest.java

2019-09-13 Thread Canbin Zheng (Jira)
Canbin Zheng created FLINK-14075:


 Summary: Remove unnecessary resource close in 
JDBCOutputFormatTest.java
 Key: FLINK-14075
 URL: https://issues.apache.org/jira/browse/FLINK-14075
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / JDBC
Reporter: Canbin Zheng
 Fix For: 1.10.0


 
{code:java}
public void clearOutputTable() throws Exception {
   Class.forName(DRIVER_CLASS);
   try (
  Connection conn = DriverManager.getConnection(DB_URL);
  Statement stat = conn.createStatement()) {
  stat.execute("DELETE FROM " + OUTPUT_TABLE);

  stat.close();
  conn.close();
   }
}{code}
The explicit stat.close() and conn.close() calls are unnecessary, since the 
try-with-resources statement already closes both, and can be removed.

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-6269) var could be a val

2017-04-05 Thread CanBin Zheng (JIRA)
CanBin Zheng created FLINK-6269:
---

 Summary: var could be a val
 Key: FLINK-6269
 URL: https://issues.apache.org/jira/browse/FLINK-6269
 Project: Flink
  Issue Type: Wish
  Components: JobManager
Affects Versions: 1.2.0
Reporter: CanBin Zheng
Priority: Trivial


In JobManager.scala,
{{var userCodeLoader = libraryCacheManager.getClassLoader(jobGraph.getJobID)}}
could be a val, since it is never reassigned.





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-6224) RemoteStreamEnvironment not resolve hostname of JobManager

2017-03-30 Thread CanBin Zheng (JIRA)
CanBin Zheng created FLINK-6224:
---

 Summary: RemoteStreamEnvironment not resolve hostname of JobManager
 Key: FLINK-6224
 URL: https://issues.apache.org/jira/browse/FLINK-6224
 Project: Flink
  Issue Type: Bug
  Components: Client, DataStream API
Affects Versions: 1.2.0
Reporter: CanBin Zheng
Assignee: CanBin Zheng


I ran two examples from the same client.

The first one uses
ExecutionEnvironment.createRemoteEnvironment("10.75.203.170", 59551)
and the second one uses
StreamExecutionEnvironment.createRemoteEnvironment("10.75.203.170", 59551)

The first one runs successfully, but the second one fails (the connection to 
the JobManager times out). For the second one, if I change the host parameter 
from the IP to the hostname, it works.

I checked the source code and found that 
ExecutionEnvironment.createRemoteEnvironment resolves the given host, i.e. it 
looks up the hostname for the given IP, whereas 
StreamExecutionEnvironment.createRemoteEnvironment does not.

As Till Rohrmann mentioned, the problem is that with 
FLINK-2821 [1] we can no longer resolve the hostname on the JobManager, so 
we'd better resolve the hostname for the given IP in RemoteStreamEnvironment too.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)