[jira] [Created] (FLINK-33997) Typo in the doc `classloader.parent-first-patterns-additional`
Matyas Orhidi created FLINK-33997: - Summary: Typo in the doc `classloader.parent-first-patterns-additional` Key: FLINK-33997 URL: https://issues.apache.org/jira/browse/FLINK-33997 Project: Flink Issue Type: Bug Affects Versions: 1.18.0 Reporter: Matyas Orhidi Typo in the doc: [https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/debugging/debugging_classloading/#unloading-of-dynamically-loaded-classes-in-user-code] classloader.parent-first-patterns-additional -> classloader.parent-first-patterns.additional -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-32690) Report Double.NAN instead of null for missing autoscaler metrics
Matyas Orhidi created FLINK-32690: - Summary: Report Double.NAN instead of null for missing autoscaler metrics Key: FLINK-32690 URL: https://issues.apache.org/jira/browse/FLINK-32690 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.7.0 Change null values to Double.NAN for autoscaler metrics during blackout periods when no data is gathered. This appears to be a more common practice than null, and it is consistent with the other metrics we have. -- This message was sent by Atlassian Jira (v8.20.10#820010)
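The proposal can be illustrated with a minimal sketch. The class `MissingMetricValue` and the metric name `SOME_OTHER_METRIC` are hypothetical and not taken from the operator code base; note also that the Java constant is spelled `Double.NaN`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the operator. */
final class MissingMetricValue {

    /** Returns the measured value, or Double.NaN when nothing was collected. */
    static double orNaN(Double measured) {
        return measured == null ? Double.NaN : measured;
    }

    /** Copies the map, replacing every null value with Double.NaN. */
    static Map<String, Double> withoutNulls(Map<String, Double> metrics) {
        Map<String, Double> result = new LinkedHashMap<>();
        metrics.forEach((name, value) -> result.put(name, orNaN(value)));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> raw = new LinkedHashMap<>();
        raw.put("LOAD_MAX", 0.75);
        raw.put("SOME_OTHER_METRIC", null); // assumed: no data gathered during a blackout period
        System.out.println(withoutNulls(raw));
    }
}
```

One thing to keep in mind with this approach: `Double.NaN == Double.NaN` is false in Java, so consumers should test with `Double.isNaN(value)`.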
[jira] [Created] (FLINK-32272) Expose LOAD_MAX as autoscaler metric
Matyas Orhidi created FLINK-32272: - Summary: Expose LOAD_MAX as autoscaler metric Key: FLINK-32272 URL: https://issues.apache.org/jira/browse/FLINK-32272 Project: Flink Issue Type: New Feature Components: Kubernetes Operator Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.6.0 LOAD_MAX is a metric that helps identify the busiest vertices, a.k.a. hot spots, in the job graph. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-32271) Report RECOMMENDED_PARALLELISM as an autoscaler metric
Matyas Orhidi created FLINK-32271: - Summary: Report RECOMMENDED_PARALLELISM as an autoscaler metric Key: FLINK-32271 URL: https://issues.apache.org/jira/browse/FLINK-32271 Project: Flink Issue Type: New Feature Components: Kubernetes Operator Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.6.0 It is beneficial to report the recommended parallelism and overlay it with the current parallelism on the same chart when the autoscaler is running in advisor mode. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-31717) Unit tests running with local kube config
Matyas Orhidi created FLINK-31717: - Summary: Unit tests running with local kube config Key: FLINK-31717 URL: https://issues.apache.org/jira/browse/FLINK-31717 Project: Flink Issue Type: New Feature Components: Kubernetes Operator Reporter: Matyas Orhidi Some unit tests are using local kube environment. This can be dangerous when pointing to sensitive clusters e.g. in prod. {{2023-04-03 12:32:53,956 i.f.k.c.Config [DEBUG] Found for Kubernetes config at: [/Users//.kube/config]. }} A misconfigured kube config environment revealed the issue: {{[ERROR] Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.012 s <<< FAILURE! - in org.apache.flink.kubernetes.operator.FlinkOperatorTest [ERROR] org.apache.flink.kubernetes.operator.FlinkOperatorTest.testConfigurationPassedToJOSDK Time elapsed: 0.008 s <<< ERROR! java.lang.NullPointerException at org.apache.flink.kubernetes.operator.FlinkOperatorTest.testConfigurationPassedToJOSDK(FlinkOperatorTest.java:63) [ERROR] org.apache.flink.kubernetes.operator.FlinkOperatorTest.testLeaderElectionConfig Time elapsed: 0.004 s <<< ERROR! java.lang.NullPointerException at org.apache.flink.kubernetes.operator.FlinkOperatorTest.testLeaderElectionConfig(FlinkOperatorTest.java:108) }} move ~/.kube/config -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-31611) Add delayed restart to failed jobs
Matyas Orhidi created FLINK-31611: - Summary: Add delayed restart to failed jobs Key: FLINK-31611 URL: https://issues.apache.org/jira/browse/FLINK-31611 Project: Flink Issue Type: New Feature Reporter: Matyas Orhidi Operator is able to restart failed jobs already using: {{kubernetes.operator.job.restart.failed: true}} It's beneficial however to keep a failed job around for a while for inspection: {{kubernetes.operator.job.restart.failed.delay: 5m}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
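A minimal sketch of how such a delay could gate the restart decision. The class and method names are hypothetical, and the operator's actual logic may differ; only the two configuration keys come from the issue.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical decision helper for illustration; not the operator's implementation. */
final class DelayedRestart {

    /**
     * @param restartFailed value of kubernetes.operator.job.restart.failed
     * @param delay         value of the proposed kubernetes.operator.job.restart.failed.delay
     * @param failedAt      when the job was observed in the FAILED state (assumed to be tracked)
     * @param now           current time
     * @return true once the job has stayed failed for at least the configured delay
     */
    static boolean shouldRestart(boolean restartFailed, Duration delay, Instant failedAt, Instant now) {
        if (!restartFailed) {
            return false; // restarts of failed jobs are disabled entirely
        }
        return !now.isBefore(failedAt.plus(delay));
    }

    public static void main(String[] args) {
        Instant failedAt = Instant.parse("2023-03-24T10:00:00Z");
        Instant threeMinutesLater = failedAt.plus(Duration.ofMinutes(3));
        System.out.println(shouldRestart(true, Duration.ofMinutes(5), failedAt, threeMinutesLater));
    }
}
```

Passing `Duration.ZERO` as the delay reproduces the existing immediate-restart behaviour in this sketch.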
[jira] [Created] (FLINK-30609) Add ephemeral storage to CRD
Matyas Orhidi created FLINK-30609: - Summary: Add ephemeral storage to CRD Key: FLINK-30609 URL: https://issues.apache.org/jira/browse/FLINK-30609 Project: Flink Issue Type: New Feature Reporter: Matyas Orhidi We should consider adding ephemeral storage to the existing resource specification in CRD, next to `cpu` and `memory` https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#setting-requests-and-limits-for-local-ephemeral-storage -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-30330) Exclude .github from source release(s)
Matyas Orhidi created FLINK-30330: - Summary: Exclude .github from source release(s) Key: FLINK-30330 URL: https://issues.apache.org/jira/browse/FLINK-30330 Project: Flink Issue Type: Bug Components: Kubernetes Operator Reporter: Matyas Orhidi -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-30157) Trigger Events Before JM Recovery and Unhealthy Job Restarts
Matyas Orhidi created FLINK-30157: - Summary: Trigger Events Before JM Recovery and Unhealthy Job Restarts Key: FLINK-30157 URL: https://issues.apache.org/jira/browse/FLINK-30157 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.3.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.3.0 We should emit specific events for the following cases: * JM recovery * Unhealthy Job Restarts -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29744) Throw DeploymentFailedException on ImagePullBackOff
Matyas Orhidi created FLINK-29744: - Summary: Throw DeploymentFailedException on ImagePullBackOff Key: FLINK-29744 URL: https://issues.apache.org/jira/browse/FLINK-29744 Project: Flink Issue Type: Improvement Reporter: Matyas Orhidi -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29619) Remove redundant MeterView updater thread from KubernetesClientMetrics
Matyas Orhidi created FLINK-29619: - Summary: Remove redundant MeterView updater thread from KubernetesClientMetrics Key: FLINK-29619 URL: https://issues.apache.org/jira/browse/FLINK-29619 Project: Flink Issue Type: Bug Reporter: Matyas Orhidi The `MetricRegistryImpl` already has a solution to update `MeterView` objects periodically. https://github.com/apache/flink/blob/7a509c46e45b9a91f2b7d01f13afcdef266b1faf/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/MetricRegistryImpl.java#L404 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29475) Add WARNING/ERROR checker for the operator in e2e tests
Matyas Orhidi created FLINK-29475: - Summary: Add WARNING/ERROR checker for the operator in e2e tests Key: FLINK-29475 URL: https://issues.apache.org/jira/browse/FLINK-29475 Project: Flink Issue Type: Improvement Affects Versions: kubernetes-operator-1.3.0 Reporter: Matyas Orhidi We can also try eliminating unwanted warnings like: {{[WARN ] The client is using resource type 'flinkdeployments' with unstable version 'v1beta1'}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29474) Name collision: Group already contains a Metric with the name
Matyas Orhidi created FLINK-29474: - Summary: Name collision: Group already contains a Metric with the name Key: FLINK-29474 URL: https://issues.apache.org/jira/browse/FLINK-29474 Project: Flink Issue Type: Bug Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.2.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.2.0 k create -f examples/basic-session-deployment-and-job.yaml results in warnings: {quote} flink-kubernetes-operator 2022-09-29 13:30:00,001 o.a.f.m.MetricGroup [WARN ][default/basic-session-job-example] Name collision: Group already contains a Metric with the name 'TimeSeconds'. Metric will not be reported.[flink-kubernetes-operator-6f9bbfd557-ljp6w, k8soperator, default, flink-kubernetes-operator, system, Lifecycle, Transition, Resume] flink-kubernetes-operator 2022-09-29 13:30:00,001 o.a.f.m.MetricGroup [WARN ][default/basic-session-job-example] Name collision: Group already contains a Metric with the name 'TimeSeconds'. Metric will not be reported.[flink-kubernetes-operator-6f9bbfd557-ljp6w, k8soperator, default, flink-kubernetes-operator, system, Lifecycle, Transition, Upgrade] {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29327) Operator configs are showing up among standard Flink configs
Matyas Orhidi created FLINK-29327: - Summary: Operator configs are showing up among standard Flink configs Key: FLINK-29327 URL: https://issues.apache.org/jira/browse/FLINK-29327 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.1.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.2.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29322) Expose savepoint format on Web UI
Matyas Orhidi created FLINK-29322: - Summary: Expose savepoint format on Web UI Key: FLINK-29322 URL: https://issues.apache.org/jira/browse/FLINK-29322 Project: Flink Issue Type: New Feature Components: Runtime / Web Frontend Reporter: Matyas Orhidi Savepoint format is not exposed on the Web UI, so users have to remember which format they used when triggering the savepoint. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29313) Some config overrides are ignored when set under spec.flinkConfiguration
Matyas Orhidi created FLINK-29313: - Summary: Some config overrides are ignored when set under spec.flinkConfiguration Key: FLINK-29313 URL: https://issues.apache.org/jira/browse/FLINK-29313 Project: Flink Issue Type: Bug Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.2.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.2.0 Some [configs|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#resourceuser-configuration] that can be specified under spec.flinkConfiguration won't take effect without an upgrade, e.g.: * {{kubernetes.operator.periodic.savepoint.interval}} * {{kubernetes.operator.savepoint.format.type}} These properties are used mainly from the so-called 'observeConfig', and won't be available in the operator until the job is restarted. Ideally these should be changed without an upgrade, but at the moment they won't take effect at all. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29261) Consider using FAIL_ON_UNKNOWN_PROPERTIES in the Operator
Matyas Orhidi created FLINK-29261: - Summary: Consider using FAIL_ON_UNKNOWN_PROPERTIES in the Operator Key: FLINK-29261 URL: https://issues.apache.org/jira/browse/FLINK-29261 Project: Flink Issue Type: Bug Reporter: Matyas Orhidi The operator cannot be downgraded, once the CR specification is written to the `status` Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "mode" (class org.apache.flink.kubernetes.operator.crd.spec.FlinkDeploymentSpec), not marked as ignorable (12 known properties: "restartNonce", "imagePullPolicy", "ingress", "flinkConfiguration", "serviceAccount", "image", "job", "podTemplate", "jobManager", "logConfiguration", "flinkVersion", "taskManager"]) at [Source: UNKNOWN; byte offset: #UNKNOWN] (through reference chain: org.apache.flink.kubernetes.operator.crd.spec.FlinkDeploymentSpec["mode"]) at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:61) at com.fasterxml.jackson.databind.DeserializationContext.handleUnknownProperty(DeserializationContext.java:1127) at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1989) at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1700) at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownVanilla(BeanDeserializerBase.java:1678) at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:319) at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:176) at com.fasterxml.jackson.databind.deser.DefaultDeserializationContext.readRootValue(DefaultDeserializationContext.java:322) at com.fasterxml.jackson.databind.ObjectMapper._readValue(ObjectMapper.java:4650) at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2831) at 
com.fasterxml.jackson.databind.ObjectMapper.treeToValue(ObjectMapper.java:3295) at org.apache.flink.kubernetes.operator.reconciler.ReconciliationUtils.deserializeSpecWithMeta(ReconciliationUtils.java:288) ... 18 more -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29251) Send CREATED status and Cancel event via FlinkResourceListener
Matyas Orhidi created FLINK-29251: - Summary: Send CREATED status and Cancel event via FlinkResourceListener Key: FLINK-29251 URL: https://issues.apache.org/jira/browse/FLINK-29251 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.2.0 To complete the lifecycle history of a custom resource, the operator should send: * CREATED status notification during initial deployment of a CR * Cancel event when deleting a CR -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29194) Add LoggingResourceListener as default
Matyas Orhidi created FLINK-29194: - Summary: Add LoggingResourceListener as default Key: FLINK-29194 URL: https://issues.apache.org/jira/browse/FLINK-29194 Project: Flink Issue Type: New Feature Components: Deployment / Kubernetes Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.2.0 For auditing/debugging purposes the operator needs a way to report the emitted events / status updates in the logs: {{[DEBUG] [default.basic-example] Event | Info | SpecChanged | UPGRADE change(s) detected (FlinkDeploymentSpec[image=flink:1.15,restartNonce=] differs from FlinkDeploymentSpec[image=flink:1.15asdf,restartNonce=123]), starting reconciliation.}} {{[DEBUG] [default.basic-example] Event | Info | Suspended | Suspending existing deployment.}} {{[DEBUG] [default.basic-example] Status | Info | UPGRADING | The resource is being upgraded }} {{[DEBUG] [default.basic-example] Status | Info | UPGRADING | The resource is being upgraded }} {{[DEBUG] [default.basic-example] Event | Info | Submit | Starting deployment}} {{[DEBUG] [default.basic-example] Status | Info | DEPLOYED | The resource is deployed/submitted to Kubernetes, but it’s not yet considered to be stable and might be rolled back in the future }} {{[DEBUG] [default.basic-example] Status | Info | DEPLOYED | The resource is deployed/submitted to Kubernetes, but it’s not yet considered to be stable and might be rolled back in the future }} {{[DEBUG] [default.basic-example] Event | Info | StatusChanged | Job status changed from RECONCILING to CREATED}} {{[DEBUG] [default.basic-example] Status | Info | DEPLOYED | The resource is deployed/submitted to Kubernetes, but it’s not yet considered to be stable and might be rolled back in the future }} {{[DEBUG] [default.basic-example] Event | Info | StatusChanged | Job status changed from CREATED to RUNNING}} {{[DEBUG] [default.basic-example] Status | Info | STABLE | The resource deployment is considered to be stable and won’t be rolled back }} -- This message was sent by Atlassian 
Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28594) Add metrics for FlinkService
Matyas Orhidi created FLINK-28594: - Summary: Add metrics for FlinkService Key: FLINK-28594 URL: https://issues.apache.org/jira/browse/FLINK-28594 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.2.0 We need some metrics for the `FlinkService` to be able to tell how long it takes to perform most of the blocking operations we have in this service. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28593) Introduce default ingress templates at operator level
Matyas Orhidi created FLINK-28593: - Summary: Introduce default ingress templates at operator level Key: FLINK-28593 URL: https://issues.apache.org/jira/browse/FLINK-28593 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.2.0 Ingress templates are currently [defined at CR level|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/ingress/], but these rules can be enabled globally at operator level too. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28592) Implement custom resource counters as counters not gauges
Matyas Orhidi created FLINK-28592: - Summary: Implement custom resource counters as counters not gauges Key: FLINK-28592 URL: https://issues.apache.org/jira/browse/FLINK-28592 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.1.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.2.0 * change the current implementation to counters * add counters at global level -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28564) Update NOTICE/LICENCE files for 1.1.0 release
Matyas Orhidi created FLINK-28564: - Summary: Update NOTICE/LICENCE files for 1.1.0 release Key: FLINK-28564 URL: https://issues.apache.org/jira/browse/FLINK-28564 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.1.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28517) Bump Flink version to 1.15.1
Matyas Orhidi created FLINK-28517: - Summary: Bump Flink version to 1.15.1 Key: FLINK-28517 URL: https://issues.apache.org/jira/browse/FLINK-28517 Project: Flink Issue Type: Improvement Affects Versions: kubernetes-operator-1.1.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.1.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28476) Add metrics for Kubernetes API server access
Matyas Orhidi created FLINK-28476: - Summary: Add metrics for Kubernetes API server access Key: FLINK-28476 URL: https://issues.apache.org/jira/browse/FLINK-28476 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.1.0 e.g.: * http response counter * http response latency histogram * http response status counter -- This message was sent by Atlassian Jira (v8.20.10#820010)
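The three kinds of metrics listed above can be sketched with plain JDK classes. Everything here (the class name, the bucket boundaries, the accessor methods) is a hypothetical illustration; the operator would presumably register such values through Flink's metric system instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical, JDK-only sketch of API server access metrics. */
final class HttpResponseMetrics {

    // Assumed bucket upper bounds in milliseconds; one extra overflow bucket follows them.
    private static final long[] BUCKETS_MS = {10, 100, 1000};

    private final LongAdder responses = new LongAdder();
    private final Map<Integer, LongAdder> byStatus = new ConcurrentHashMap<>();
    private final LongAdder[] latencyBuckets = new LongAdder[BUCKETS_MS.length + 1];

    HttpResponseMetrics() {
        for (int i = 0; i < latencyBuckets.length; i++) {
            latencyBuckets[i] = new LongAdder();
        }
    }

    /** Records one HTTP response with its status code and latency. */
    void record(int statusCode, long latencyMs) {
        responses.increment();
        byStatus.computeIfAbsent(statusCode, code -> new LongAdder()).increment();
        int bucket = 0;
        while (bucket < BUCKETS_MS.length && latencyMs > BUCKETS_MS[bucket]) {
            bucket++;
        }
        latencyBuckets[bucket].increment();
    }

    long responseCount() {
        return responses.sum();
    }

    long statusCount(int statusCode) {
        LongAdder adder = byStatus.get(statusCode);
        return adder == null ? 0 : adder.sum();
    }

    long latencyBucketCount(int bucket) {
        return latencyBuckets[bucket].sum();
    }
}
```

`LongAdder` is used rather than a plain `long` because responses may be recorded from several client threads at once.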
[jira] [Created] (FLINK-28445) Support dynamic configurations
Matyas Orhidi created FLINK-28445: - Summary: Support dynamic configurations Key: FLINK-28445 URL: https://issues.apache.org/jira/browse/FLINK-28445 Project: Flink Issue Type: New Feature Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.1.0 It is beneficial in certain scenarios to load operator configurations from multiple ConfigMaps. For example, `kubernetes.operator.watched.namespaces` is a typical property maintained by control planes, while the rest is maintained by the Operator. By allowing the configuration to be loaded from multiple ConfigMaps, the default configuration can be owned by the Operator and other environment-specific overrides by a control plane. This also allows upgrading the Operator independently from control planes. -- This message was sent by Atlassian Jira (v8.20.10#820010)
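The layering described above can be sketched as a simple ordered merge, assuming each ConfigMap has already been read into a `Map<String, String>`. The class name and the example keys other than `kubernetes.operator.watched.namespaces` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: merge configuration layers, later layers win. */
final class LayeredConfig {

    /** Merges the given layers in order; a key in a later layer overrides earlier ones. */
    static Map<String, String> merge(List<Map<String, String>> layers) {
        Map<String, String> effective = new LinkedHashMap<>();
        for (Map<String, String> layer : layers) {
            effective.putAll(layer);
        }
        return effective;
    }

    public static void main(String[] args) {
        // Assumed: defaults owned by the Operator.
        Map<String, String> operatorDefaults = Map.of(
                "kubernetes.operator.watched.namespaces", "default",
                "example.operator.setting", "a");
        // Assumed: overrides owned by a control plane.
        Map<String, String> controlPlaneOverrides = Map.of(
                "kubernetes.operator.watched.namespaces", "team-a,team-b");
        System.out.println(merge(List.of(operatorDefaults, controlPlaneOverrides)));
    }
}
```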
[jira] [Created] (FLINK-28436) test_multi_sessionjob.sh is failing intermittently
Matyas Orhidi created FLINK-28436: - Summary: test_multi_sessionjob.sh is failing intermittently Key: FLINK-28436 URL: https://issues.apache.org/jira/browse/FLINK-28436 Project: Flink Issue Type: Bug Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.1.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.1.0 https://github.com/apache/flink-kubernetes-operator/runs/7222745771?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28389) Correct spec and status updates in FlinkDeploymentControllerTest
Matyas Orhidi created FLINK-28389: - Summary: Correct spec and status updates in FlinkDeploymentControllerTest Key: FLINK-28389 URL: https://issues.apache.org/jira/browse/FLINK-28389 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.1.0 The testing `FlinkDeploymentController` we use in the FlinkDeploymentControllerTest mutates the FlinkDeployment object. This behaviour differs from how it works in real environments, causing inconsistent behaviour in some tests. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28331) Persist status after every observe loop
Matyas Orhidi created FLINK-28331: - Summary: Persist status after every observe loop Key: FLINK-28331 URL: https://issues.apache.org/jira/browse/FLINK-28331 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.1.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.1.0 Make sure we don't lose any status information because of the reconcile logic. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28261) Consider Using Dependent Resources for Ingress
Matyas Orhidi created FLINK-28261: - Summary: Consider Using Dependent Resources for Ingress Key: FLINK-28261 URL: https://issues.apache.org/jira/browse/FLINK-28261 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.1.0 Reporter: Matyas Orhidi JOSDK 3 introduced the concept of dependent resources, which would allow us to handle Ingress in a more JOSDK-native way: see [https://javaoperatorsdk.io/docs/dependent-resources#standalone-dependent-resources]. This functionality could be a good fit for the standalone mode implementation too. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-28223) Add artifact-fetcher to the pod-template.yaml example
Matyas Orhidi created FLINK-28223: - Summary: Add artifact-fetcher to the pod-template.yaml example Key: FLINK-28223 URL: https://issues.apache.org/jira/browse/FLINK-28223 Project: Flink Issue Type: Improvement Reporter: Matyas Orhidi We could improve the pod template example to have an artifact fetcher. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-28186) Trigger Operator Events on Configuration Changes
Matyas Orhidi created FLINK-28186: - Summary: Trigger Operator Events on Configuration Changes Key: FLINK-28186 URL: https://issues.apache.org/jira/browse/FLINK-28186 Project: Flink Issue Type: Improvement Affects Versions: kubernetes-operator-1.1.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.1.0 The Operator can already emit K8s Events related to CRs it manages, but it needs to emit events on important Operator related changes too, e.g. config updates, dynamic namespace changes, etc. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-28166) Configurable Automatic Retries on Error
Matyas Orhidi created FLINK-28166: - Summary: Configurable Automatic Retries on Error Key: FLINK-28166 URL: https://issues.apache.org/jira/browse/FLINK-28166 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.1.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.1.0 Make automatic reconciliation retries configurable. The current behaviour is the default defined in JOSDK: https://javaoperatorsdk.io/docs/features#automatic-retries-on-error -- This message was sent by Atlassian Jira (v8.20.7#820007)
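As an illustration of what "configurable retries" could mean, here is a minimal exponential backoff sketch. The class, its three parameters, and the values used in `main` are assumptions made for this example; they are not JOSDK's API or its defaults.

```java
import java.util.OptionalLong;

/** Hypothetical retry policy with exponential backoff; not the JOSDK implementation. */
final class RetryBackoff {

    private final int maxAttempts;
    private final long initialIntervalMs;
    private final double multiplier;

    RetryBackoff(int maxAttempts, long initialIntervalMs, double multiplier) {
        this.maxAttempts = maxAttempts;
        this.initialIntervalMs = initialIntervalMs;
        this.multiplier = multiplier;
    }

    /**
     * Returns the delay before the given retry (1-based),
     * or an empty value when the retries are exhausted.
     */
    OptionalLong delayBeforeRetry(int attempt) {
        if (attempt < 1 || attempt > maxAttempts) {
            return OptionalLong.empty();
        }
        return OptionalLong.of((long) (initialIntervalMs * Math.pow(multiplier, attempt - 1)));
    }

    public static void main(String[] args) {
        RetryBackoff backoff = new RetryBackoff(3, 1000, 2.0); // assumed example values
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.println(attempt + " -> " + backoff.delayBeforeRetry(attempt));
        }
    }
}
```

In the operator, the three constructor arguments would come from configuration options; their names are left open here because the issue does not specify them.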
[jira] [Created] (FLINK-28141) Document Dynamic Namespaces
Matyas Orhidi created FLINK-28141: - Summary: Document Dynamic Namespaces Key: FLINK-28141 URL: https://issues.apache.org/jira/browse/FLINK-28141 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.1.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.1.0 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-28059) Parallelize e2e tests
Matyas Orhidi created FLINK-28059: - Summary: Parallelize e2e tests Key: FLINK-28059 URL: https://issues.apache.org/jira/browse/FLINK-28059 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.1.0 Reporter: Matyas Orhidi Motivation: * Tests are running in a loop within a single step * It takes 15mins for the e2e tests to finish * We could run 256 parallel tasks instead of the current 6 * Without looking at the logs it is hard to spot/verify which exact tests are running during e2e CI workflows Suggestions: * Let's add the tests into an extra dimension of the test matrix instead of looping * Try to find a way to share the common steps before/after the tests -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27892) More than 1 secondary resource related to primary
Matyas Orhidi created FLINK-27892: - Summary: More than 1 secondary resource related to primary Key: FLINK-27892 URL: https://issues.apache.org/jira/browse/FLINK-27892 Project: Flink Issue Type: Bug Reporter: Matyas Orhidi When submitting `basic-session-job.yaml` in multiple namespaces: {{flink-kubernetes-operator java.lang.IllegalStateException: More than 1 secondary resource related to primary flink-kubernetes-operator at io.javaoperatorsdk.operator.processing.event.source.ResourceEventSource.getSecondaryResource(ResourceEventSource.java:19) flink-kubernetes-operator at io.javaoperatorsdk.operator.api.reconciler.DefaultContext.getSecondaryResource(DefaultContext.java:47) flink-kubernetes-operator at io.javaoperatorsdk.operator.api.reconciler.Context.getSecondaryResource(Context.java:15) flink-kubernetes-operator at org.apache.flink.kubernetes.operator.controller.FlinkSessionJobController.validateSessionJob(FlinkSessionJobController.java:135) flink-kubernetes-operator at org.apache.flink.kubernetes.operator.controller.FlinkSessionJobController.reconcile(FlinkSessionJobController.java:91) flink-kubernetes-operator at org.apache.flink.kubernetes.operator.controller.FlinkSessionJobController.reconcile(FlinkSessionJobController.java:51) flink-kubernetes-operator at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:201) flink-kubernetes-operator at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:153) flink-kubernetes-operator at io.javaoperatorsdk.operator.api.monitoring.Metrics.timeControllerExecution(Metrics.java:34) flink-kubernetes-operator at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:152) flink-kubernetes-operator at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:135) flink-kubernetes-operator at 
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:115) flink-kubernetes-operator at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:86) flink-kubernetes-operator at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:59) flink-kubernetes-operator at io.javaoperatorsdk.operator.processing.event.EventProcessor$ControllerExecution.run(EventProcessor.java:390) flink-kubernetes-operator at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) flink-kubernetes-operator at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) flink-kubernetes-operator at java.base/java.lang.Thread.run(Unknown Source)}} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27871) Dynamic configuration change is undetected on config removal
Matyas Orhidi created FLINK-27871: - Summary: Dynamic configuration change is undetected on config removal Key: FLINK-27871 URL: https://issues.apache.org/jira/browse/FLINK-27871 Project: Flink Issue Type: Bug Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.0.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.1.0 The Operator does not detect when a configuration entry is removed from the configmap. The equals check in *FlinkConfigManager.updateDefaultConfig* incorrectly returns true in this case: {{if (newConf.equals(defaultConfig)) {}} {{LOG.info("Default configuration did not change, nothing to do...");}} {{return;}} {{}}} -- This message was sent by Atlassian Jira (v8.20.7#820007)
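The issue does not say why the equality check misses the removal. The sketch below only illustrates what a removal-aware comparison should report, using plain maps in place of Flink's `Configuration`; the class, its methods, and the key `example.removed.key` are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of a removal-aware configuration comparison. */
final class ConfigDiff {

    /** Keys present in the old configuration but missing from the new one. */
    static Set<String> removedKeys(Map<String, String> oldConf, Map<String, String> newConf) {
        Set<String> removed = new TreeSet<>(oldConf.keySet());
        removed.removeAll(newConf.keySet());
        return removed;
    }

    /** True if any entry was added, removed, or changed. */
    static boolean changed(Map<String, String> oldConf, Map<String, String> newConf) {
        return !oldConf.equals(newConf);
    }

    public static void main(String[] args) {
        Map<String, String> before = Map.of(
                "kubernetes.operator.watched.namespaces", "default",
                "example.removed.key", "value");
        Map<String, String> after = Map.of(
                "kubernetes.operator.watched.namespaces", "default");
        System.out.println(removedKeys(before, after));
        System.out.println(changed(before, after));
    }
}
```

A check like this only works if the new configuration is built from scratch from the ConfigMap content; if it is built by applying the new entries on top of the previous state, removed keys survive and the two sides compare equal.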
[jira] [Created] (FLINK-27812) Support Dynamic change of watched namespaces
Matyas Orhidi created FLINK-27812: - Summary: Support Dynamic change of watched namespaces Key: FLINK-27812 URL: https://issues.apache.org/jira/browse/FLINK-27812 Project: Flink Issue Type: Improvement Reporter: Matyas Orhidi -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27714) Migrate to java-operator-sdk v3
Matyas Orhidi created FLINK-27714: - Summary: Migrate to java-operator-sdk v3 Key: FLINK-27714 URL: https://issues.apache.org/jira/browse/FLINK-27714 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.0.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.1.0 There are a few features we are planning to add to the operator: * Dynamic change of watched namespaces and automatic adjustment of related {{EventSources}} * Improved Error Handling API Also worth evaluating: * Dependent resources management! See the [documentation|https://javaoperatorsdk.io/docs/dependent-resources] for more information * Support for following a set of namespaces in {{InformerEventSource}} and other related improvements. * Removal for need of {{PrimaryToSecondaryMapper}} - now handled automatically for you https://github.com/java-operator-sdk/java-operator-sdk/releases/tag/v3.0.0 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27665) Optimise event triggering on DeploymentFailedExceptions
Matyas Orhidi created FLINK-27665: - Summary: Optimise event triggering on DeploymentFailedExceptions Key: FLINK-27665 URL: https://issues.apache.org/jira/browse/FLINK-27665 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-0.1.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.0.0 Attachments: image-2022-05-17-12-08-42-597.png, image-2022-05-17-12-13-19-489.png Use `EventUtils` when handling `DeploymentFailedExceptions` to avoid appending new events on every reconcile loop: !image-2022-05-17-12-13-19-489.png! -- This message was sent by Atlassian Jira (v8.20.7#820007)
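The intended effect, one event whose count grows rather than a new event per reconcile loop, can be sketched as follows. This is a hypothetical in-memory model for illustration; it does not reproduce the behaviour or signature of the operator's `EventUtils`, and the reason and message strings in `main` are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory model of event de-duplication. */
final class DedupedEvents {

    private final Map<String, Integer> counts = new LinkedHashMap<>();

    /**
     * Emits an event. Returns true if a new event was created,
     * false if an identical event already existed and only its count was increased.
     */
    boolean emit(String reason, String message) {
        return counts.merge(key(reason, message), 1, Integer::sum) == 1;
    }

    /** Number of distinct events stored. */
    int distinctEvents() {
        return counts.size();
    }

    /** How many times the given event has been emitted. */
    int count(String reason, String message) {
        return counts.getOrDefault(key(reason, message), 0);
    }

    private static String key(String reason, String message) {
        return reason + "|" + message;
    }

    public static void main(String[] args) {
        DedupedEvents events = new DedupedEvents();
        for (int loop = 0; loop < 3; loop++) {
            events.emit("DeploymentFailed", "example failure message"); // assumed strings
        }
        System.out.println(events.distinctEvents() + " event(s), count "
                + events.count("DeploymentFailed", "example failure message"));
    }
}
```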
[jira] [Created] (FLINK-27609) Tracking flink-version and flink-revision in FlinkDeploymentStatus
Matyas Orhidi created FLINK-27609: - Summary: Tracking flink-version and flink-revision in FlinkDeploymentStatus Key: FLINK-27609 URL: https://issues.apache.org/jira/browse/FLINK-27609 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-0.1.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.0.0 The REST API can provide accurate versioning information through the config endpoint: [https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#config] The operator should propagate such fields in the status: * flink-version * flink-revision This greatly improves the ability to identify problematic Flink versions (CVE affected, deprecated, etc.) in managed environments. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27573) Configuring a new random job result store directory
Matyas Orhidi created FLINK-27573: - Summary: Configuring a new random job result store directory Key: FLINK-27573 URL: https://issues.apache.org/jira/browse/FLINK-27573 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Matyas Orhidi Create a random job result store directory to work around: https://issues.apache.org/jira/browse/FLINK-27569 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27520) Use admission-controller-framework in Webhook
Matyas Orhidi created FLINK-27520: - Summary: Use admission-controller-framework in Webhook Key: FLINK-27520 URL: https://issues.apache.org/jira/browse/FLINK-27520 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-0.1.0 Reporter: Matyas Orhidi Use the released [https://github.com/java-operator-sdk/admission-controller-framework] instead of the borrowed source code in the Webhook module. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27468) Observing JobManager deployment. Previous status: MISSING
Matyas Orhidi created FLINK-27468: - Summary: Observing JobManager deployment. Previous status: MISSING Key: FLINK-27468 URL: https://issues.apache.org/jira/browse/FLINK-27468 Project: Flink Issue Type: Bug Components: Kubernetes Operator Affects Versions: kubernetes-operator-0.1.0 Reporter: Matyas Orhidi The operator keeps looping if the K8s deployment gets deleted (and probably when the job is in a terminal Flink state such as FAILED). We need to agree on how to handle such cases and fix it. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27190) Revisit error handling in main reconcile() loop
Matyas Orhidi created FLINK-27190: - Summary: Revisit error handling in main reconcile() loop Key: FLINK-27190 URL: https://issues.apache.org/jira/browse/FLINK-27190 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.0.0 There are some improvements introduced around error handling: * [https://github.com/java-operator-sdk/java-operator-sdk/pull/1033] in the upcoming java-operator-sdk release [v3.0.0.RC1.|https://github.com/java-operator-sdk/java-operator-sdk/releases/tag] We should revisit and further simplify the error logic in {{FlinkDeploymentController.reconcile()}}. Currently: * checked exceptions are wrapped in runtime exceptions * validation errors are terminal errors but are handled differently -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26973) Emit events on state transitions for FlinkDeployment
Matyas Orhidi created FLINK-26973: - Summary: Emit events on state transitions for FlinkDeployment Key: FLINK-26973 URL: https://issues.apache.org/jira/browse/FLINK-26973 Project: Flink Issue Type: Improvement Reporter: Matyas Orhidi To improve observability we should emit Events during the lifecycle of FlinkDeployments -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26953) Introduce Operator Specific Metrics
Matyas Orhidi created FLINK-26953: - Summary: Introduce Operator Specific Metrics Key: FLINK-26953 URL: https://issues.apache.org/jira/browse/FLINK-26953 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Matyas Orhidi Beyond the basic JVM metrics the Operator currently exposes, it could report further Operator specific metrics, e.g.: * total number of deployments * number of active/failed jobs * etc. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26916) The Operator Ignores job related changes (jar, parallelism) during last-state upgrades
Matyas Orhidi created FLINK-26916: - Summary: The Operator Ignores job related changes (jar, parallelism) during last-state upgrades Key: FLINK-26916 URL: https://issues.apache.org/jira/browse/FLINK-26916 Project: Flink Issue Type: Bug Components: Kubernetes Operator Reporter: Matyas Orhidi Root cause: the old JobGraph is reused when resuming. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26866) Expired cert during Helm installation
Matyas Orhidi created FLINK-26866: - Summary: Expired cert during Helm installation Key: FLINK-26866 URL: https://issues.apache.org/jira/browse/FLINK-26866 Project: Flink Issue Type: Bug Components: Deployment / Kubernetes Reporter: Matyas Orhidi I have had a minikube cluster running for a while. Although the cert manager seems OK on it and the operator comes up, the helm installation reports a concerning error: {{helm install flink-operator helm/flink-operator}} {{Error: INSTALLATION FAILED: failed to create resource: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": x509: certificate has expired or is not yet valid: current time 2022-03-25T11:01:46Z is after 2022-03-21T08:31:13Z}} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26862) Link the Github repository from Operator documentation
Matyas Orhidi created FLINK-26862: - Summary: Link the Github repository from Operator documentation Key: FLINK-26862 URL: https://issues.apache.org/jira/browse/FLINK-26862 Project: Flink Issue Type: Sub-task Reporter: Matyas Orhidi -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26817) Update ingress docs with templating examples
Matyas Orhidi created FLINK-26817: - Summary: Update ingress docs with templating examples Key: FLINK-26817 URL: https://issues.apache.org/jira/browse/FLINK-26817 Project: Flink Issue Type: Sub-task Reporter: Matyas Orhidi -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26765) Document RBAC model
Matyas Orhidi created FLINK-26765: - Summary: Document RBAC model Key: FLINK-26765 URL: https://issues.apache.org/jira/browse/FLINK-26765 Project: Flink Issue Type: Sub-task Reporter: Matyas Orhidi -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26706) Introduce Ingress URL templating
Matyas Orhidi created FLINK-26706: - Summary: Introduce Ingress URL templating Key: FLINK-26706 URL: https://issues.apache.org/jira/browse/FLINK-26706 Project: Flink Issue Type: Sub-task Reporter: Matyas Orhidi Instead of the current basic `ingressDomain`-based approach, we could introduce a more advanced templating mechanism. Check the Spark Operator's approach for reference: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md#driver-ui-access-and-ingress This would eliminate the need to create wildcard DNS entries like `*.example.com`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
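To illustrate what such a templating mechanism might look like, here is a minimal sketch that substitutes named placeholders in an ingress URL template. The double-brace placeholder syntax, the variable names {{name}} and {{namespace}}, and the class {{IngressTemplate}} are all assumptions made for the example; the issue does not specify any of them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical template renderer; placeholder syntax and names are assumptions. */
final class IngressTemplate {
    // Matches placeholders of the form {{identifier}}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    /** Replaces every placeholder with its value; unknown placeholders are rejected. */
    static String render(String template, Map<String, String> vars) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Unknown placeholder: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in values from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("name", "basic-example");
        vars.put("namespace", "default");
        // prints "flink.example.com/default/basic-example"
        System.out.println(render("flink.example.com/{{namespace}}/{{name}}", vars));
    }
}
```

With a path-based template like the one above, a single DNS entry for the host is enough, which is the benefit the issue describes.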
[jira] [Created] (FLINK-26663) Pod augmentation for the operator
Matyas Orhidi created FLINK-26663: - Summary: Pod augmentation for the operator Key: FLINK-26663 URL: https://issues.apache.org/jira/browse/FLINK-26663 Project: Flink Issue Type: Sub-task Reporter: Matyas Orhidi Currently we provide no convenient way to augment the operator pod itself. It'd be great if we could add something similar to the pod templating mechanism used in Flink core. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26659) Document UI access via Ingress
Matyas Orhidi created FLINK-26659: - Summary: Document UI access via Ingress Key: FLINK-26659 URL: https://issues.apache.org/jira/browse/FLINK-26659 Project: Flink Issue Type: Sub-task Reporter: Matyas Orhidi -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26637) Document Basic Concepts and Architecture
Matyas Orhidi created FLINK-26637: - Summary: Document Basic Concepts and Architecture Key: FLINK-26637 URL: https://issues.apache.org/jira/browse/FLINK-26637 Project: Flink Issue Type: Sub-task Reporter: Matyas Orhidi -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26546) Extract Observer Interface
Matyas Orhidi created FLINK-26546: - Summary: Extract Observer Interface Key: FLINK-26546 URL: https://issues.apache.org/jira/browse/FLINK-26546 Project: Flink Issue Type: Sub-task Components: Kubernetes Operator Reporter: Matyas Orhidi Similarly to the Reconciler Interface we should extract the Observer interface. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26472) Introduce Savepoint object in JobStatus
Matyas Orhidi created FLINK-26472: - Summary: Introduce Savepoint object in JobStatus Key: FLINK-26472 URL: https://issues.apache.org/jira/browse/FLINK-26472 Project: Flink Issue Type: Sub-task Components: Kubernetes Operator Reporter: Matyas Orhidi We currently store only the `savepointLocation` as a String in the JobStatus. It would be beneficial to introduce a Savepoint object with a few additional fields instead: * {{String location}} * {{String timestamp}} * {{boolean success}} * {{String error}} -- This message was sent by Atlassian Jira (v8.20.1#820001)
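The proposed object could be sketched as a small immutable class carrying the four fields listed above. Only the field names and types come from the issue; the factory methods {{succeeded}} and {{failed}} and the getter names are illustrative assumptions.

```java
/** Sketch of the proposed Savepoint object; factories and getters are assumptions. */
final class Savepoint {
    private final String location;
    private final String timestamp;
    private final boolean success;
    private final String error;

    private Savepoint(String location, String timestamp, boolean success, String error) {
        this.location = location;
        this.timestamp = timestamp;
        this.success = success;
        this.error = error;
    }

    /** A completed savepoint: it has a location and no error. */
    static Savepoint succeeded(String location, String timestamp) {
        return new Savepoint(location, timestamp, true, null);
    }

    /** A failed savepoint attempt: it has an error message and no location. */
    static Savepoint failed(String timestamp, String error) {
        return new Savepoint(null, timestamp, false, error);
    }

    String getLocation() { return location; }
    String getTimestamp() { return timestamp; }
    boolean isSuccess() { return success; }
    String getError() { return error; }

    public static void main(String[] args) {
        // The location and timestamp below are made-up sample values.
        Savepoint sp = Savepoint.succeeded("s3://bucket/savepoints/sp-1", "2022-03-04T10:00:00Z");
        System.out.println(sp.isSuccess() + " " + sp.getLocation());
    }
}
```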
[jira] [Created] (FLINK-26328) Control Logging Behavior in Flink Deployments
Matyas Orhidi created FLINK-26328: - Summary: Control Logging Behavior in Flink Deployments Key: FLINK-26328 URL: https://issues.apache.org/jira/browse/FLINK-26328 Project: Flink Issue Type: Sub-task Reporter: Matyas Orhidi Looking at [https://github.com/spotify/flink-on-k8s-operator/blob/master/docs/user_guide.md#control-logging-behavior], something similar could work here as well: {quote}The default logging configuration provided by the operator sends logs from JobManager and TaskManager to {{stdout}}. This has the effect of making it so that logging from Flink workloads running on Kubernetes behaves like every other Kubernetes pod. Your Flink logs should be stored wherever you generally expect to see your container logs in your environment. Sometimes, however, this is not a good fit. An example of when you might want to customize logging behavior is to restore the visibility of logs in the Flink JobManager web interface. Or you might want to ship logs directly to a different sink, or using a different formatter. You can use the {{spec.logConfig}} field to fully control the log4j and logback configuration. It is a string-to-string map, whose keys and values become filenames and contents (respectively) in the folder {{/opt/flink/conf}} in each container. The default Flink docker entrypoint expects this directory to contain two files: {{log4j-console.properties}} and {{logback-console.xml}}. {quote} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26257) Document metrics configuration for Prometheus
Matyas Orhidi created FLINK-26257: - Summary: Document metrics configuration for Prometheus Key: FLINK-26257 URL: https://issues.apache.org/jira/browse/FLINK-26257 Project: Flink Issue Type: Sub-task Reporter: Matyas Orhidi -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26157) Containers Should Not Run As Root
Matyas Orhidi created FLINK-26157: - Summary: Containers Should Not Run As Root Key: FLINK-26157 URL: https://issues.apache.org/jira/browse/FLINK-26157 Project: Flink Issue Type: Sub-task Reporter: Matyas Orhidi Processes in a container should not run as root. Create a user in the Dockerfile with a known UID:GID (e.g. flink:flink) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-13957) Redact passwords from dynamic properties on job submission
Matyas Orhidi created FLINK-13957: - Summary: Redact passwords from dynamic properties on job submission Key: FLINK-13957 URL: https://issues.apache.org/jira/browse/FLINK-13957 Project: Flink Issue Type: Improvement Components: Client / Job Submission Affects Versions: 1.9.0 Reporter: Matyas Orhidi Fix For: 1.9.1 SSL related passwords specified by dynamic properties are showing up in {{FlinkYarnSessionCli}} logs in plain text: {{19/09/04 04:57:43 INFO cli.FlinkYarnSessionCli: Dynamic Property set: security.ssl.internal.truststore-password=changeit}} {{19/09/04 04:57:43 INFO cli.FlinkYarnSessionCli: Dynamic Property set: security.ssl.internal.keystore-password=changeit}} {{19/09/04 04:57:43 INFO cli.FlinkYarnSessionCli: Dynamic Property set: security.ssl.internal.key-password=changeit}} -- This message was sent by Atlassian Jira (v8.3.2#803003)
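A minimal sketch of the kind of redaction the issue asks for: mask the value of any dynamic property whose key looks sensitive before it is written to the log. The class {{DynamicPropertyRedactor}}, the list of sensitive markers, and the mask string are assumptions for illustration; they are not taken from the Flink code base.

```java
import java.util.Locale;

/** Hypothetical log redactor; marker list and mask string are assumptions. */
final class DynamicPropertyRedactor {
    private static final String[] SENSITIVE_MARKERS = {"password", "secret"};
    private static final String MASK = "******";

    /** Returns true if the key contains one of the sensitive markers (case-insensitive). */
    static boolean isSensitive(String key) {
        String lower = key.toLowerCase(Locale.ROOT);
        for (String marker : SENSITIVE_MARKERS) {
            if (lower.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    /** Formats a key/value pair for logging, masking the value of sensitive keys. */
    static String forLog(String key, String value) {
        return key + "=" + (isSensitive(key) ? MASK : value);
    }

    public static void main(String[] args) {
        // prints "Dynamic Property set: security.ssl.internal.key-password=******"
        System.out.println("Dynamic Property set: "
                + forLog("security.ssl.internal.key-password", "changeit"));
        // prints "Dynamic Property set: taskmanager.numberOfTaskSlots=4"
        System.out.println("Dynamic Property set: "
                + forLog("taskmanager.numberOfTaskSlots", "4"));
    }
}
```

Matching on key substrings covers all three properties from the log excerpt above, since each key ends in "-password".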