[ 
https://issues.apache.org/jira/browse/FLINK-39571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis-Mircea Ciupitu updated FLINK-39571:
------------------------------------------
    Description: 
h1. Summary

The Flink Kubernetes Operator is a mature project used in production by a large 
number of organizations, but its documentation has not kept pace with the 
codebase. Today the docs cover installation ({{helm.md}}), CRD schema 
({{reference.md}}, {{overview.md}}), and metrics/logging well, but stay 
thin on the operational concerns that determine whether an operator can be 
safely run in production: high availability, security posture, the Kubernetes 
ConfigMaps the operator creates and manages, the event taxonomy, day-2 
troubleshooting, and the full surface of the autoscaler. As a result, users 
frequently have to read the source code or rely on community channels for 
answers that should live in the official docs.

The goal of this umbrella is to bring the operator documentation closer to the 
depth and structure of the Flink main documentation itself, where each 
operational concern (state backends, HA, security, deployment, monitoring, 
debugging) has a dedicated, narrative-driven section rather than being 
scattered across pages.
h1. New Pages
 - *State* - Document the Kubernetes ConfigMaps the operator creates and 
manages per Flink resource.

 - *Events* - Document the event taxonomy emitted by the operator (submit, 
recovery, scaling, snapshot, validation, etc.), deduplication semantics, and 
how the events can be consumed.

 - *High Availability* - Document the operator-side leader election, replica 
topology, recovery semantics, limitations, and configuration reference.

 - *Security* - Document the RBAC scoping, operator -> Flink REST mTLS, 
truststore/keystore management, Kerberos auth, webhook TLS, and secrets 
handling for credentials.

 - *Debugging* - Document common failure modes, how to interpret status fields 
and reconciler logs, runbooks for stuck reconciliations, and diagnostic 
configuration toggles.

 - *Production Readiness Checklist* - Single-page checklist consolidating HA, 
security, resource sizing, observability, upgrade strategy, and disaster 
recovery, modeled on similar pages in other mature Kubernetes operators.

h1. Pages to be updated
 - *Configuration* - The current page does not explain which properties 
hot-reload and which require an operator restart, documents only a fraction of 
the operator's actual environment variables, lacks guidance on the YAML 
configuration format and its limitations, and groups Leader Election and High 
Availability with general configuration instead of giving them a dedicated page.
 - *Autoscaler* - Expand the existing 358-line page to cover more of the 
internal semantics, scaling cooldowns, exclusion semantics, scaling history 
persistence, and additional advanced features that end users can enable.

  was:
h1. Summary

The Flink Kubernetes Operator is a mature project used in production by a large 
number of organizations, but its documentation has not kept pace with the 
codebase. Today the docs cover installation ({{helm.md}}), CRD schema 
({{reference.md}}, {{overview.md}}), and metrics/logging well, but stay thin on 
the operational concerns that determine whether an operator can be safely run 
in production: high availability, security posture, the Kubernetes ConfigMaps 
the operator creates and manages, the event taxonomy, day-2 troubleshooting, 
and the full surface of the autoscaler. As a result, users frequently have to 
read the source code or rely on community channels for answers that should live 
in the official docs.

The goal of this umbrella is to bring the operator documentation closer to the 
depth and structure of the Flink main documentation itself, where each 
operational concern (state backends, HA, security, deployment, monitoring, 
debugging) has a dedicated, narrative-driven section rather than being 
scattered across pages.

h1. New Sections

- *State*, the Kubernetes ConfigMaps the operator creates and manages per Flink 
resource: HA ConfigMaps (leader info and last-completed-checkpoint pointers), 
the autoscaler state ConfigMap, and auxiliary ConfigMaps (flink-conf, 
pod-template, log4j). For each: naming, ownership, lifecycle across 
{{last-state}} upgrades and deletes, relevant configuration knobs, and recovery 
procedures when a ConfigMap is lost or corrupted. State backends and the 
{{FlinkStateSnapshot}} CR are cross-linked to the upstream Flink docs rather 
than re-documented.

- *Events*, the event taxonomy emitted by the operator (submit, recovery, 
scaling, snapshot, validation, etc.), deduplication semantics, and how to 
consume.

- *High Availability*, operator-side leader election, replica topology, 
JobManager HA via Kubernetes ConfigMaps, recovery semantics, and configuration 
reference.

- *Security*, RBAC scoping, operator -> Flink REST mTLS, truststore/keystore 
management, Kerberos auth, webhook TLS, and secrets handling for credentials.

- *Debugging*, common failure modes, how to interpret status fields and 
reconciler logs, runbooks for stuck reconciliations, and diagnostic 
configuration toggles.

- *Production Readiness Checklist*, single-page checklist consolidating HA, 
security, resource sizing, observability, upgrade strategy, and disaster 
recovery, modeled on similar pages in other mature Kubernetes operators.

h1. Updated Sections

- *Autoscaler*, expand the existing 358-line page to cover more internal 
semantics, scaling cooldowns, exclusion semantics, scaling history persistence, 
and additional advanced features that can be enabled by the end users.




> Improve Flink Kubernetes Operator documentation coverage
> --------------------------------------------------------
>
>                 Key: FLINK-39571
>                 URL: https://issues.apache.org/jira/browse/FLINK-39571
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.14.0
>            Reporter: Dennis-Mircea Ciupitu
>            Assignee: Dennis-Mircea Ciupitu
>            Priority: Major
>             Fix For: kubernetes-operator-1.16.0
>
>
> h1. Summary
> The Flink Kubernetes Operator is a mature project used in production by a 
> large number of organizations, but its documentation has not kept pace with 
> the codebase. Today the docs cover installation ({{helm.md}}), CRD schema 
> ({{reference.md}}, {{overview.md}}), and metrics/logging well, but 
> stay thin on the operational concerns that determine whether an operator can 
> be safely run in production: high availability, security posture, the 
> Kubernetes ConfigMaps the operator creates and manages, the event taxonomy, 
> day-2 troubleshooting, and the full surface of the autoscaler. As a result, 
> users frequently have to read the source code or rely on community channels 
> for answers that should live in the official docs.
> The goal of this umbrella is to bring the operator documentation closer to 
> the depth and structure of the Flink main documentation itself, where each 
> operational concern (state backends, HA, security, deployment, monitoring, 
> debugging) has a dedicated, narrative-driven section rather than being 
> scattered across pages.
> h1. New Pages
>  - *State* - Document the Kubernetes ConfigMaps the operator creates and 
> manages per Flink resource.
>  - *Events* - Document the event taxonomy emitted by the operator (submit, 
> recovery, scaling, snapshot, validation, etc.), deduplication semantics, and 
> how the events can be consumed.
>  - *High Availability* - Document the operator-side leader election, replica 
> topology, recovery semantics, limitations, and configuration reference.
>  - *Security* - Document the RBAC scoping, operator -> Flink REST mTLS, 
> truststore/keystore management, Kerberos auth, webhook TLS, and secrets 
> handling for credentials.
>  - *Debugging* - Document common failure modes, how to interpret status 
> fields and reconciler logs, runbooks for stuck reconciliations, and 
> diagnostic configuration toggles.
>  - *Production Readiness Checklist* - Single-page checklist consolidating HA, 
> security, resource sizing, observability, upgrade strategy, and disaster 
> recovery, modeled on similar pages in other mature Kubernetes operators.
> h1. Pages to be updated
>  - *Configuration* - The current page does not explain which properties 
> hot-reload and which require an operator restart, documents only a fraction 
> of the operator's actual environment variables, lacks guidance on the YAML 
> configuration format and its limitations, and groups Leader Election and High 
> Availability with general configuration instead of giving them a dedicated 
> page.
>  - *Autoscaler* - Expand the existing 358-line page to cover more of the 
> internal semantics, scaling cooldowns, exclusion semantics, scaling history 
> persistence, and additional advanced features that end users can enable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
