villebro opened a new pull request, #99: URL: https://github.com/apache/superset-kubernetes-operator/pull/99

## Summary

Building a comprehensive Development-mode sample that exercises nearly every component at once under a hardened (`runAsNonRoot`) pod security context turned up a series of operator gaps. This PR fixes them so the operator works under restricted Pod Security Standards, recovers stuck lifecycle tasks on its own, runs Celery Flower without crash-looping, and routes all components through a single Ingress the same way Gateway API already does. It also adds the comprehensive sample and the local-testing tooling that found everything.

Guiding principle: lifecycle tasks and component plumbing are implementation details users shouldn't have to babysit. If something can't start, the operator should make it work where it reasonably can, and otherwise surface a clear reason. It should never require a human to `kubectl delete job`.

## What changed: operator

1. **Helper containers run non-root by default.** The `create-database` init container (postgres/mysql client) defaults to a non-root `runAsUser` (postgres=70, mysql=999) when no UID is pinned, so it satisfies an inherited `runAsNonRoot: true` instead of failing with `CreateContainerConfigError`. The maintenance page ships an `nginx.conf` that moves the pid/temp paths to `/tmp` and defaults to a hardened non-root securityContext. Explicit user UIDs win.
2. **Wedged task Jobs self-heal.** Task Jobs carry a pod-spec-hash annotation. The controller detects un-startable task pods (`CreateContainerConfigError`, image-pull failures, `Unschedulable`) and, when the desired pod spec changed since the Job was created, deletes and recreates it from the current spec, with no manual `kubectl delete`. When the spec is unchanged it records a `TaskCannotStart` condition + event and waits instead of silently looping.
3. **Terminally-failed tasks retry on a pod-spec change.** A task that exhausted its retries reruns once the user fixes how its pod runs; a purely application-level failure (unchanged pod) stays terminal so it can't loop.
4. **Celery Flower probes use `/healthcheck` under the URL prefix.** Flower serves under `--url_prefix` (default `/flower`) and gates `/api/*` behind auth in 2.0; the probes were hardcoded to `/api/workers`, which 404'd then 401'd → CrashLoopBackOff. Probes now hit `<prefix>/healthcheck` (200, no auth).
5. **Ingress routes all components by path, symmetric with Gateway API.** A shared `componentRoutes` helper feeds both `reconcileHTTPRoute` and `reconcileIngress`. An Ingress host with no explicit `paths` now fans out (`/` → web, `/flower`, `/mcp`, `/ws` for present components), forwarding as-is (no rewrite) so each component owns its subpath (e.g. Flower via `--url_prefix`). A host with explicit `paths` stays a web-server override.

## What changed: API, samples & docs

- **WebsocketServer marked experimental** (Go doc comments → `api-reference.md`, `index.md`, configuration guide): it needs a custom Node.js image and its routing is unvalidated.
- `config/samples/superset_v1alpha1_superset_dev_full.yaml`: comprehensive, hardened dev sample (web, Celery worker/beat/flower, MCP, template processing, thumbnails, alerts & reports, `createDatabase`, maintenance page, NodePort + Ingress, `runAsNonRoot` with numeric `runAsUser: 1000`).
- `config/samples/dev-dependencies.yaml`: throwaway Postgres + Valkey for Kind.
- `config/samples/kind-cluster.yaml`: Kind config with `extraPortMappings` + `ingress-ready` so the sample is reachable through ingress-nginx without port-forwards.
- Docs: lifecycle self-heal/retry behavior, create-database non-root default, and the Ingress per-component fan-out; CEL message + API reference regenerated.

## Testing

- New unit tests: pod startup-error detection, pod-spec hash sensitivity, self-heal vs surface, terminal-retry detection, non-root helper securityContext defaulting, maintenance non-root, Flower health-path prefixing, component-route fan-out, and Ingress multi-component vs explicit-paths behavior.
- `make lint` clean (0 issues); `go build ./...`, full `internal/...` unit tests, integration-tagged compilation, `make codegen` (no stale artifacts), and `kubectl kustomize config/samples` all pass.
- Manually verified on Kind: migrate with `createDatabase` under `runAsNonRoot`, self-heal of a wedged migrate Job after a spec fix, Flower staying up, MCP running, and single-URL routing to web/flower/mcp through one Ingress host.

## Notes / follow-ups

- The sample's Flower pod installs `flower` at startup (`uv pip install`), because it is not in the stock image, and runs as root; a comment points to the production path (custom image). The same class of "stock image lacks optional deps" issue applies to **alerts & reports / thumbnails**, which need a browser-equipped worker image (Playwright; Selenium's auto-driver has no arm64 build). This is documented as a caveat and left to a follow-up image.
- Celery worker/beat have no probes (no HTTP surface); an exec `celery inspect ping` probe is a possible follow-up.
- Subpath rewriting for non-prefix-aware components is intentionally avoided (it breaks UIs like Flower); components own their subpath instead.
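
For reviewers, the self-heal rule from operator item 2 can be read as a small decision function. The sketch below is illustrative only and is not the operator's code: `PodSpec`, `hashPodSpec`, `decide`, the `Action` values, and the exact set of startup-error reasons (`ImagePullBackOff` and `ErrImagePull` stand in for "image-pull failures") are all assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// PodSpec is a hypothetical, trimmed-down stand-in for a real pod spec.
type PodSpec struct {
	Image     string
	RunAsUser *int64
}

// hashPodSpec returns a hex digest of the spec. Struct fields marshal in
// declaration order, so equal specs always produce equal hashes.
func hashPodSpec(spec PodSpec) string {
	b, err := json.Marshal(spec)
	if err != nil {
		panic(err) // cannot happen for this plain struct
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// Action is what the controller would do with a task Job (hypothetical names).
type Action string

const (
	ActionWait     Action = "Wait"
	ActionRecreate Action = "Recreate"
	ActionSurface  Action = "SurfaceTaskCannotStart"
)

// startupErrors lists pod reasons treated as "cannot start" (assumed set).
var startupErrors = map[string]bool{
	"CreateContainerConfigError": true,
	"ImagePullBackOff":           true,
	"ErrImagePull":               true,
	"Unschedulable":              true,
}

// decide compares the hash recorded on the Job with the hash of the
// currently desired spec, but only when the pod is stuck at startup.
func decide(podReason, jobHash string, desired PodSpec) Action {
	if !startupErrors[podReason] {
		return ActionWait // pod is running or failing for another reason
	}
	if hashPodSpec(desired) != jobHash {
		return ActionRecreate // spec was fixed since the Job was created
	}
	return ActionSurface // same spec would fail again; report and wait
}

func main() {
	uid := int64(1000)
	old := PodSpec{Image: "example/superset:dev"}
	fixed := PodSpec{Image: "example/superset:dev", RunAsUser: &uid}
	jobHash := hashPodSpec(old)

	fmt.Println(decide("CreateContainerConfigError", jobHash, fixed)) // Recreate
	fmt.Println(decide("CreateContainerConfigError", jobHash, old))   // SurfaceTaskCannotStart
	fmt.Println(decide("", jobHash, old))                             // Wait
}
```

Keying the decision on the hash is what keeps the loop bounded in this sketch: an unchanged spec never triggers a recreate, so a Job that cannot start is reported once rather than deleted and recreated forever.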
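
The Ingress fan-out and the Flower probe path from operator items 4 and 5 can be sketched the same way. This is a minimal illustration under stated assumptions, not the PR's implementation: the PR names a `componentRoutes` helper, but the signature used here, the `Route` and `Components` types, the backend service names, and the `flowerHealthPath` function are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Route maps a path prefix to a backend service (hypothetical shape).
type Route struct {
	Path    string
	Service string
}

// Components records which optional components are present (hypothetical).
type Components struct {
	Flower    bool
	MCP       bool
	Websocket bool
}

// componentRoutes fans a host out to one route per present component.
// Paths are forwarded as-is, so each backend must serve under its own prefix.
func componentRoutes(c Components) []Route {
	routes := []Route{{Path: "/", Service: "web"}}
	if c.Flower {
		routes = append(routes, Route{Path: "/flower", Service: "flower"})
	}
	if c.MCP {
		routes = append(routes, Route{Path: "/mcp", Service: "mcp"})
	}
	if c.Websocket {
		routes = append(routes, Route{Path: "/ws", Service: "websocket"})
	}
	return routes
}

// flowerHealthPath builds the probe path under Flower's URL prefix,
// tolerating prefixes written with or without surrounding slashes.
func flowerHealthPath(urlPrefix string) string {
	p := strings.Trim(urlPrefix, "/")
	if p == "" {
		return "/healthcheck"
	}
	return "/" + p + "/healthcheck"
}

func main() {
	for _, r := range componentRoutes(Components{Flower: true, MCP: true}) {
		fmt.Printf("%s -> %s\n", r.Path, r.Service)
	}
	fmt.Println(flowerHealthPath("/flower")) // /flower/healthcheck
}
```

Because nothing is rewritten, the probe path and the routed path share the same prefix, which is why the sketch derives the health path from the prefix instead of hardcoding it.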
