jykae opened a new issue, #68072:
URL: https://github.com/apache/airflow/issues/68072

   ### Description
   
   Today the official Airflow Helm chart's `migrateDatabaseJob` only runs 
`airflow db migrate`, which moves the schema forward. A `helm upgrade` that 
targets an **older** `airflowVersion` than the one currently running leaves the 
metadata DB schema ahead of the newly deployed image, and the api-server pod 
fails to start. The chart should reconcile the DB schema in **both** directions 
(upgrade *and* downgrade) based on the requested `airflowVersion`, behind an 
explicit opt-in.
   
   ### Use case/motivation
   
   We operate Airflow on Kubernetes via this chart and ship to multiple 
environments through CI. We need rollback to be a first-class operation:
   
   - Deploying an older release tag (`helm upgrade` with an older 
`airflowVersion`) should bring the cluster — schema included — back to that 
release.
   - Today this requires out-of-band tooling: detecting current vs target 
version, then `kubectl exec`ing into the still-running old api-server pod and 
invoking `airflow db downgrade --to-version <target> --yes` **before** helm 
starts rolling the new image. We've implemented this as a workflow step driven 
by a small bash script, but it duplicates logic for every team using the chart 
and only helps people deploying via GitHub Actions — not ArgoCD, not manual 
helm.
   
   A chart-native solution would mean: set `airflowVersion: <older>` in values, 
run `helm upgrade`, the chart reconciles the schema, the new (older) pods come 
up.
   
   #### Hard constraint that shapes the design
   
   `airflow db downgrade --to-version X.Y.Z` requires the alembic revision 
scripts for **every revision between the current head and the target**. A 
revision script ships only in images from the version that **introduced** it 
onward, so an older image lacks every revision added after it. So:
   
   | Direction | Image that must run the operation | Why |
   |---|---|---|
   | Upgrade (current < target) | **Target** image | Forward revisions ship in the target image |
   | Downgrade (current > target) | **Currently running** image | Reverse-direction code for revisions to undo only exists in the current image |
   | Same | none | No-op |
   
   Today's `migrateDatabaseJob` is correct for the first row only. A 
pre-upgrade hook running with `airflow_image_for_migrations` (the target image) 
cannot perform a downgrade — the target image doesn't carry the scripts that 
need to be reversed.
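
The direction table above reduces to a three-way version comparison. A minimal Python sketch, assuming plain `X.Y.Z` version strings without pre-release suffixes; `parse_version` and `image_for_operation` are hypothetical names for illustration, not existing chart or Airflow helpers:

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a plain 'X.Y.Z' string into a tuple of ints (assumption: no suffixes)."""
    return tuple(int(part) for part in text.split("."))


def image_for_operation(current: str, target: str) -> Optional[str]:
    """Return which image must run the schema operation, or None for a no-op."""
    cur, tgt = parse_version(current), parse_version(target)
    if cur < tgt:
        return "target"   # upgrade: forward revisions ship in the target image
    if cur > tgt:
        return "current"  # downgrade: only the running image has the revisions to undo
    return None           # same version: nothing to do
```

Comparing integer tuples rather than raw strings keeps `3.0.10` ordered after `3.0.9`.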
   
   #### Proposed design
   
   Add a new opt-in **pre-upgrade** Job, `db-downgrade-job`, executing before 
the existing `migrateDatabaseJob`:
   
   ```
   helm.sh/hook                 helm.sh/hook-weight   Job
   ──────────────────────────   ───────────────────   ──────────────────────────────────────────
   pre-upgrade                  -10                   db-downgrade-job        (NEW)
   post-install, post-upgrade   1                     migrate-database-job    (existing, unchanged)
   ```
   
   Job behaviour (pseudocode):
   
   1. `target = .Values.airflowVersion`
   2. `current = discover_current_version()`
   3. if `current is None` or `current <= target`: exit 0
   4. if not `migrateDatabaseJob.allowDowngrade`: exit non-zero with a clear 
"downgrade required but disabled" message
   5. discover the currently-running api-server (or scheduler) Deployment's pod 
image — that is the image that wrote the current schema
   6. `kubectl exec` into that pod and run `airflow db downgrade --to-version 
<target> --yes`
   7. exit with the downgrade's status
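
The steps above could be sketched in Python roughly as follows. This only plans the action and runs nothing; `plan_downgrade`, its parameters, and the returned tuple shape are assumptions made for illustration, and versions are again assumed to be plain `X.Y.Z` strings:

```python
from typing import List, Optional, Tuple

# (exit_code, message, command); exit_code None means "exit with the command's status"
Plan = Tuple[Optional[int], str, Optional[List[str]]]


def _version_key(version: str) -> Tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))


def plan_downgrade(
    current: Optional[str],
    target: str,
    allow_downgrade: bool,
    pod: str,
    namespace: str,
) -> Plan:
    """Decide what the hypothetical db-downgrade-job should do."""
    # Step 3: fresh database, same version, or upgrade: leave it to the migrate job.
    if current is None or _version_key(current) <= _version_key(target):
        return (0, "no downgrade needed", None)
    # Step 4: a downgrade is required but the operator has not opted in.
    if not allow_downgrade:
        return (1, "downgrade required but disabled", None)
    # Steps 5-7: run the downgrade inside the pod that still has the current image.
    command = [
        "kubectl", "exec", "-n", namespace, pod, "--",
        "airflow", "db", "downgrade", "--to-version", target, "--yes",
    ]
    return (None, "downgrade planned", command)
```

Keeping the decision separate from the `kubectl exec` call would let the unit tests cover every branch without a cluster.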
   
   Discovery options for `current` (we'd pick one — preference: alembic-table 
to avoid needing more RBAC for discovery):
   
   - Query `alembic_version` table + ship a small alembic-rev → Airflow-version 
map alongside `appVersion` bumps.
   - Read `Deployment/<release>-api-server` pod spec image (requires 
`deployments.get`).
   - `kubectl exec -- airflow version` on the running pod (requires 
`pods/exec`).
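
For the alembic-table option, the lookup itself is small. A sketch in Python, assuming the revision has already been read from the `alembic_version` table; the map entries below are placeholders rather than real Airflow revision IDs, and `version_for_revision` is a hypothetical helper name:

```python
from typing import Dict, Optional

# Placeholder entries for illustration only; a real map would be generated
# alongside each `appVersion` bump.
REVISION_TO_VERSION: Dict[str, str] = {
    "placeholder-rev-a": "3.0.0",
    "placeholder-rev-b": "3.1.0",
}


def version_for_revision(
    revision: Optional[str],
    mapping: Dict[str, str] = REVISION_TO_VERSION,
) -> Optional[str]:
    """Translate an alembic revision into an Airflow version string."""
    if revision is None:
        # No row in alembic_version: treat the database as uninitialised.
        return None
    if revision not in mapping:
        # An unknown revision is not the same as "no version", so fail loudly.
        raise LookupError(f"alembic revision {revision!r} is not in the chart's map")
    return mapping[revision]
```

Raising on an unknown revision is a deliberate choice in this sketch: a chart older than the running schema might not know newer revisions, and silently treating that as "no current version" would skip a needed downgrade.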
   
   ##### New values (opt-in, default off — zero impact for existing deployers)
   
   ```yaml
   migrateDatabaseJob:
     # ...existing fields...
     allowDowngrade: false            # opt-in; default off for backward compat
     downgrade:
       currentVersionSource: alembic-table   # one of: alembic-table | running-pod-image | running-pod-exec
       extraEnv: []
       resources: {...}
   ```
   
   ##### RBAC (opt-in only — only renders when `allowDowngrade: true`)
   
   New `ServiceAccount` `airflow-db-reconcile` with a `Role` scoped to the 
release namespace:
   
   - `pods, pods/exec` (verbs: `get`, `list`, `create`) — to run `airflow db 
downgrade` against the live api-server pod.
   - `deployments` (verbs: `get`, `list`) — to discover the current image, if 
`currentVersionSource: running-pod-image`.
   
   ##### Hook ordering and rollback safety
   
   | Phase | Hook | Action |
   |---|---|---|
   | `pre-upgrade` (weight -10) | `db-downgrade-job` | Detect direction; if downgrade, run it in the live api-server pod. Schema now matches target. |
   | (helm applies manifests) | (none) | New Deployment specs go in; rolling update pulls target image. |
   | `post-upgrade` (weight 1) | `migrate-database-job` (existing) | `airflow db migrate`: no-op if downgrade already aligned the schema; performs forward migration on upgrades. |
   
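   `helm rollback` could benefit too, but Helm fires `pre-rollback` / 
`post-rollback` hooks on rollback rather than the upgrade hooks. The Jobs would 
need those hook types as well for a rollback to an older `airflowVersion` to 
walk the schema back.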
   
   ##### Backward compatibility
   
   - Default `allowDowngrade: false` → behaviour identical to today for every 
existing user.
   - With the flag off, a `helm upgrade` to an older `airflowVersion` **fails 
fast** in the new pre-upgrade hook with a clear message — currently the deploy 
half-succeeds and leaves the cluster broken.
   - With the flag on, behaviour becomes bidirectional.
   
   ##### Test matrix to add under `chart/tests/`
   
   1. Same version → both jobs no-op
   2. Forward (current < target) → downgrade job exits 0; migrate job runs 
forward
   3. Backward, `allowDowngrade: false` → downgrade job exits non-zero with 
expected message; migrate job not reached
   4. Backward, `allowDowngrade: true` → downgrade job execs into the discovered 
pod and runs `airflow db downgrade --to-version <target> --yes`; migrate job 
runs forward (no-op)
   5. RBAC objects render only when `allowDowngrade: true`
   
   End-to-end (kind via `breeze k8s tests`):
   
   - Install 3.0.x → upgrade to 3.1.x → downgrade to 3.0.x → verify alembic 
head matches the 3.0.x branch tip and api-server starts.
   
   #### Why a chart hook rather than out-of-band tooling
   
   Every deployment path (GitHub Actions, ArgoCD, Flux, manual `helm upgrade`) 
hits the same problem, so the chart is the natural place to own the contract. 
Today each deployer has to maintain their own pre-helm script. The opt-in flag 
keeps the change safe: nothing happens unless you ask for it.
   
   ### Related issues
   
   I could not find any issue on apache/airflow that proposes chart-side 
support for downgrade. Closest relatives:
   
   - #55689 — *Not able to downgrade from AF3 to AF2 without FAB provider* 
(closed; about provider-side compatibility, not chart behaviour).
   - #63532 / #63535 — performance/correctness of specific downgrade migrations 
(orthogonal — those are about the migrations themselves working at all; this 
issue is about when/how the chart runs them).
   
   ### Are you willing to submit a PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

