jykae opened a new issue, #68072:
URL: https://github.com/apache/airflow/issues/68072
### Description
Today the official Airflow Helm chart's `migrateDatabaseJob` only runs
`airflow db migrate`, which moves the schema forward. A `helm upgrade` that
targets an **older** `airflowVersion` than the one currently running leaves the
metadata DB schema ahead of the image being rolled out, and the api-server pod
fails to start. The chart should reconcile the DB schema in **both** directions,
upgrade *and* downgrade, based on the requested `airflowVersion`, behind an
explicit opt-in.
### Use case/motivation
We operate Airflow on Kubernetes via this chart and ship to multiple
environments through CI. We need rollback to be a first-class operation:
- Deploying an older release tag (`helm upgrade` with an older
`airflowVersion`) should bring the cluster — schema included — back to that
release.
- Today this requires out-of-band tooling: detect the current and target
versions, `kubectl exec` into the still-running old api-server pod, and invoke
`airflow db downgrade --to-version <target> --yes` **before** helm starts
rolling out the new image. We've implemented this as a workflow step driven by
a small bash script, but every team using the chart has to duplicate that
logic, and it only helps people deploying via GitHub Actions, not ArgoCD or
manual helm.
A chart-native solution would mean: set `airflowVersion: <older>` in values,
run `helm upgrade`, the chart reconciles the schema, the new (older) pods come
up.
#### Hard constraint that shapes the design
`airflow db downgrade --to-version X.Y.Z` requires the alembic revision
scripts for **every revision between the current head and the target**. Each
script ships only in images of the version that **introduced** it or later, so
an older image lacks the revisions added after its release. So:
| Direction | Image that must run the operation | Why |
|---|---|---|
| Upgrade (current < target) | **Target** image | Forward revisions ship in the target image |
| Downgrade (current > target) | **Currently running** image | Reverse-direction code for the revisions to undo exists only in the current image |
| Same | none | No-op |
Today's `migrateDatabaseJob` is correct for the first row only. A
pre-upgrade hook running with `airflow_image_for_migrations` (the target image)
cannot perform a downgrade — the target image doesn't carry the scripts that
need to be reversed.
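
The table above can be sketched as a small decision function. This is only an illustration: `Direction`, `parse_version`, and `plan_migration` are hypothetical names, not chart or Airflow APIs, and the version parsing assumes plain `X.Y.Z` strings without pre-release suffixes.

```python
from __future__ import annotations

from enum import Enum
from typing import Optional


class Direction(Enum):
    UPGRADE = "upgrade"
    DOWNGRADE = "downgrade"
    SAME = "same"


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '3.1.0' into (3, 1, 0) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def plan_migration(current: str, target: str) -> tuple[Direction, Optional[str]]:
    """Return the direction and which image ('target' or 'current') must run it."""
    cur, tgt = parse_version(current), parse_version(target)
    if cur < tgt:
        # Forward revisions ship in the target image.
        return Direction.UPGRADE, "target"
    if cur > tgt:
        # Only the currently running image carries the revisions to undo.
        return Direction.DOWNGRADE, "current"
    return Direction.SAME, None
```

Comparing integer tuples rather than raw strings matters once a component reaches two digits, for example `3.10.0` versus `3.9.1`.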
#### Proposed design
Add a new opt-in **pre-upgrade** Job, `db-downgrade-job`, executing before
the existing `migrateDatabaseJob`:
```
helm.sh/hook    helm.sh/hook-weight    Job
─────────────   ────────────────────   ──────────────────────────────────────────
pre-upgrade     -10                    db-downgrade-job (NEW)
post-install,   1                      migrate-database-job (existing, unchanged)
post-upgrade
```
Job behaviour (pseudocode):
1. `target = .Values.airflowVersion`
2. `current = discover_current_version()`
3. if `current is None` or `current <= target`: exit 0
4. if not `migrateDatabaseJob.allowDowngrade`: exit non-zero with a clear
"downgrade required but disabled" message
5. discover the currently-running api-server (or scheduler) Deployment's pod
image — that is the image that wrote the current schema
6. `kubectl exec` into that pod and run `airflow db downgrade --to-version
<target> --yes`
7. exit with the downgrade's status
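
A minimal Python sketch of the steps above, assuming plain `X.Y.Z` version strings and that the pod name has already been discovered (step 5 is passed in as `pod`). The function names and the `run` parameter are hypothetical; `run` exists only so the command can be swapped out in tests.

```python
from __future__ import annotations

import subprocess
import sys
from typing import Callable, Optional, Sequence


def _as_tuple(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))


def downgrade_command(pod: str, namespace: str, target: str) -> list[str]:
    """Build the kubectl invocation from step 6."""
    return [
        "kubectl", "exec", "-n", namespace, pod, "--",
        "airflow", "db", "downgrade", "--to-version", target, "--yes",
    ]


def run_downgrade_job(
    target: str,
    current: Optional[str],
    allow_downgrade: bool,
    pod: str,
    namespace: str,
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> int:
    """Return the exit code the Job should finish with."""
    # Step 3: fresh install, same version, or forward upgrade -> nothing to do.
    if current is None or _as_tuple(current) <= _as_tuple(target):
        return 0
    # Step 4: a downgrade is needed but the operator has not opted in.
    if not allow_downgrade:
        print(
            f"downgrade required but disabled ({current} -> {target})",
            file=sys.stderr,
        )
        return 1
    # Steps 6-7: run the downgrade in the live pod and propagate its status.
    return run(downgrade_command(pod, namespace, target))
```

Returning the exit code instead of calling `sys.exit` directly keeps the decision logic testable without a cluster.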
Discovery options for `current` (we'd pick one — preference: alembic-table
to avoid needing more RBAC for discovery):
- Query `alembic_version` table + ship a small alembic-rev → Airflow-version
map alongside `appVersion` bumps.
- Read `Deployment/<release>-api-server` pod spec image (requires
`deployments.get`).
- `kubectl exec -- airflow version` on the running pod (requires
`pods/exec`).
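
To make the first two discovery options concrete, here is a rough sketch. The revision IDs in `REVISION_TO_VERSION` are placeholders rather than real alembic revisions, and both helper names are hypothetical. The image-tag variant assumes the tag is the bare Airflow version, which will not hold for custom or suffixed tags.

```python
from __future__ import annotations

from typing import Mapping, Optional

# Placeholder data: a real map would be generated alongside appVersion bumps.
REVISION_TO_VERSION: dict[str, str] = {
    "aaaa00000001": "3.0.0",
    "bbbb00000002": "3.1.0",
}


def version_from_alembic(
    revision: str, rev_map: Mapping[str, str] = REVISION_TO_VERSION
) -> Optional[str]:
    """Map the value read from the alembic_version table to an Airflow version."""
    return rev_map.get(revision)


def version_from_image(image: str) -> Optional[str]:
    """Extract the tag from 'repo/name:tag'; None if untagged or digest-pinned."""
    if "@" in image:
        # Digest-pinned references carry no usable version tag.
        return None
    _, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:
        # No tag at all, or the colon belonged to a registry port.
        return None
    return tag
```

An unknown revision yields `None`, which the Job could treat either as "cannot determine, skip" or as a hard failure; that choice is left open here.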
##### New values (opt-in, default off — zero impact for existing deployers)
```yaml
migrateDatabaseJob:
  # ...existing fields...
  allowDowngrade: false  # opt-in; default off for backward compat
  downgrade:
    # one of: alembic-table | running-pod-image | running-pod-exec
    currentVersionSource: alembic-table
    extraEnv: []
    resources: {...}
```
##### RBAC (opt-in only — only renders when `allowDowngrade: true`)
New `ServiceAccount` `airflow-db-reconcile` with a `Role` scoped to the
release namespace:
- `pods, pods/exec` (verbs: `get`, `list`, `create`) — to run `airflow db
downgrade` against the live api-server pod.
- `deployments` (verbs: `get`, `list`) — to discover the current image, if
`currentVersionSource: running-pod-image`.
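
The conditional rendering could look roughly like the following, written as a Python function that returns the `rules` list of the Role. The function name is hypothetical and the real chart would express this in a Helm template; the verbs mirror the list above, with `deployments` placed in the `apps` API group.

```python
from __future__ import annotations


def downgrade_role_rules(allow_downgrade: bool, current_version_source: str) -> list[dict]:
    """Return the Role rules to render, or an empty list when the feature is off."""
    if not allow_downgrade:
        # Nothing is rendered unless the operator opts in.
        return []
    rules = [
        {
            "apiGroups": [""],  # core API group
            "resources": ["pods", "pods/exec"],
            "verbs": ["get", "list", "create"],
        }
    ]
    if current_version_source == "running-pod-image":
        # Only this discovery mode needs to read Deployments.
        rules.append(
            {
                "apiGroups": ["apps"],
                "resources": ["deployments"],
                "verbs": ["get", "list"],
            }
        )
    return rules
```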
##### Hook ordering and rollback safety
| Phase | Hook | Action |
|---|---|---|
| `pre-upgrade` (weight -10) | `db-downgrade-job` | Detect direction; if downgrade, run it in the live api-server pod. Schema now matches target. |
| (helm applies manifests) | n/a | New Deployment specs go in; rolling update pulls target image. |
| `post-upgrade` (weight 1) | `migrate-database-job` (existing) | `airflow db migrate`: no-op if the downgrade already aligned the schema; performs forward migration on upgrades. |
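`helm rollback` can benefit too, but Helm fires `pre-rollback` and
`post-rollback` hooks on rollback rather than the upgrade hooks, so both Jobs
would also need those hook annotations for a rollback to an older
`airflowVersion` to walk the schema back.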
##### Backward compatibility
- Default `allowDowngrade: false` → behaviour identical to today for every
existing user.
- With the flag off, a `helm upgrade` to an older `airflowVersion` **fails
fast** in the new pre-upgrade hook with a clear message — currently the deploy
half-succeeds and leaves the cluster broken.
- With the flag on, behaviour becomes bidirectional.
##### Test matrix to add under `chart/tests/`
1. Same version → both jobs no-op
2. Forward (current < target) → downgrade job exits 0; migrate job runs
forward
3. Backward, `allowDowngrade: false` → downgrade job exits non-zero with
expected message; migrate job not reached
4. Backward, `allowDowngrade: true` → downgrade job execs into the discovered
pod with `airflow db downgrade --to-version <target> --yes`; migrate job runs
forward (no-op)
5. RBAC objects render only when `allowDowngrade: true`
End-to-end (kind via `breeze k8s tests`):
- Install 3.0.x → upgrade to 3.1.x → downgrade to 3.0.x → verify alembic
head matches the 3.0.x branch tip and api-server starts.
#### Why a chart hook rather than out-of-band tooling
Because every deployment path (GitHub Actions, ArgoCD, Flux, manual `helm
upgrade`) hits the same problem, the chart is the right single place to own the
contract.
The current state forces each deployer to maintain their own pre-helm script.
The opt-in flag keeps the change safe: nothing happens unless you ask for it.
### Related issues
None I could find on apache/airflow that propose chart-side support for
downgrade. Closest relatives:
- #55689 — *Not able to downgrade from AF3 to AF2 without FAB provider*
(closed; about provider-side compatibility, not chart behaviour).
- #63532 / #63535 — performance/correctness of specific downgrade migrations
(orthogonal — those are about the migrations themselves working at all; this
issue is about when/how the chart runs them).
### Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.