srchilukoori opened a new issue, #66511:
URL: https://github.com/apache/airflow/issues/66511
### Under which category would you file this issue?
Helm chart
### Apache Airflow version
main (3.0.0.dev0) — affects CI K8S system tests
### What happened and how to reproduce it?
**Problem**
K8S system tests fail intermittently when the Docker Hub anonymous-pull rate
limit is exhausted. The Helm chart's postgresql subchart uses
`bitnamilegacy/postgresql:16.1.0-debian-11-r15`, which containerd inside Kind
pulls at pod scheduling time, unauthenticated and without retry. Once the
runner IP's quota of 100 pulls per 6 hours is spent, PostgreSQL never starts
and every Airflow pod enters CrashLoopBackOff while waiting for DB migrations.
PR #66423 added `K8S_TEST_IMAGES_TO_PRELOAD` to address this class of flake
for the `alpine`, `busybox`, and `ubuntu` images, but it did not include the
postgresql image, which is the most critical one because every Airflow
component depends on it.
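
To check how much of the anonymous quota a runner IP has left, something like the following could be run on the runner. It is a minimal sketch, not part of the Airflow CI: the token and manifest URLs and the `ratelimit-*` header names follow Docker's documented `ratelimitpreview/test` probe as I understand it and should be verified before use; the function names are hypothetical.

```python
import json
import urllib.request

# Assumed endpoints: Docker's rate-limit probe repository (verify against Docker docs).
TOKEN_URL = (
    "https://auth.docker.io/token"
    "?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull"
)
MANIFEST_URL = "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest"


def parse_ratelimit(value: str) -> tuple[int, int]:
    """Parse a header value such as '100;w=21600' into (count, window_seconds)."""
    count, _, window = value.partition(";w=")
    return int(count), int(window)


def anonymous_pull_quota() -> dict[str, tuple[int, int]]:
    """Return the anonymous pull limit and remaining pulls for the caller's IP."""
    with urllib.request.urlopen(TOKEN_URL, timeout=30) as resp:
        token = json.load(resp)["token"]
    # HEAD reads the headers only; a GET would behave like a real pull.
    request = urllib.request.Request(
        MANIFEST_URL, method="HEAD", headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return {
            "limit": parse_ratelimit(resp.headers["ratelimit-limit"]),
            "remaining": parse_ratelimit(resp.headers["ratelimit-remaining"]),
        }


if __name__ == "__main__":
    print(anonymous_pull_quota())
```

A `remaining` count near zero on a shared runner would be consistent with the 429 errors reported below. The sketch assumes both headers are present, which may not hold for authenticated or unlimited accounts.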
**How to reproduce**
The failure is non-deterministic: it depends on how many CI jobs share the
runner IP within Docker Hub's 6-hour window. Evidence from two unrelated PRs
and one main-branch run:
1. PR #66420, a **one-line comment change** to `k8s-tests.yml` (cannot
   cause functional failure):
   - 5/6 K8S system test jobs passed, 1 failed
     (`KubernetesExecutor-3.10-v1.30.13-true`)
   - Same executor+python+K8S version as a passing job
     (`KubernetesExecutor-3.10-v1.30.13-false` passed)
   - Error:
     ```
     ErrImagePull: failed to pull and unpack image
     "docker.io/bitnamilegacy/postgresql:16.1.0-debian-11-r15":
     429 Too Many Requests - Server message: toomanyrequests: You have reached
     your unauthenticated pull rate limit.
     ```
2. PR #65840, sphinx theme workspace (no K8S code changes):
   - 35/36 K8S system test jobs passed, 1 failed
     (`CeleryExecutor-3.11-v1.31.12-true`)
   - Failed at the very first "Cleanup repo" step (`docker run bash`) before
     any test code ran:
     ```
     docker: Error response from daemon: Head
     "https://registry-1.docker.io/v2/library/bash/manifests/latest":
     net/http: TLS handshake timeout
     ```
3. Main branch run 25461521992 (same day): all 6 K8S jobs passed,
   confirming the failure is non-deterministic, not a regression.
### What you think should happen instead?
The `bitnamilegacy/postgresql:16.1.0-debian-11-r15` image should be included
in `K8S_TEST_IMAGES_TO_PRELOAD` (added by PR #66423). The mechanism already
exists:
1. Host-side `docker pull` with retry-on-429
2. `kind load docker-image` into cluster nodes
3. Kubelet finds the image locally and skips the pull
   (`imagePullPolicy: IfNotPresent`, since the tag is pinned rather than
   `latest`)
This is the same proven pattern that already protects `alpine:3.23`,
`busybox:1.37`, and `ubuntu:24.04`.
**Fix PR:** #66507 (all 6 K8S system tests pass with this change)
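
For readers unfamiliar with the pattern, the three steps above can be sketched roughly as follows. This is an illustration only, not the code from PR #66423 or #66507: the function names, retry count, backoff values, and cluster name are assumptions, and it detects rate limiting simply by looking for `429` or `toomanyrequests` in the `docker pull` error output.

```python
import subprocess
import time


def pull_with_retry(image, attempts=5, base_delay=10.0,
                    run=subprocess.run, sleep=time.sleep):
    """Pull `image` on the host, backing off when Docker Hub answers 429."""
    for attempt in range(1, attempts + 1):
        result = run(["docker", "pull", image], capture_output=True, text=True)
        if result.returncode == 0:
            return
        rate_limited = "toomanyrequests" in result.stderr or "429" in result.stderr
        # Give up immediately on other errors, or when retries are exhausted.
        if not rate_limited or attempt == attempts:
            raise RuntimeError(f"docker pull {image} failed: {result.stderr.strip()}")
        sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


def preload(image, cluster, run=subprocess.run, sleep=time.sleep):
    """Pull on the host, then copy the image into the Kind cluster's nodes."""
    pull_with_retry(image, run=run, sleep=sleep)
    run(["kind", "load", "docker-image", image, "--name", cluster], check=True)


if __name__ == "__main__":
    # "airflow-k8s-test" is a hypothetical cluster name.
    preload("bitnamilegacy/postgresql:16.1.0-debian-11-r15", "airflow-k8s-test")
```

Once the image is present on the nodes, step 3 needs no code: the kubelet starts the pod from the local copy.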
### Operating System
Ubuntu (GitHub Actions runner)
### Deployment
Official Apache Airflow Helm Chart
### Apache Airflow Provider(s)
_No response_
### Versions of Apache Airflow Providers
_No response_
### Official Helm Chart version
main (development)
### Kubernetes Version
v1.30.13, v1.31.12 (both observed failing)
### Helm Chart configuration
Default `chart/values.yaml`:
```yaml
postgresql:
  enabled: true
  image:
    repository: bitnamilegacy/postgresql
    tag: "16.1.0-debian-11-r15"
```
### Docker Image customizations
Not Applicable
### Anything else?
**Frequency:** Intermittent. About 1 in 6 K8S jobs fails per run when the
runner is rate-limited.
**Separate issue — `bash:latest` in "Cleanup repo" step:**
The K8S workflow (and 10+ other workflows) runs
`docker run ... bash -c "rm -rf /workspace/*"` as its first step, which pulls
`library/bash:latest` from Docker Hub unauthenticated. The TLS timeout in
PR #65840 hit this step. This is a broader problem, not specific to K8S, and
should be tracked separately. One possible fix is to replace the container
with `sudo rm -rf` in a plain shell step.
**Related:**
- PR #66423 — Added `K8S_TEST_IMAGES_TO_PRELOAD` mechanism (merged)
- PR #66507 — Adds postgresql to the pre-load list (all 6 K8S jobs green)
- Issue #56322 — Deployment fails with `bitnamilegacy/postgresql`
(user-facing, same image)
### Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)