srchilukoori opened a new issue, #66511:
URL: https://github.com/apache/airflow/issues/66511

   ### Under which category would you file this issue?
   
   Helm chart
   
   ### Apache Airflow version
   
   main (3.0.0.dev0) — affects CI K8S system tests
   
   ### What happened and how to reproduce it?
   
   **Problem**
   
   K8S system tests fail intermittently when Docker Hub anonymous-pull rate 
limits are exhausted. The Helm chart's postgresql subchart uses 
`bitnamilegacy/postgresql:16.1.0-debian-11-r15`, which is pulled by containerd 
inside Kind at pod scheduling time — unauthenticated and without retry. When 
the runner IP's 100-pull/6h quota is spent, PostgreSQL never starts and all 
Airflow pods enter CrashLoopBackOff waiting for DB migrations.
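
One way to check how close a runner is to the limit is to read the rate-limit headers Docker Hub returns for anonymous requests. The sketch below assumes the `ratelimit-limit` and `ratelimit-remaining` header format described in Docker's rate-limit documentation (`<count>;w=<window seconds>`) and its `ratelimitpreview/test` probe repository. The helper names are hypothetical and this is not part of the Airflow CI scripts.

```python
import json
import urllib.request
from typing import Dict, Tuple

# Endpoints as described in Docker's rate-limit documentation (assumed here).
TOKEN_URL = (
    "https://auth.docker.io/token"
    "?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull"
)
MANIFEST_URL = "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest"


def parse_ratelimit(value: str) -> Tuple[int, int]:
    """Parse a header value such as '100;w=21600' into (count, window_seconds)."""
    count, _, rest = value.partition(";")
    window = 0  # 0 means the header carried no window parameter
    for part in rest.split(";"):
        key, _, val = part.strip().partition("=")
        if key == "w":
            window = int(val)
    return int(count), window


def fetch_anonymous_quota() -> Dict[str, Tuple[int, int]]:
    """Query the anonymous pull quota seen from this IP (requires network access)."""
    with urllib.request.urlopen(TOKEN_URL, timeout=30) as resp:
        token = json.load(resp)["token"]
    # A HEAD request is used so that the probe itself is not a pull.
    request = urllib.request.Request(
        MANIFEST_URL, method="HEAD", headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return {
            name: parse_ratelimit(resp.headers[name])
            for name in ("ratelimit-limit", "ratelimit-remaining")
            if resp.headers.get(name) is not None
        }
```

For example, `parse_ratelimit("100;w=21600")` yields `(100, 21600)`, which corresponds to the 100-pull/6h quota described above.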
   
   PR #66423 added `K8S_TEST_IMAGES_TO_PRELOAD` to address this class of flake 
for `alpine`, `busybox`, and `ubuntu` images, but the postgresql image — the 
most critical one since all Airflow components depend on it — was not included.
   
   **How to reproduce**
   
   Non-deterministic. Whether a run fails depends on how many CI jobs share the
runner IP within Docker Hub's 6-hour window. Evidence from two unrelated PRs
and one main-branch run:
   
   1. PR #66420 — a **one-line comment change** to `k8s-tests.yml` (cannot 
cause functional failure):
      - 5 of 6 K8S system test jobs passed; 1 failed
(`KubernetesExecutor-3.10-v1.30.13-true`)
      - The sibling job with the same executor, Python, and Kubernetes version
(`KubernetesExecutor-3.10-v1.30.13-false`) passed
      - Error:
      ```
      ErrImagePull: failed to pull and unpack image "docker.io/bitnamilegacy/postgresql:16.1.0-debian-11-r15":
      429 Too Many Requests - Server message: toomanyrequests: You have reached your unauthenticated pull rate limit.
      ```
   
   2. PR #65840 — sphinx theme workspace (no K8S code changes):
      - 35/36 K8S system test jobs passed, 1 failed 
(`CeleryExecutor-3.11-v1.31.12-true`)
      - Failed at the very first "Cleanup repo" step (`docker run bash`) before 
any test code ran:
      ```
      docker: Error response from daemon: Head "https://registry-1.docker.io/v2/library/bash/manifests/latest":
      net/http: TLS handshake timeout
      ```
   
   3. Main branch run 25461521992 (same day): all 6 K8S jobs passed — 
confirming the failure is non-deterministic, not a regression.
   
   ### What you think should happen instead?
   
   The `bitnamilegacy/postgresql:16.1.0-debian-11-r15` image should be included 
in `K8S_TEST_IMAGES_TO_PRELOAD` (added by PR #66423). The mechanism already 
exists:
   1. Host-side `docker pull` with retry-on-429
   2. `kind load docker-image` into cluster nodes
   3. Kubelet finds the image on the node and skips the registry pull
(`imagePullPolicy` defaults to `IfNotPresent` because the tag is pinned rather
than `latest`)
   
   This is the same proven pattern that already protects `alpine:3.23`, 
`busybox:1.37`, and `ubuntu:24.04`.
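
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from PR #66423: the function names, retry count, and backoff values are assumptions, and the rate-limit markers are simply substrings of the 429 error quoted earlier in this issue. Only `docker pull` and `kind load docker-image ... --name` are real CLI invocations.

```python
import subprocess
import time
from typing import Callable, Sequence, Tuple

# Substrings taken from the 429 error message quoted in this issue.
RATE_LIMIT_MARKERS = ("toomanyrequests", "429 Too Many Requests")


def run_command(cmd: Sequence[str]) -> Tuple[int, str]:
    """Run a command and return (exit code, combined stdout and stderr)."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True, check=False)
    return proc.returncode, proc.stdout + proc.stderr


def preload_image(
    image: str,
    cluster: str,
    run: Callable[[Sequence[str]], Tuple[int, str]] = run_command,
    attempts: int = 5,
    base_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Pull `image` on the host, retrying on rate limits, then load it into Kind."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        code, output = run(["docker", "pull", image])
        if code == 0:
            break
        rate_limited = any(marker in output for marker in RATE_LIMIT_MARKERS)
        # Give up on non-rate-limit errors or once the attempts are used up.
        if not rate_limited or attempt == attempts:
            raise RuntimeError(f"docker pull {image} failed: {output.strip()}")
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
    # Copy the host-side image into the Kind cluster nodes.
    code, output = run(["kind", "load", "docker-image", image, "--name", cluster])
    if code != 0:
        raise RuntimeError(f"kind load failed for {image}: {output.strip()}")
```

The `run` and `sleep` parameters are injectable so the retry logic can be exercised without Docker or Kind installed.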
   
   **Fix PR:** #66507 (all 6 K8S system tests pass with this change)
   
   ### Operating System
   
   Ubuntu (GitHub Actions runner)
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Apache Airflow Provider(s)
   
   _No response_
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Official Helm Chart version
   
   main (development)
   
   ### Kubernetes Version
   
   v1.30.13, v1.31.12 (both observed failing)
   
   ### Helm Chart configuration
   
   Default `chart/values.yaml`:
   ```yaml
   postgresql:
     enabled: true
     image:
       repository: bitnamilegacy/postgresql
       tag: "16.1.0-debian-11-r15"
   ```
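
Because the tag above is pinned, a pre-loaded copy of the image is enough for kubelet. The sketch below mirrors the defaulting rule described in the Kubernetes documentation for containers that omit `imagePullPolicy`. The function name is hypothetical, and an explicit pull policy set by a chart would override this default.

```python
def default_image_pull_policy(image: str) -> str:
    """Return the policy Kubernetes applies when imagePullPolicy is omitted."""
    # Digest-pinned references default to IfNotPresent
    # (references carrying both a tag and a digest are not modelled here).
    if "@" in image:
        return "IfNotPresent"
    # Only the last path component can carry a tag; an earlier colon
    # belongs to a registry port such as localhost:5000.
    last_component = image.rsplit("/", 1)[-1]
    if ":" in last_component:
        tag = last_component.rsplit(":", 1)[1]
    else:
        tag = "latest"  # a missing tag is treated as latest
    return "Always" if tag == "latest" else "IfNotPresent"
```

Under this rule, `bitnamilegacy/postgresql:16.1.0-debian-11-r15` resolves to `IfNotPresent`, while an untagged `bash` resolves to `Always` and would still reach the registry.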
   
   ### Docker Image customizations
   
   Not Applicable
   
   ### Anything else?
   
   **Frequency:** Intermittent — observed ~1 out of 6 K8S jobs failing per run 
when rate-limited.
   
   **Separate issue — `bash:latest` in "Cleanup repo" step:**
   
   The K8S workflow (and 10+ other workflows) runs
`docker run ... bash -c "rm -rf /workspace/*"` as its first step, which pulls
`library/bash:latest` from Docker Hub unauthenticated. The TLS handshake timeout
in PR #65840 hit this step. This is a broader problem (not K8S-specific) and
should be tracked separately. One possible fix is to replace the container with
a plain `sudo rm -rf` shell step that needs no image pull.
   
   **Related:**
   - PR #66423 — Added `K8S_TEST_IMAGES_TO_PRELOAD` mechanism (merged)
   - PR #66507 — Adds postgresql to the pre-load list (all 6 K8S jobs green)
   - Issue #56322 — Deployment fails with `bitnamilegacy/postgresql` 
(user-facing, same image)
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.