Vamsi-klu opened a new pull request, #61935:
URL: https://github.com/apache/airflow/pull/61935

   ## Summary
   - Pre-create the `~/.aws/cli/cache` directory in `KubernetesHook.get_conn()` to prevent a `FileExistsError` race condition when multiple KubernetesPodOperator (KPO) tasks authenticate via `aws eks get-token` concurrently on the same Celery worker
   - Older botocore versions (<1.40.2) call `os.makedirs()` without `exist_ok=True` when creating that cache directory, causing intermittent task failures before pod creation
   
   ## Root Cause
   When parallel KubernetesPodOperator tasks invoke exec-based EKS authentication on the same worker, the AWS CLI races to create `~/.aws/cli/cache`. The losing process gets `FileExistsError` (errno 17); the Kubernetes client swallows that exec-plugin failure and proceeds with a bad token, so the error surfaces as a **403 Forbidden** from the Kubernetes API and the task fails before the pod is even created.
   
   Fixed upstream in [botocore 1.40.2](https://github.com/boto/botocore/commit/f1c1bc90aa292b42195edecf4cf35ae348e6cc37), but this defensive fix protects users on older versions.
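
   For illustration, the racy pattern (and the safe replacement) looks roughly like the snippet below; it paraphrases the behavior rather than quoting botocore's source:

   ```python
   import os

   cache_dir = os.path.expanduser("~/.aws/cli/cache")

   # Pre-1.40.2 botocore effectively does a check-then-create:
   # two processes can both see the directory missing, and the second
   # makedirs() call raises FileExistsError (errno 17).
   if not os.path.isdir(cache_dir):   # racy window opens here
       os.makedirs(cache_dir)         # the losing process raises FileExistsError

   # The race-free form (what botocore 1.40.2 and this PR use):
   os.makedirs(cache_dir, exist_ok=True)  # idempotent, safe to call concurrently
   ```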
   
   ## Why this approach (and not something else)
   
   We considered several alternatives before landing on defensive directory 
pre-creation:
   
   | Approach | Why we rejected it |
   |----------|-------------------|
   | **Retry on 403 in `generic_api_retry`** | 403 is normally a permanent permissions error. Adding it to `TRANSIENT_STATUS_CODES` would mask real auth failures and add retry latency to every legitimate 403. Distinguishing transient exec-auth 403s from real permission denials is not reliably possible — the Kubernetes client's ExecProvider silently swallows the subprocess error and proceeds with a bad token, so the 403 looks identical to a genuine RBAC denial. |
   | **`threading.Lock` around config loading** | The exec plugin (`aws eks get-token`) runs *lazily* during the first API call, not during `config.load_kube_config()`. A lock around config loading wouldn't prevent the race. Locking around every API call would serialize all K8s operations — unacceptable for performance. |
   | **Parse kubeconfig to detect exec-based auth** | Over-engineered for a one-line fix. Would add complexity, fragile YAML parsing, and still need per-tool knowledge of which cache dirs to create. |
   | **Pin `botocore >= 1.40.2` as a dependency** | The Kubernetes provider has no direct dependency on botocore and shouldn't. AWS is just one of many possible exec-based auth backends. |
   | **Documentation-only (recommend botocore upgrade)** | Doesn't help users who can't control their botocore version (e.g., managed Airflow platforms like Astronomer). |
   
   **Why pre-creation wins:**
   - It's a single `os.makedirs(..., exist_ok=True)` call — the exact same fix 
botocore 1.40.2 applied, just done earlier in the call chain
   - `exist_ok=True` is inherently safe for concurrent invocations — no race 
between our pre-creation and the AWS CLI
   - Zero performance overhead (one syscall, idempotent)
   - Zero risk of masking real errors — we don't change retry behavior or error 
handling
   - Protects all users regardless of their botocore version
   
   ### Changes
   - **`hooks/kubernetes.py`**: Added a `_ensure_exec_plugin_cache_dirs()` function, called from `get_conn()` before any kube config loading, that uses `os.makedirs(..., exist_ok=True)` to pre-create the cache directory (a minimal sketch follows this list).
   - **`test_kubernetes.py`**: 3 new test cases verifying directory creation, 
idempotency, and integration with `get_conn()`.
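
   The shape of the change is roughly as follows; the helper name comes from this PR, while the `_EXEC_PLUGIN_CACHE_DIRS` tuple and the exact wiring into `get_conn()` are simplified assumptions for illustration:

   ```python
   import os

   # Hypothetical module-level tuple; only the AWS CLI cache is known to race today.
   _EXEC_PLUGIN_CACHE_DIRS = ("~/.aws/cli/cache",)


   def _ensure_exec_plugin_cache_dirs() -> None:
       """Pre-create cache dirs used by exec-based auth plugins (e.g. `aws eks get-token`)."""
       for cache_dir in _EXEC_PLUGIN_CACHE_DIRS:
           # exist_ok=True is idempotent: safe whether the directory already
           # exists or another process creates it first.
           os.makedirs(os.path.expanduser(cache_dir), exist_ok=True)


   # In KubernetesHook.get_conn(), before any kube config loading:
   #     _ensure_exec_plugin_cache_dirs()
   #     config.load_kube_config(...)
   ```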
   
   closes: #60943
   
   ## Test plan
   - [x] New unit tests verify directory creation, idempotency, and integration (an idempotency-style check is sketched after this list)
   - [ ] Manual: Run parallel KPO tasks on same Celery worker with EKS auth and 
botocore < 1.40.2
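
   A minimal sketch of the idempotency check, assuming the helper lives alongside `KubernetesHook` in `hooks/kubernetes.py` and resolves `~` via `os.path.expanduser` (the actual tests in this PR may be structured differently):

   ```python
   from unittest import mock

   from airflow.providers.cncf.kubernetes.hooks.kubernetes import (
       _ensure_exec_plugin_cache_dirs,
   )


   def test_exec_plugin_cache_dir_created_idempotently(tmp_path):
       target = tmp_path / ".aws" / "cli" / "cache"
       # Redirect "~" expansion into the pytest tmp_path sandbox.
       with mock.patch("os.path.expanduser", return_value=str(target)):
           _ensure_exec_plugin_cache_dirs()
           assert target.is_dir()
           # A second call must be a no-op, not a FileExistsError.
           _ensure_exec_plugin_cache_dirs()
       assert target.is_dir()
   ```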
   
   > **Note to users**: Upgrading to `botocore >= 1.40.2` also resolves this at 
the source. This fix provides a safety net for environments that cannot upgrade 
immediately.

