Vamsi-klu opened a new pull request, #61935: URL: https://github.com/apache/airflow/pull/61935
## Summary

- Pre-create the `~/.aws/cli/cache` directory in `KubernetesHook.get_conn()` to prevent a `FileExistsError` race condition when multiple KPO tasks authenticate via `aws eks get-token` concurrently on the same Celery worker
- Older botocore versions (<1.40.2) call `os.makedirs()` without `exist_ok=True`, causing intermittent task failures before pod creation

## Root Cause

When parallel KubernetesPodOperator tasks invoke exec-based EKS authentication on the same worker, the AWS CLI processes race to create `~/.aws/cli/cache`. The losing process gets `FileExistsError` (errno 17), which surfaces as a **403 Forbidden** from the Kubernetes API, so the task fails before the pod is even created.

This is fixed upstream in [botocore 1.40.2](https://github.com/boto/botocore/commit/f1c1bc90aa292b42195edecf4cf35ae348e6cc37), but this defensive fix protects users on older versions.

## Why this approach (and not something else)

We considered several alternatives before landing on defensive directory pre-creation:

| Approach | Why we rejected it |
|----------|-------------------|
| **Retry on 403 in `generic_api_retry`** | 403 is normally a permanent permissions error. Adding it to `TRANSIENT_STATUS_CODES` would mask real auth failures and add retry latency to every legitimate 403. Distinguishing transient exec-auth 403s from real permission denials is not reliably possible: the Kubernetes client's ExecProvider silently swallows the subprocess error and proceeds with a bad token, so the 403 looks identical to a genuine RBAC denial. |
| **`threading.Lock` around config loading** | The exec plugin (`aws eks get-token`) runs *lazily* during the first API call, not during `config.load_kube_config()`. A lock around config loading wouldn't prevent the race. Locking around every API call would serialize all K8s operations, which is unacceptable for performance. |
| **Parse kubeconfig to detect exec-based auth** | Over-engineered for a one-line fix. Would add complexity and fragile YAML parsing, and would still need per-tool knowledge of which cache dirs to create. |
| **Pin `botocore >= 1.40.2` as a dependency** | The Kubernetes provider has no direct dependency on botocore and shouldn't. AWS is just one of many possible exec-based auth backends. |
| **Documentation-only (recommend botocore upgrade)** | Doesn't help users who can't control their botocore version (e.g., managed Airflow platforms like Astronomer). |

**Why pre-creation wins:**

- It's a single `os.makedirs(..., exist_ok=True)` call, the exact same fix botocore 1.40.2 applied, just done earlier in the call chain
- `exist_ok=True` is inherently safe for concurrent invocations, so there is no race between our pre-creation and the AWS CLI
- Negligible performance overhead (one idempotent `os.makedirs` call)
- Zero risk of masking real errors, because we don't change retry behavior or error handling
- Protects all users regardless of their botocore version

### Changes

- **`hooks/kubernetes.py`**: Added an `_ensure_exec_plugin_cache_dirs()` function, called from `get_conn()` before any kube config loading. It uses `os.makedirs(..., exist_ok=True)` to pre-create the cache directory.
- **`test_kubernetes.py`**: 3 new test cases verifying directory creation, idempotency, and integration with `get_conn()`.

closes: #60943

## Test plan

- [x] New unit tests verify directory creation, idempotency, and integration
- [ ] Manual: Run parallel KPO tasks on the same Celery worker with EKS auth and botocore < 1.40.2

> **Note to users**: Upgrading to `botocore >= 1.40.2` also resolves this at the source. This fix provides a safety net for environments that cannot upgrade immediately.
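For readers who want to see the idea in isolation, here is a minimal sketch of the pre-creation technique. It is not the code from this PR: `ensure_exec_plugin_cache_dirs`, its `home` parameter, and `legacy_makedirs_raises` are hypothetical names chosen for illustration, and whether the real helper catches `OSError` is not stated above, so this sketch simply lets errors propagate.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Optional


def ensure_exec_plugin_cache_dirs(home: Optional[Path] = None) -> Path:
    """Pre-create the AWS CLI cache directory (hypothetical stand-in helper).

    The `home` parameter is an assumption added so the sketch can be
    exercised against a temporary directory instead of the real home.
    """
    base = home if home is not None else Path.home()
    cache_dir = base / ".aws" / "cli" / "cache"
    # exist_ok=True makes the call idempotent: if another thread or process
    # created the directory first, no FileExistsError is raised.
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir


def legacy_makedirs_raises(path: Path) -> bool:
    """Mimic the older behavior described above: makedirs without exist_ok.

    Returns True if the call raised FileExistsError, False otherwise.
    """
    try:
        os.makedirs(path)
    except FileExistsError:
        return True
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fake_home = Path(tmp)
        # Several concurrent callers all succeed and agree on the same path.
        with ThreadPoolExecutor(max_workers=8) as pool:
            results = list(
                pool.map(lambda _: ensure_exec_plugin_cache_dirs(fake_home), range(32))
            )
        print(len(set(results)), results[0].is_dir())
        # Once the directory exists, the legacy-style call fails.
        print(legacy_makedirs_raises(results[0]))
```

Because the directory already exists by the time an older AWS CLI process reaches its own `os.makedirs()` call, the create-if-missing branch that loses the race should never be entered, which is the behavior the unit tests in this PR are described as checking.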
