benvit92 edited a comment on issue #18041:
URL: https://github.com/apache/airflow/issues/18041#issuecomment-945741654
Hello all,
not sure if this is helpful, but here is what we found.
We had the same issue with Airflow 2.1.0 deployed on AKS (Azure k8s). We
also noticed that a lot of pods were not being cleaned up after completion
(Success, Error, or even CrashLoopBackOff without loading the correct pod
template). While searching the scheduler logs for tasks that had received
SIGTERM, I noticed the following entries:
```
{"timestamp": "2021-10-18T10:11:11.363282Z", "level": "INFO", "name":
"airflow.executors.kubernetes_executor.KubernetesExecutor", "message": "Failed
to adopt pod cbtbpartyczpartybctranslation.fa5c434191a048cdb9ad9aa747e0f3e9.
Reason: (403)\nReason: Forbidden\nHTTP response headers:
HTTPHeaderDict({'Audit-Id': '427c0455-2a35-4eec-bd15-b2a2e1d82639',
'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json',
'X-Content-Type-Options': 'nosniff', 'Date': 'Mon, 18 Oct 2021 10:11:11 GMT',
'Content-Length': '421'})\nHTTP response body:
{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"pods
\\\"cbtbpartyczpartybctranslation.fa5c434191a048cdb9ad9aa747e0f3e9\\\" is
forbidden: User \\\"system:serviceaccount:cbdp:airflow-cbdp\\\" cannot patch
resource \\\"pods\\\" in API group \\\"\\\" in the namespace
\\\"cbdp\\\"\",\"reason\":\"Forbidden\",\"details\":{\"name\":\"cbtbpartyczpartybctranslation.fa5c434191a048cdb9ad9aa747e0f3e9\",\"kind\":\"pods\"},\"code\":403}\n\n"}
```
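In case anyone wants to check their own scheduler logs for the same symptom, here is a minimal Python sketch. It assumes the logs are JSON, one object per line, with a "message" field shaped like the entry above; the function name `find_adoption_failures` is hypothetical and not part of Airflow.

```python
import json
import re

# Matches the message format seen in the entry above:
# "Failed to adopt pod <pod name>. Reason: (<HTTP status>)"
ADOPT_FAILURE = re.compile(
    r"Failed to adopt pod (?P<pod>\S+)\. Reason: \((?P<status>\d+)\)"
)


def find_adoption_failures(lines):
    """Yield (timestamp, pod, http_status) for each failed pod adoption."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not log records
        match = ADOPT_FAILURE.search(str(record.get("message", "")))
        if match:
            yield (
                record.get("timestamp"),
                match.group("pod"),
                int(match.group("status")),
            )
```

Passing an open log file (or any iterable of lines) to `find_adoption_failures` and printing the tuples should show which pods could not be adopted and with which HTTP status; a 403 points at missing RBAC permissions.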
Because the scheduler was not able to adopt the pods, it was sending SIGTERM
to pods that were actually still running but had probably become orphaned.
After chasing a lot of loose ends, we were able to fix it by adding the
"patch" permission for pods to the RBAC role we have for Airflow in the helm
chart, and by cleaning up old pending pods. (The "patch" permission seems to
be missing even from the stable Airflow helm chart:
https://github.com/helm/charts/blob/master/stable/airflow/templates/rbac/airflow-role.yaml)
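To make the change concrete, here is a rough Python sketch that builds a Role manifest including the "patch" verb and prints it as JSON. Only the "patch" verb and the "cbdp" namespace come from our setup; the role name `airflow-pod-role` and the other verbs in the list are assumptions for illustration, so compare them with what your chart already grants.

```python
import json


def airflow_pod_role(namespace, name="airflow-pod-role"):
    """Build a Role manifest that includes the "patch" verb on pods.

    The role name and every verb other than "patch" are illustrative
    assumptions, not copied from any particular chart.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # core API group, where pods live
                "resources": ["pods"],
                # "patch" is the verb the 403 response reports as missing
                "verbs": ["create", "get", "list", "watch", "delete", "patch"],
            }
        ],
    }


print(json.dumps(airflow_pod_role("cbdp"), indent=2))
```

kubectl accepts JSON manifests, so the output can be saved to a file and applied with `kubectl apply -f`. The role still has to be bound to the scheduler's service account through a RoleBinding. To check the result, `kubectl auth can-i patch pods --as=system:serviceaccount:cbdp:airflow-cbdp -n cbdp` should answer "yes" once the permission is in place.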
Since this change we have seen more stable behavior: no SIGTERM has been
raised so far (and hopefully none will be :) ), and there have been no more
failure messages about adopting pods.
Hope this helps someone; if not, feel free to disregard it.
We will keep monitoring the behavior and, if this turns out to be the
permanent fix, will consider contributing it back to the Airflow helm chart.