benvit92 edited a comment on issue #18041:
URL: https://github.com/apache/airflow/issues/18041#issuecomment-945741654


   Hello all,
   
   Not sure if this is helpful, but here is what we found.
   
   We had the same issue with Airflow 2.1.0 deployed on AKS (Azure Kubernetes 
Service). We also noticed that many pods were not being cleaned up after 
completion (Success, Error, or even CrashLoopBackOff without loading the correct 
pod template). While searching the scheduler logs for tasks that had received 
SIGTERM, I noticed entries like the following:
   ```
   {"timestamp": "2021-10-18T10:11:11.363282Z", "level": "INFO", "name": 
"airflow.executors.kubernetes_executor.KubernetesExecutor", "message": "Failed 
to adopt pod cbtbpartyczpartybctranslation.fa5c434191a048cdb9ad9aa747e0f3e9. 
Reason: (403)\nReason: Forbidden\nHTTP response headers: 
HTTPHeaderDict({'Audit-Id': '427c0455-2a35-4eec-bd15-b2a2e1d82639', 
'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 
'X-Content-Type-Options': 'nosniff', 'Date': 'Mon, 18 Oct 2021 10:11:11 GMT', 
'Content-Length': '421'})\nHTTP response body: 
{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"pods
 \\\"cbtbpartyczpartybctranslation.fa5c434191a048cdb9ad9aa747e0f3e9\\\" is 
forbidden: User \\\"system:serviceaccount:cbdp:airflow-cbdp\\\" cannot patch 
resource \\\"pods\\\" in API group \\\"\\\" in the namespace 
\\\"cbdp\\\"\",\"reason\":\"Forbidden\",\"details\":{\"name\":\"cbtbpartyczpartybctranslation.fa5c434191a048cdb9ad9aa747e0f3e9\",\"kind\":\"pods\"},\"code\":403}\n\n"}
   ```
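   For anyone who wants to check their own scheduler logs for the same 
symptom, here is a rough Python sketch. It assumes the scheduler writes one JSON 
object per line with `timestamp` and `message` fields, as in the entry above; the 
function name `find_adoption_failures` and the regular expression are made up for 
this illustration and are not part of Airflow.

```python
import json
import re

# Matches the message format seen in the log entry above (an assumption:
# other Airflow versions may word this message differently).
ADOPT_FAILURE = re.compile(
    r"Failed to adopt pod (?P<pod>\S+)\. Reason: \((?P<status>\d+)\)"
)


def find_adoption_failures(lines):
    """Yield (timestamp, pod_name, http_status) for each failed pod adoption."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON log records
        if not isinstance(record, dict):
            continue
        match = ADOPT_FAILURE.search(record.get("message", ""))
        if match:
            yield record.get("timestamp"), match["pod"], int(match["status"])


if __name__ == "__main__":
    import sys

    # Usage: python find_adoption_failures.py < scheduler.log
    for timestamp, pod, status in find_adoption_failures(sys.stdin):
        print(f"{timestamp} {pod} HTTP {status}")
```

   A status of 403 in the output points at a permissions problem like the one 
shown above rather than at a missing pod.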
   Because the scheduler could not adopt the pods, it was sending SIGTERM to 
pods that were actually still running but had probably become orphaned. 
   After chasing a lot of loose ends, we fixed it by adding the "patch" verb 
for pods to the RBAC role that our Helm chart defines for Airflow (this seems to 
be missing even from the official airflow helm chart 
https://github.com/helm/charts/blob/master/stable/airflow/templates/rbac/airflow-role.yaml)
 and by cleaning up old pending pods.
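   To make the change concrete, here is a minimal Python sketch that builds a 
Role manifest with "patch" included for pods and prints it as JSON (which 
`kubectl apply -f` also accepts). The role name `airflow-pod-role`, the helper 
names, and the verbs other than "patch" are assumptions for illustration only; 
the namespace `cbdp` is taken from the log entry above. Compare against the role 
your own chart renders rather than copying this as-is.

```python
import json


def build_pod_role(name, namespace):
    """Return a namespaced RBAC Role that lets its subjects manage pods."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group, where pods live
                "resources": ["pods"],
                # "patch" is the verb the 403 above says was missing; the
                # remaining verbs are an illustrative assumption.
                "verbs": ["create", "get", "list", "watch", "delete", "patch"],
            }
        ],
    }


def allows(role, resource, verb):
    """Check whether any core-group rule in the role grants verb on resource."""
    return any(
        "" in rule["apiGroups"]
        and resource in rule["resources"]
        and verb in rule["verbs"]
        for rule in role["rules"]
    )


if __name__ == "__main__":
    role = build_pod_role("airflow-pod-role", "cbdp")
    print(json.dumps(role, indent=2))
```

   The role still has to be bound to the scheduler's service account 
(`airflow-cbdp` in the log above) through a RoleBinding, which is not shown here.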
   
   After this change we are seeing more stable behavior: no SIGTERM has been 
raised so far (and hopefully none will be :) ), and there have been no more 
failure messages about adopting pods.
   
   Hope this helps someone; if not, feel free to disregard it.
   We will keep monitoring the behavior and, if this turns out to be the 
permanent fix, will consider contributing it back to the Airflow Helm chart.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

