[ 
https://issues.apache.org/jira/browse/AIRFLOW-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691099#comment-16691099
 ] 

ASF GitHub Bot commented on AIRFLOW-2966:
-----------------------------------------

johnhofman opened a new pull request #4209: [AIRFLOW-2966] Catch ApiException 
in the Kubernetes Executor
URL: https://github.com/apache/incubator-airflow/pull/4209
 
 
   ### Description
   
   Creating a pod that exceeds a namespace's resource quota throws an 
ApiException. This change catches the exception and re-queues the task 
inside the Executor instead of letting the exception kill the scheduler.
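
The catch-and-re-queue behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: `SketchExecutor`, `launch_pod`, and `task_queue` are hypothetical names, and the local `ApiException` class is only a stand-in for `kubernetes.client.rest.ApiException` so that the sketch runs without the Kubernetes client installed.

```python
from collections import deque


class ApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException (assumption for this sketch)."""

    def __init__(self, status, reason):
        super().__init__("({}) Reason: {}".format(status, reason))
        self.status = status
        self.reason = reason


class SketchExecutor:
    """Toy executor holding a queue of tasks and a pod-launching callable."""

    def __init__(self, launch_pod):
        # launch_pod is a hypothetical callable that creates the pod for a task.
        self.launch_pod = launch_pod
        self.task_queue = deque()

    def sync(self):
        """Try each currently queued task once; re-queue tasks whose pod creation fails."""
        for _ in range(len(self.task_queue)):
            task = self.task_queue.popleft()
            try:
                self.launch_pod(task)
            except ApiException:
                # Keep the exception away from the scheduler loop and
                # put the task back so a later sync() can retry it.
                self.task_queue.append(task)
```

Iterating over a snapshot of the queue length means a task that keeps failing is retried once per `sync()` call rather than spinning in a tight loop while the quota is exhausted.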
   
   `click 7.0` was recently released but `flask-appbuilder 1.11.1 has 
requirement click==6.7`. I have pinned `click==6.7` to make the dependencies 
resolve.
   
   ### Tests
   
   This adds a single test, `TestKubernetesExecutor.test_run_next_exception`, 
that covers this scenario. Without the changes, the test fails because the 
ApiException is not caught. 
   
   This is the first test case for the `KubernetesExecutor`, so I needed to 
add the `[kubernetes]` section to `default_test.cfg` so that the 
`KubernetesExecutor` can be built without exceptions.
   
   Jira ticket: https://issues.apache.org/jira/browse/AIRFLOW-2966
   
   



> KubernetesExecutor + namespace quotas kills scheduler if the pod can't be 
> launched
> ----------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-2966
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2966
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.0.0
>         Environment: Kubernetes 1.9.8
>            Reporter: John Hofman
>            Assignee: John Hofman
>            Priority: Major
>             Fix For: 2.0.0
>
>
> When running Airflow in Kubernetes with the KubernetesExecutor and resource 
> quotas set on the namespace Airflow is deployed in, if the scheduler tries 
> to launch a pod that would exceed the namespace limits, it gets an 
> ApiException and the scheduler crashes.
> This stack trace is an example of the ApiException from the kubernetes client:
> {code:java}
> [2018-08-27 09:51:08,516] {pod_launcher.py:58} ERROR - Exception when 
> attempting to create Namespaced Pod.
> Traceback (most recent call last):
> File "/src/apache-airflow/airflow/contrib/kubernetes/pod_launcher.py", line 
> 55, in run_pod_async
> resp = self._client.create_namespaced_pod(body=req, namespace=pod.namespace)
> File 
> "/usr/local/lib/python3.6/site-packages/kubernetes/client/apis/core_v1_api.py",
>  line 6057, in create_namespaced_pod
> (data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)
> File 
> "/usr/local/lib/python3.6/site-packages/kubernetes/client/apis/core_v1_api.py",
>  line 6142, in create_namespaced_pod_with_http_info
> collection_formats=collection_formats)
> File 
> "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", 
> line 321, in call_api
> _return_http_data_only, collection_formats, _preload_content, 
> _request_timeout)
> File 
> "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", 
> line 155, in __call_api
> _request_timeout=_request_timeout)
> File 
> "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", 
> line 364, in request
> body=body)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/rest.py", line 
> 266, in POST
> body=body)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/rest.py", line 
> 222, in request
> raise ApiException(http_resp=r)
> kubernetes.client.rest.ApiException: (403)
> Reason: Forbidden
> HTTP response headers: HTTPHeaderDict({'Audit-Id': 
> 'b00e2cbb-bdb2-41f3-8090-824aee79448c', 'Content-Type': 'application/json', 
> 'Date': 'Mon, 27 Aug 2018 09:51:08 GMT', 'Content-Length': '410'})
> HTTP response body: 
> {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods
>  \"podname-ec366e89ef934d91b2d3ffe96234a725\" is forbidden: exceeded quota: 
> compute-resources, requested: limits.memory=4Gi, used: limits.memory=6508Mi, 
> limited: 
> limits.memory=10Gi","reason":"Forbidden","details":{"name":"podname-ec366e89ef934d91b2d3ffe96234a725","kind":"pods"},"code":403}{code}
>  
> I would expect the scheduler to catch the Exception and at least mark the 
> task as failed, or better yet retry the task later.
>  
>  
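
The HTTP response body in the trace above is a JSON `Status` object, so a caller deciding between failing and retrying a task could inspect it. The following is a rough sketch under assumptions: `is_quota_exceeded` is a hypothetical helper, and matching on the text `exceeded quota` is a heuristic based only on the message shown in this trace, not a documented contract of the Kubernetes API.

```python
import json


def is_quota_exceeded(status, body):
    """Return True if a failed API response looks like a resource-quota rejection."""
    if status != 403:
        return False
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # Body is missing or not valid JSON, so nothing can be inferred from it.
        return False
    if not isinstance(payload, dict):
        return False
    # Heuristic: the quota rejection in the trace carries this phrase in "message".
    return "exceeded quota" in str(payload.get("message", ""))
```

A caller might re-queue the task when this returns True and treat other failures as errors, which corresponds to the "retry the task later" behaviour requested in the issue.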



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
