GitHub user ryandutton edited a comment on the discussion: Failed scheduler 
liveness check on GKE during Kubernetes master upgrade

I have a dedicated pod running the scheduler, alongside four workers. Most 
of our jobs run using the `KubernetesPodOperator` from the cncf-kubernetes 
provider. The scheduler configuration defines which executor is in use; in my 
case it's `executor = KubernetesExecutor`, the same as the example in 
[this](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/index.html#executor) 
Airflow document. This implies to me that the scheduler is very much aware 
it's running on k8s.

While the master is unavailable, the `airflow jobs check` CLI command fails, 
which triggers the following Kubernetes events:

```
5m37s       Warning   Unhealthy   pod/airflow-scheduler-6fb8b5b769-gz8vp   Liveness probe failed: No alive jobs found.
5m37s       Normal    Killing     pod/airflow-scheduler-6fb8b5b769-gz8vp   Container master failed liveness probe, will be restarted
```
I guess that while the master is unavailable Airflow can't schedule jobs, so 
some failure is expected; however, the check feels quite sensitive. These 
upgrades typically take around 4-5 minutes. We could increase the interval 
between liveness checks or raise the failure threshold, but I feel that could 
mask other scheduler issues which aren't caused by k8s master unavailability.
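
To put rough numbers on that trade-off, here is a minimal sketch of the probe arithmetic. It assumes the usual Kubernetes behaviour that a container is restarted after `failureThreshold` consecutive failed probes spaced `periodSeconds` apart, and it ignores `timeoutSeconds`, scheduler heartbeat lag, and startup time. The function names and the 60-second period are hypothetical, chosen only for illustration.

```python
import math


def min_failure_threshold(outage_seconds: int, period_seconds: int) -> int:
    """Smallest failureThreshold that survives an outage of the given length.

    At most floor(outage / period) + 1 probes can land inside the outage,
    and the container is killed on the failureThreshold-th consecutive
    failure, so the threshold must exceed that count by one.
    """
    probes_during_outage = math.floor(outage_seconds / period_seconds) + 1
    return probes_during_outage + 1


def worst_case_detection_delay(failure_threshold: int, period_seconds: int) -> int:
    """Upper bound (seconds) before a genuinely hung scheduler is restarted."""
    return failure_threshold * period_seconds


if __name__ == "__main__":
    period = 60          # hypothetical periodSeconds
    outage = 5 * 60      # upper end of the 4-5 minute upgrade window
    threshold = min_failure_threshold(outage, period)
    delay = worst_case_detection_delay(threshold, period)
    print(f"failureThreshold >= {threshold}, worst-case detection delay {delay}s")
```

Under these assumptions, riding out a 5-minute outage with a 60-second period needs a threshold of 7, which also means a scheduler that hangs for an unrelated reason could go up to 7 minutes before being restarted. That is the masking effect I'm worried about.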

Here is a log from the time of the unavailability:

```
Traceback (most recent call last):
  File "/usr/local/home/lib/python3.10/site-packages/airflow/cli/commands/scheduler_command.py", line 47, in _run_scheduler_job
    run_job(job=job_runner.job, execute_callable=job_runner._execute)
  File "/usr/local/home/lib/python3.10/site-packages/airflow/utils/session.py", line 77, in wrapper
    return func(*args, session=session, **kwargs)
  File "/usr/local/home/lib/python3.10/site-packages/airflow/jobs/job.py", line 289, in run_job
    return execute_job(job, execute_callable=execute_callable)
  File "/usr/local/home/lib/python3.10/site-packages/airflow/jobs/job.py", line 318, in execute_job
    ret = execute_callable()
  File "/usr/local/home/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 845, in _execute
    self._run_scheduler_loop()
  File "/usr/local/home/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 929, in _run_scheduler_loop
    self.adopt_or_reset_orphaned_tasks()
  File "/usr/local/home/lib/python3.10/site-packages/airflow/utils/session.py", line 77, in wrapper
    return func(*args, session=session, **kwargs)
  File "/usr/local/home/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 1589, in adopt_or_reset_orphaned_tasks
    for attempt in run_with_db_retries(logger=self.log):
  File "/usr/local/home/lib/python3.10/site-packages/tenacity/__init__.py", line 347, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/home/lib/python3.10/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
  File "/usr/local/home/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/home/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/home/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 1634, in adopt_or_reset_orphaned_tasks
    to_reset = self.job.executor.try_adopt_task_instances(tis_to_adopt_or_reset)
  File "/usr/local/home/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py", line 546, in try_adopt_task_instances
    self._adopt_completed_pods(kube_client)
  File "/usr/local/home/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py", line 647, in _adopt_completed_pods
    pod_list = self._list_pods(query_kwargs)
  File "/usr/local/home/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py", line 179, in _list_pods
    pods = self.kube_client.list_namespaced_pod(
  File "/usr/local/home/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 15697, in list_namespaced_pod
    return self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
  File "/usr/local/home/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 15812, in list_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/usr/local/home/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/home/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/home/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/usr/local/home/lib/python3.10/site-packages/kubernetes/client/rest.py", line 240, in GET
    return self.request("GET", url,
  File "/usr/local/home/lib/python3.10/site-packages/kubernetes/client/rest.py", line 213, in request
    r = self.pool_manager.request(method, url,
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/request.py", line 74, in request
    return self.request_encode_url(
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/request.py", line 96, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/connectionpool.py", line 826, in urlopen
    return self.urlopen(
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/connectionpool.py", line 826, in urlopen
    return self.urlopen(
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/connectionpool.py", line 826, in urlopen
    return self.urlopen(
  [Previous line repeated 7 more times]
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/connectionpool.py", line 798, in urlopen
    retries = retries.increment(
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='1.2.3.4', port=443): Max retries exceeded with url: /api/v1/namespaces/data-platform-airflow/pods?fieldSelector=status.phase%3DSucceeded&labelSelector=kubernetes_executor%3DTrue%2Cairflow-worker%21%3D8214%2Cairflow_executor_done%21%3DTrue (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')))
```
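
The traceback shows the `MaxRetryError` from the pod-listing call propagating all the way out of the scheduler loop. For illustration only, here is a minimal sketch of how a transient API-server error could be tolerated with a bounded exponential backoff instead. This is not how Airflow's `KubernetesExecutor` is implemented; `call_with_backoff` and all of its parameters are hypothetical. The default catches the standard-library `ConnectionError` family (which includes `ConnectionResetError`); wrapping the real client call would presumably also need `urllib3.exceptions.MaxRetryError` in `retriable`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retriable: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_wait_seconds: float = 360.0,   # a little longer than a 4-5 minute upgrade
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying retriable errors until max_wait_seconds elapses."""
    deadline = time.monotonic() + max_wait_seconds
    delay = initial_delay
    while True:
        try:
            return func()
        except retriable:
            # Give up (re-raise the original error) once the next wait
            # would push us past the overall deadline.
            if time.monotonic() + delay > deadline:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_list_pods() -> str:
        # Stand-in for the Kubernetes API call; fails twice, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionResetError(104, "Connection reset by peer")
        return "pod list"

    print(call_with_backoff(flaky_list_pods, sleep=lambda _: None))
```

The overall deadline keeps the retry bounded, so a control plane that never comes back still surfaces as an error rather than an indefinite hang.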

GitHub link: 
https://github.com/apache/airflow/discussions/35918#discussioncomment-7716492
