GitHub user ryandutton edited a comment on the discussion: Failed scheduler liveness check on GKE during Kubernetes master upgrade
I have a dedicated pod running the scheduler, with four workers running. Most of our jobs use the `KubernetesPodOperator` from the cncf-kubernetes provider. The scheduler configuration defines which executor is in use; in my case it's `executor = KubernetesExecutor`, the same as the example in [this](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/index.html#executor) Airflow document. This implies to me that the scheduler is very much aware it's running on k8s.

During periods when the master is unavailable, the `airflow jobs check` CLI command triggers the following Kubernetes events:

```
5m37s  Warning  Unhealthy  pod/airflow-scheduler-6fb8b5b769-gz8vp  Liveness probe failed: No alive jobs found.
5m37s  Normal   Killing    pod/airflow-scheduler-6fb8b5b769-gz8vp  Container master failed liveness probe, will be restarted
```

I understand that when the master is unavailable Airflow can't schedule jobs, but the check still feels quite sensitive. These upgrades typically take around 4-5 minutes. We could increase the period between liveness checks or raise the failure threshold, but I feel we could be masking other potential scheduler issues which aren't caused by k8s master unavailability. Here is a log from the time of the unavailability.
```
Traceback (most recent call last):
  File "/usr/local/home/lib/python3.10/site-packages/airflow/cli/commands/scheduler_command.py", line 47, in _run_scheduler_job
    run_job(job=job_runner.job, execute_callable=job_runner._execute)
  File "/usr/local/home/lib/python3.10/site-packages/airflow/utils/session.py", line 77, in wrapper
    return func(*args, session=session, **kwargs)
  File "/usr/local/home/lib/python3.10/site-packages/airflow/jobs/job.py", line 289, in run_job
    return execute_job(job, execute_callable=execute_callable)
  File "/usr/local/home/lib/python3.10/site-packages/airflow/jobs/job.py", line 318, in execute_job
    ret = execute_callable()
  File "/usr/local/home/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 845, in _execute
    self._run_scheduler_loop()
  File "/usr/local/home/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 929, in _run_scheduler_loop
    self.adopt_or_reset_orphaned_tasks()
  File "/usr/local/home/lib/python3.10/site-packages/airflow/utils/session.py", line 77, in wrapper
    return func(*args, session=session, **kwargs)
  File "/usr/local/home/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 1589, in adopt_or_reset_orphaned_tasks
    for attempt in run_with_db_retries(logger=self.log):
  File "/usr/local/home/lib/python3.10/site-packages/tenacity/__init__.py", line 347, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/home/lib/python3.10/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
  File "/usr/local/home/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/home/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/home/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 1634, in adopt_or_reset_orphaned_tasks
    to_reset = self.job.executor.try_adopt_task_instances(tis_to_adopt_or_reset)
  File "/usr/local/home/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py", line 546, in try_adopt_task_instances
    self._adopt_completed_pods(kube_client)
  File "/usr/local/home/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py", line 647, in _adopt_completed_pods
    pod_list = self._list_pods(query_kwargs)
  File "/usr/local/home/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py", line 179, in _list_pods
    pods = self.kube_client.list_namespaced_pod(
  File "/usr/local/home/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 15697, in list_namespaced_pod
    return self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
  File "/usr/local/home/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 15812, in list_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/usr/local/home/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/home/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/home/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/usr/local/home/lib/python3.10/site-packages/kubernetes/client/rest.py", line 240, in GET
    return self.request("GET", url,
  File "/usr/local/home/lib/python3.10/site-packages/kubernetes/client/rest.py", line 213, in request
    r = self.pool_manager.request(method, url,
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/request.py", line 74, in request
    return self.request_encode_url(
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/request.py", line 96, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/connectionpool.py", line 826, in urlopen
    return self.urlopen(
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/connectionpool.py", line 826, in urlopen
    return self.urlopen(
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/connectionpool.py", line 826, in urlopen
    return self.urlopen(
  [Previous line repeated 7 more times]
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/connectionpool.py", line 798, in urlopen
    retries = retries.increment(
  File "/usr/local/home/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='1.2.3.4', port=443): Max retries exceeded with url: /api/v1/namespaces/data-platform-airflow/pods?fieldSelector=status.phase%3DSucceeded&labelSelector=kubernetes_executor%3DTrue%2Cairflow-worker%21%3D8214%2Cairflow_executor_done%21%3DTrue (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')))
```

GitHub link: https://github.com/apache/airflow/discussions/35918#discussioncomment-7716492
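For reference, if loosening the probe turns out to be the right trade-off, the scheduler liveness probe timings are configurable; below is a sketch of a values fragment, assuming a Helm-chart-style deployment that exposes `scheduler.livenessProbe` (the exact keys depend on your chart and version, so verify before applying):

```yaml
# values.yaml fragment -- assumed structure, check your chart's documented keys.
scheduler:
  livenessProbe:
    periodSeconds: 60      # run the check once a minute instead of the default cadence
    timeoutSeconds: 20
    failureThreshold: 10   # tolerate ~10 minutes of failures, covering a 4-5 minute
                           # master upgrade with headroom
```

The trade-off stated above still applies: `periodSeconds * failureThreshold` bounds how long a genuinely dead scheduler can go unrestarted, so widening it to survive master upgrades also delays recovery from real scheduler failures.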
