sgomezf opened a new issue, #31648: URL: https://github.com/apache/airflow/issues/31648
### Apache Airflow version

2.6.1

### What happened

After installing 2.6.1 with the fix from https://github.com/apache/airflow/pull/31391, our DAGs run normally unless a task takes more than 60 minutes. Past that point the task stops reporting logs and status, and even if the work completes inside the job pod, the task is always marked as failed because of "Unauthorized" errors.

Log of the job starting and authenticating (some info redacted):

```
[2023-05-31, 07:00:17 UTC] {base.py:73} INFO - Using connection ID 'gcp_conn' for task execution.
[2023-05-31, 07:00:17 UTC] {kubernetes_engine.py:288} INFO - Fetching cluster (project_id=<PROJECT-ID>, location=<REGION>, cluster_name=<CLUSTER-NAME>)
[2023-05-31, 07:00:17 UTC] {credentials_provider.py:323} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2023-05-31, 07:00:17 UTC] {_default.py:213} DEBUG - Checking None for explicit credentials as part of auth process...
[2023-05-31, 07:00:17 UTC] {_default.py:186} DEBUG - Checking Cloud SDK credentials as part of auth process...
[2023-05-31, 07:00:17 UTC] {_default.py:192} DEBUG - Cloud SDK credentials not found on disk; not using them
[2023-05-31, 07:00:17 UTC] {_http_client.py:104} DEBUG - Making request: GET http://169.254.169.254
[2023-05-31, 07:00:17 UTC] {_http_client.py:104} DEBUG - Making request: GET http://metadata.google.internal/computeMetadata/v1/project/project-id
[2023-05-31, 07:00:17 UTC] {requests.py:192} DEBUG - Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true
[2023-05-31, 07:00:17 UTC] {requests.py:192} DEBUG - Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/airflow@<PROJECT-ID>.iam.gserviceaccount.com/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform
[2023-05-31, 07:00:17 UTC] {pod.py:769} DEBUG - Creating pod for KubernetesPodOperator task update
[2023-05-31, 07:00:17 UTC] {pod.py:850} INFO - Building pod mydag-vfqdiqzm with labels: {'dag_id': 'mydag', 'task_id': 'update', 'run_id': 'scheduled__2023-05-30T0700000000-fa9d70c83', 'kubernetes_pod_operator': 'True', 'try_number': '1'}
[2023-05-31, 07:00:17 UTC] {base.py:73} INFO - Using connection ID 'google_cloud_default' for task execution.
[2023-05-31, 07:00:17 UTC] {credentials_provider.py:323} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2023-05-31, 07:00:17 UTC] {_default.py:213} DEBUG - Checking None for explicit credentials as part of auth process...
[2023-05-31, 07:00:17 UTC] {_default.py:186} DEBUG - Checking Cloud SDK credentials as part of auth process...
[2023-05-31, 07:00:17 UTC] {_default.py:192} DEBUG - Cloud SDK credentials not found on disk; not using them
[2023-05-31, 07:00:17 UTC] {_http_client.py:104} DEBUG - Making request: GET http://169.254.169.254
[2023-05-31, 07:00:17 UTC] {_http_client.py:104} DEBUG - Making request: GET http://metadata.google.internal/computeMetadata/v1/project/project-id
[2023-05-31, 07:00:17 UTC] {requests.py:192} DEBUG - Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true
[2023-05-31, 07:00:17 UTC] {requests.py:192} DEBUG - Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/airflow@<PROJECT-ID>.iam.gserviceaccount.com/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform
[2023-05-31, 07:00:17 UTC] {rest.py:231} DEBUG - response body: {"kind":"PodList","apiVersion":"v1","metadata":{"resourceVersion":"797683242"},"items":[]}
[2023-05-31, 07:00:17 UTC] {pod.py:500} DEBUG - Starting pod:
api_version: v1
kind: Pod
metadata:
  annotations: {}
  cluster_name: null
...
```

Periodically we can see heartbeats and status:

```
[2023-05-31, 07:06:48 UTC] {rest.py:231} DEBUG - response body: {"kind":"Pod","apiVersion":"v1","metadata":{"name":"mydag-vfqdiqzm","namespace":"airflow-namespace","uid":"314deb6c-3c2b-41ae-b49f-1e0c89cf6950","resourceVersion":"797683421","creationTimestamp":"2023-05-31T07:00:17Z"...<REDACTING DETAILS POD>}
[2023-05-31, 07:06:48 UTC] {taskinstance.py:789} DEBUG - Refreshing TaskInstance <TaskInstance: mydag.update scheduled__2023-05-30T07:00:00+00:00 [running]> from DB
[2023-05-31, 07:06:48 UTC] {job.py:213} DEBUG - [heartbeat]
```

Exactly at the 1 hour mark this error occurs:

```
[2023-05-31, 08:00:55 UTC] {rest.py:231} DEBUG - response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
[2023-05-31, 08:00:56 UTC] {rest.py:231} DEBUG - response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
[2023-05-31, 08:00:58 UTC] {taskinstance.py:789} DEBUG - Refreshing TaskInstance <TaskInstance: mydag.update scheduled__2023-05-30T07:00:00+00:00 [running]> from DB
[2023-05-31, 08:00:58 UTC] {rest.py:231} DEBUG - response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
[2023-05-31, 08:00:58 UTC] {job.py:213} DEBUG - [heartbeat]
[2023-05-31, 08:00:58 UTC] {rest.py:231} DEBUG - response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
[2023-05-31, 08:00:58 UTC] {pod.py:905} ERROR - (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'a9cfb5cd-9915-4490-8813-e8392a0e20d2', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Wed, 31 May 2023 08:00:58 GMT', 'Content-Length': '129'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 543, in execute_sync
    self.pod_manager.fetch_container_logs(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 361, in fetch_container_logs
    last_log_time = consume_logs(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 339, in consume_logs
    for raw_line in logs:
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 166, in __iter__
    if not self.logs_available():
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 182, in logs_available
    remote_pod = self.read_pod()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 200, in read_pod
    self.read_pod_cache = self.pod_manager.read_pod(self.pod)
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 490, in read_pod
    return self._client.read_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 23483, in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs) # noqa: E501
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 23570, in read_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 240, in GET
    return self.request("GET", url,
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)
```

The job retries with the same result and is marked as failed.

### What you think should happen instead

The operator continues to report the pod's logs and status until completion (even if it takes over 1 hour), and the job is marked as successful.

### How to reproduce

Create a DAG that uses GKEStartPodOperator with a task that takes over one hour.
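
As a quick sanity check on the "1 hour mark" observation, the gap between the token request and the first 401 can be computed from the log timestamps quoted in this report. This is only a sketch: the two timestamps are copied from the logs, while the one-hour token lifetime is an assumption that the logs themselves do not state, and the helper name `parse_log_time` is made up for illustration.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the Airflow task log lines in this report,
# e.g. "2023-05-31, 07:00:17 UTC".
LOG_TIME_FORMAT = "%Y-%m-%d, %H:%M:%S %Z"

# ASSUMPTION: the access token fetched at task start is valid for one hour.
# The report does not state the token lifetime; this value is illustrative.
ASSUMED_TOKEN_LIFETIME = timedelta(hours=1)


def parse_log_time(stamp: str) -> datetime:
    """Parse a timestamp copied from a log line (hypothetical helper)."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


# Time of the token request against the metadata server (from the logs).
token_requested = parse_log_time("2023-05-31, 07:00:17 UTC")
# Time of the first "Unauthorized" response shown in the logs.
first_unauthorized = parse_log_time("2023-05-31, 08:00:55 UTC")

elapsed = first_unauthorized - token_requested
overshoot = elapsed - ASSUMED_TOKEN_LIFETIME

print(f"elapsed since token request: {elapsed}")
print(f"time past assumed lifetime:  {overshoot}")
```

If the assumption holds, the first failure lands within a minute of the token's expiry, which would be consistent with the credentials being fetched once at startup and not refreshed afterwards. The logs alone do not confirm that this is the cause.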
### Operating System

cos_containerd

### Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes==5.2.2
apache-airflow-providers-google==10.1.1

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

_No response_

### Anything else

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
