sgomezf opened a new issue, #31648:
URL: https://github.com/apache/airflow/issues/31648

   ### Apache Airflow version
   
   2.6.1
   
   ### What happened
   
   After installing 2.6.1 with the fix from https://github.com/apache/airflow/pull/31391, 
our DAGs run normally unless a task takes more than 60 minutes. Past that point 
the task stops reporting log and status, and even if the work completes inside 
the job pod, the task is always marked as failed because of "Unauthorized" errors.
   
   Log of the job starting and authenticating (some info redacted):
   
   ```
   [2023-05-31, 07:00:17 UTC] {base.py:73} INFO - Using connection ID 
'gcp_conn' for task execution.
   [2023-05-31, 07:00:17 UTC] {kubernetes_engine.py:288} INFO - Fetching 
cluster (project_id=<PROJECT-ID>, location=<REGION>, 
cluster_name=<CLUSTER-NAME>)
   [2023-05-31, 07:00:17 UTC] {credentials_provider.py:323} INFO - Getting 
connection using `google.auth.default()` since no key file is defined for hook.
   [2023-05-31, 07:00:17 UTC] {_default.py:213} DEBUG - Checking None for 
explicit credentials as part of auth process...
   [2023-05-31, 07:00:17 UTC] {_default.py:186} DEBUG - Checking Cloud SDK 
credentials as part of auth process...
   [2023-05-31, 07:00:17 UTC] {_default.py:192} DEBUG - Cloud SDK credentials 
not found on disk; not using them
   [2023-05-31, 07:00:17 UTC] {_http_client.py:104} DEBUG - Making request: GET 
http://169.254.169.254
   [2023-05-31, 07:00:17 UTC] {_http_client.py:104} DEBUG - Making request: GET 
http://metadata.google.internal/computeMetadata/v1/project/project-id
   [2023-05-31, 07:00:17 UTC] {requests.py:192} DEBUG - Making request: GET 
http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true
   [2023-05-31, 07:00:17 UTC] {requests.py:192} DEBUG - Making request: GET 
http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/airflow@<PROJECT-ID>.iam.gserviceaccount.com/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform
   [2023-05-31, 07:00:17 UTC] {pod.py:769} DEBUG - Creating pod for 
KubernetesPodOperator task update
   [2023-05-31, 07:00:17 UTC] {pod.py:850} INFO - Building pod mydag-vfqdiqzm 
with labels: {'dag_id': 'mydag', 'task_id': 'update', 'run_id': 
'scheduled__2023-05-30T0700000000-fa9d70c83', 'kubernetes_pod_operator': 
'True', 'try_number': '1'}
   [2023-05-31, 07:00:17 UTC] {base.py:73} INFO - Using connection ID 
'google_cloud_default' for task execution.
   [2023-05-31, 07:00:17 UTC] {credentials_provider.py:323} INFO - Getting 
connection using `google.auth.default()` since no key file is defined for hook.
   [2023-05-31, 07:00:17 UTC] {_default.py:213} DEBUG - Checking None for 
explicit credentials as part of auth process...
   [2023-05-31, 07:00:17 UTC] {_default.py:186} DEBUG - Checking Cloud SDK 
credentials as part of auth process...
   [2023-05-31, 07:00:17 UTC] {_default.py:192} DEBUG - Cloud SDK credentials 
not found on disk; not using them
   [2023-05-31, 07:00:17 UTC] {_http_client.py:104} DEBUG - Making request: GET 
http://169.254.169.254
   [2023-05-31, 07:00:17 UTC] {_http_client.py:104} DEBUG - Making request: GET 
http://metadata.google.internal/computeMetadata/v1/project/project-id
   [2023-05-31, 07:00:17 UTC] {requests.py:192} DEBUG - Making request: GET 
http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true
   [2023-05-31, 07:00:17 UTC] {requests.py:192} DEBUG - Making request: GET 
http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/airflow@<PROJECT-ID>.iam.gserviceaccount.com/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform
   [2023-05-31, 07:00:17 UTC] {rest.py:231} DEBUG - response body: 
{"kind":"PodList","apiVersion":"v1","metadata":{"resourceVersion":"797683242"},"items":[]}
   [2023-05-31, 07:00:17 UTC] {pod.py:500} DEBUG - Starting pod:
   api_version: v1
   kind: Pod
   metadata:
     annotations: {}
     cluster_name: null
   ...
   ```
   
   Periodically we can see heartbeats and status:
   
   ```
   [2023-05-31, 07:06:48 UTC] {rest.py:231} DEBUG - response body: 
{"kind":"Pod","apiVersion":"v1","metadata":{"name":"mydag-vfqdiqzm","namespace":"airflow-namespace","uid":"314deb6c-3c2b-41ae-b49f-1e0c89cf6950","resourceVersion":"797683421","creationTimestamp":"2023-05-31T07:00:17Z"...<REDACTING
 DETAILS POD>}
   [2023-05-31, 07:06:48 UTC] {taskinstance.py:789} DEBUG - Refreshing 
TaskInstance <TaskInstance: mydag.update scheduled__2023-05-30T07:00:00+00:00 
[running]> from DB
   [2023-05-31, 07:06:48 UTC] {job.py:213} DEBUG - [heartbeat]
   
   ```
   
   Just after the 1 hour mark (08:00:55) this error occurs:
   
   ```
   [2023-05-31, 08:00:55 UTC] {rest.py:231} DEBUG - response body: 
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
   [2023-05-31, 08:00:56 UTC] {rest.py:231} DEBUG - response body: 
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
   [2023-05-31, 08:00:58 UTC] {taskinstance.py:789} DEBUG - Refreshing 
TaskInstance <TaskInstance: mydag.update scheduled__2023-05-30T07:00:00+00:00 
[running]> from DB
   [2023-05-31, 08:00:58 UTC] {rest.py:231} DEBUG - response body: 
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
   [2023-05-31, 08:00:58 UTC] {job.py:213} DEBUG - [heartbeat]
   [2023-05-31, 08:00:58 UTC] {rest.py:231} DEBUG - response body: 
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
   [2023-05-31, 08:00:58 UTC] {pod.py:905} ERROR - (401)
   Reason: Unauthorized
   HTTP response headers: HTTPHeaderDict({'Audit-Id': 
'a9cfb5cd-9915-4490-8813-e8392a0e20d2', 'Cache-Control': 'no-cache, private', 
'Content-Type': 'application/json', 'Date': 'Wed, 31 May 2023 08:00:58 GMT', 
'Content-Length': '129'})
   HTTP response body: 
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py",
 line 543, in execute_sync
       self.pod_manager.fetch_container_logs(
     File 
"/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py",
 line 361, in fetch_container_logs
       last_log_time = consume_logs(
     File 
"/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py",
 line 339, in consume_logs
       for raw_line in logs:
     File 
"/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py",
 line 166, in __iter__
       if not self.logs_available():
     File 
"/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py",
 line 182, in logs_available
       remote_pod = self.read_pod()
     File 
"/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py",
 line 200, in read_pod
       self.read_pod_cache = self.pod_manager.read_pod(self.pod)
     File 
"/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 
289, in wrapped_f
       return self(f, *args, **kw)
     File 
"/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 
379, in __call__
       do = self.iter(retry_state=retry_state)
     File 
"/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 
325, in iter
       raise retry_exc.reraise()
     File 
"/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 
158, in reraise
       raise self.last_attempt.result()
     File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in 
result
       return self.__get_result()
     File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in 
__get_result
       raise self._exception
     File 
"/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 
382, in __call__
       result = fn(*args, **kwargs)
     File 
"/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py",
 line 490, in read_pod
       return self._client.read_namespaced_pod(pod.metadata.name, 
pod.metadata.namespace)
     File 
"/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py",
 line 23483, in read_namespaced_pod
       return self.read_namespaced_pod_with_http_info(name, namespace, 
**kwargs)  # noqa: E501
     File 
"/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py",
 line 23570, in read_namespaced_pod_with_http_info
       return self.api_client.call_api(
     File 
"/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py",
 line 348, in call_api
       return self.__call_api(resource_path, method,
     File 
"/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py",
 line 180, in __call_api
       response_data = self.request(
     File 
"/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py",
 line 373, in request
       return self.rest_client.GET(url,
     File 
"/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/rest.py", 
line 240, in GET
       return self.request("GET", url,
     File 
"/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/rest.py", 
line 234, in request
       raise ApiException(http_resp=r)
   ```
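   
   For reference, the gap between the token request at 07:00:17 and the first 401 at 08:00:55 can be checked from the timestamps quoted in the logs above. This is a minimal sketch; the helper name `parse_ts` and the variable names are made up for illustration.

```python
from datetime import datetime, timedelta

# Timestamp format used by the task log lines quoted above.
LOG_TS_FORMAT = "%Y-%m-%d, %H:%M:%S UTC"


def parse_ts(text: str) -> datetime:
    """Parse a task-log timestamp such as '2023-05-31, 07:00:17 UTC'."""
    return datetime.strptime(text, LOG_TS_FORMAT)


# Both values are copied from the log excerpts in this report.
token_requested = parse_ts("2023-05-31, 07:00:17 UTC")  # metadata-server token request
first_401 = parse_ts("2023-05-31, 08:00:55 UTC")        # first "Unauthorized" response

elapsed = first_401 - token_requested
print(elapsed)  # 1:00:38
```

   The 401 responses start 38 seconds past the hour. That would be consistent with a credential that is valid for roughly one hour and is not renewed, though this report does not confirm the cause.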
   
   The job retries with the same result and is marked as failed. 
   
   
   
   ### What you think should happen instead
   
   The task continues to report the pod's logs and status until completion (even 
if it takes over 1 hour), and the job is marked as successful. 
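   
   As an illustration of the expected behaviour only, here is a sketch of a credential holder that re-fetches its token shortly before it expires, so that calls made after the one-hour mark still carry a valid token. It is not Airflow's or the Kubernetes client's code: `RefreshingToken`, `fake_fetch`, the 60-second skew and the 3600-second lifetime are all assumptions made for the example.

```python
import time
from typing import Callable, Optional, Tuple


class RefreshingToken:
    """Hypothetical helper: hands out a bearer token and renews it before expiry."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],
        clock: Callable[[], float] = time.monotonic,
        skew: float = 60.0,
    ) -> None:
        self._fetch = fetch  # returns (token, lifetime in seconds)
        self._clock = clock  # injectable so the example can simulate time
        self._skew = skew    # renew this many seconds before expiry
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._skew:
            self._token, lifetime = self._fetch()
            self._expires_at = now + lifetime
        return self._token


# Simulated clock and token source (assumed 3600 s lifetime).
fake_now = [0.0]
issued = []


def fake_fetch() -> Tuple[str, float]:
    issued.append(f"token-{len(issued) + 1}")
    return issued[-1], 3600.0


tokens = RefreshingToken(fake_fetch, clock=lambda: fake_now[0])
first = tokens.get()      # fetched at t=0
fake_now[0] = 1800.0
halfway = tokens.get()    # still valid, reused
fake_now[0] = 3638.0      # elapsed time at the first 401 in the log above
late = tokens.get()       # past expiry, fetched again
print(first, halfway, late)
```

   With a holder like this, the status and log polling at 08:00:55 would use a newly fetched token instead of the one obtained at 07:00:17.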
   
   ### How to reproduce
   
   Create a DAG that uses GKEStartPodOperator with a task that takes over 
one hour.
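   
   A minimal reproduction sketch, under stated assumptions: the project, region, cluster, image and sleep command below are placeholders (the real values are redacted in this report), and the daily 07:00 schedule is inferred from the run ID in the logs. The Airflow imports are guarded so the file can be read without Airflow installed.

```python
from datetime import datetime

SLEEP_SECONDS = 65 * 60  # 65 minutes, past the 60-minute mark

# Placeholder values; replace with a real project, region and cluster.
POD_KWARGS = dict(
    task_id="update",
    project_id="my-project",          # assumption
    location="my-region",             # assumption
    cluster_name="my-cluster",        # assumption
    gcp_conn_id="gcp_conn",           # connection ID seen in the logs
    namespace="airflow-namespace",
    name="mydag",
    image="busybox",                  # assumption: any image with `sleep`
    cmds=["sh", "-c", f"sleep {SLEEP_SECONDS}"],
    get_logs=True,
)

try:
    from airflow import DAG
    from airflow.providers.google.cloud.operators.kubernetes_engine import (
        GKEStartPodOperator,
    )
except ImportError:
    DAG = None  # Airflow or the Google provider is not installed

if DAG is not None:
    with DAG(
        dag_id="mydag",
        start_date=datetime(2023, 5, 30),
        schedule="0 7 * * *",  # assumption: daily at 07:00 UTC
        catchup=False,
    ) as dag:
        GKEStartPodOperator(**POD_KWARGS)
```

   With the versions listed below, the task would be expected to start failing with 401 responses about an hour after it starts, as in the logs above.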
   
   ### Operating System
   
   cos_containerd
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-cncf-kubernetes==5.2.2
   apache-airflow-providers-google==10.1.1
   
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

