cw-pradyumna opened a new issue, #32744:
URL: https://github.com/apache/airflow/issues/32744

   ### Apache Airflow version
   
   2.6.3
   
   ### What happened
   
   A spark job which runs for >1 hours gets a 410 Expired error regardless of 
what the actual output of the spark application was.
   
   Logs - 
   ```
   [2023-07-21, 19:41:57 IST] {spark_kubernetes.py:126} INFO - 
2023-07-21T14:11:57.424338279Z 23/07/21 19:41:57 INFO MetricsSystemImpl: 
s3a-file-system metrics system stopped.
   [2023-07-21, 19:41:57 IST] {spark_kubernetes.py:126} INFO - 
2023-07-21T14:11:57.424355219Z 23/07/21 19:41:57 INFO MetricsSystemImpl: 
s3a-file-system metrics system shutdown complete.
   [2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor 
[adhoc-d9e5ed897882bd27-exec-1] is pending
   [2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor 
[adhoc-d9e5ed897882bd27-exec-1] is pending
   [2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor 
[adhoc-d9e5ed897882bd27-exec-1] is running
   [2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor 
[adhoc-d9e5ed897882bd27-exec-1] completed
   [2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor 
[adhoc-d9e5ed897882bd27-exec-2] is pending
   [2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor 
[adhoc-d9e5ed897882bd27-exec-2] is running
   [2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor 
[adhoc-d9e5ed897882bd27-exec-3] is pending
   [2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor 
[adhoc-d9e5ed897882bd27-exec-3] is pending
   [2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor 
[adhoc-d9e5ed897882bd27-exec-4] is pending
   [2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor 
[adhoc-d9e5ed897882bd27-exec-4] is running
   [2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor 
[adhoc-d9e5ed897882bd27-exec-3] is running
   [2023-07-21, 19:41:58 IST] {taskinstance.py:1824} ERROR - Task failed with 
exception
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py",
 line 112, in execute
       for event in namespace_event_stream:
     File 
"/home/airflow/.local/lib/python3.8/site-packages/kubernetes/watch/watch.py", 
line 182, in stream
       raise client.rest.ApiException(
   kubernetes.client.exceptions.ApiException: (410)
   Reason: Expired: The resourceVersion for the provided watch is too old.
   [2023-07-21, 19:41:58 IST] {taskinstance.py:1345} INFO - Marking task as 
FAILED. dag_id=adhoc, task_id=submit_job, execution_date=20230721T125127, 
start_date=20230721T125217, end_date=20230721T141158
   [2023-07-21, 19:41:58 IST] {standard_task_runner.py:104} ERROR - Failed to 
execute job 247326 for task submit_job ((410)
   Reason: Expired: The resourceVersion for the provided watch is too old.
   ; 15)
   [2023-07-21, 19:41:58 IST] {local_task_job_runner.py:225} INFO - Task exited 
with return code 1
   [2023-07-21, 19:41:58 IST] {taskinstance.py:2653} INFO - 0 downstream tasks 
scheduled from follow-on schedule check
   ```
   In this case, I am able to see the logs of the task and only the final state 
is wrongly reported.
   
   But in case the task runs for even longer (>4 hours) then even the logs are 
not seen.
   
   ### What you think should happen instead
   
   In the first case, the tasks should have reported the correct state of the 
spark application. In the second case, logs should have been still visible and 
the correct state of the spark application should have been reported
   
   ### How to reproduce
   
   To reproduce it, have a long running (>1 hour) task submitted using the 
SparkKubernetesOperator.
   
   If the task gets completed before 4 hours then you should see a 410 Expired 
error regardless of what the actual output of the spark application was.
   
   If the task takes longer, you should see the task fail around the 4 hour 
mark due to 410 even when the spark application is still running.
   
   ### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==8.2.0
   apache-airflow-providers-celery==3.2.1
   apache-airflow-providers-cncf-kubernetes==7.3.0
   apache-airflow-providers-common-sql==1.5.2
   apache-airflow-providers-docker==3.7.1
   apache-airflow-providers-elasticsearch==4.5.1
   apache-airflow-providers-ftp==3.4.2
   apache-airflow-providers-google==10.2.0
   apache-airflow-providers-grpc==3.2.1
   apache-airflow-providers-hashicorp==3.4.1
   apache-airflow-providers-http==4.4.2
   apache-airflow-providers-imap==3.2.2
   apache-airflow-providers-microsoft-azure==6.1.2
   apache-airflow-providers-mysql==5.1.1
   apache-airflow-providers-odbc==4.0.0
   apache-airflow-providers-postgres==5.5.1
   apache-airflow-providers-redis==3.2.1
   apache-airflow-providers-sendgrid==3.2.1
   apache-airflow-providers-sftp==4.3.1
   apache-airflow-providers-slack==7.3.1
   apache-airflow-providers-snowflake==4.2.0
   apache-airflow-providers-sqlite==3.4.2
   apache-airflow-providers-ssh==3.7.1
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   k8s version - 1.23 hosted using EKS
   python 3.8
   I have upgraded the apache-airflow-providers-cncf-kubernetes to ensure that 
the bug has not been fixed in the newer versions.
   
   ### Anything else
   
   I think this issue is because of the kubernetes 'Watch().stream''s 
APIException not being handled. According to its docs - 
   ```
   Note that watching an API resource can expire. The method tries to
           resume automatically once from the last result, but if that last 
result
           is too old as well, an `ApiException` exception will be thrown with
           ``code`` 410. In that case you have to recover yourself, probably
           by listing the API resource to obtain the latest state and then
           watching from that state on by setting ``resource_version`` to
           one returned from listing.
   ```
   this error needs to be handled by Airflow and not by the kubernetes client.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to