cw-pradyumna opened a new issue, #32744:
URL: https://github.com/apache/airflow/issues/32744
### Apache Airflow version
2.6.3
### What happened
A Spark job that runs for more than 1 hour fails with a 410 Expired error,
regardless of the actual outcome of the Spark application.
Logs:
```
[2023-07-21, 19:41:57 IST] {spark_kubernetes.py:126} INFO -
2023-07-21T14:11:57.424338279Z 23/07/21 19:41:57 INFO MetricsSystemImpl:
s3a-file-system metrics system stopped.
[2023-07-21, 19:41:57 IST] {spark_kubernetes.py:126} INFO -
2023-07-21T14:11:57.424355219Z 23/07/21 19:41:57 INFO MetricsSystemImpl:
s3a-file-system metrics system shutdown complete.
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor
[adhoc-d9e5ed897882bd27-exec-1] is pending
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor
[adhoc-d9e5ed897882bd27-exec-1] is pending
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor
[adhoc-d9e5ed897882bd27-exec-1] is running
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor
[adhoc-d9e5ed897882bd27-exec-1] completed
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor
[adhoc-d9e5ed897882bd27-exec-2] is pending
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor
[adhoc-d9e5ed897882bd27-exec-2] is running
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor
[adhoc-d9e5ed897882bd27-exec-3] is pending
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor
[adhoc-d9e5ed897882bd27-exec-3] is pending
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor
[adhoc-d9e5ed897882bd27-exec-4] is pending
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor
[adhoc-d9e5ed897882bd27-exec-4] is running
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor
[adhoc-d9e5ed897882bd27-exec-3] is running
[2023-07-21, 19:41:58 IST] {taskinstance.py:1824} ERROR - Task failed with
exception
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py",
line 112, in execute
for event in namespace_event_stream:
File
"/home/airflow/.local/lib/python3.8/site-packages/kubernetes/watch/watch.py",
line 182, in stream
raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: The resourceVersion for the provided watch is too old.
[2023-07-21, 19:41:58 IST] {taskinstance.py:1345} INFO - Marking task as
FAILED. dag_id=adhoc, task_id=submit_job, execution_date=20230721T125127,
start_date=20230721T125217, end_date=20230721T141158
[2023-07-21, 19:41:58 IST] {standard_task_runner.py:104} ERROR - Failed to
execute job 247326 for task submit_job ((410)
Reason: Expired: The resourceVersion for the provided watch is too old.
; 15)
[2023-07-21, 19:41:58 IST] {local_task_job_runner.py:225} INFO - Task exited
with return code 1
[2023-07-21, 19:41:58 IST] {taskinstance.py:2653} INFO - 0 downstream tasks
scheduled from follow-on schedule check
```
In this case I can see the task logs; only the final state is reported
incorrectly.
But if the task runs even longer (>4 hours), the logs are not visible either.
### What you think should happen instead
In the first case, the task should have reported the correct state of the
Spark application. In the second case, the logs should still have been visible
and the correct state of the Spark application should have been reported.
### How to reproduce
To reproduce, submit a long-running (>1 hour) task using the
SparkKubernetesOperator.
If the task completes in under 4 hours, you should see a 410 Expired error
regardless of the actual outcome of the Spark application.
If the task takes longer, you should see it fail around the 4-hour mark with
a 410, even though the Spark application is still running.
### Operating System
Debian GNU/Linux 11 (bullseye)
### Versions of Apache Airflow Providers
apache-airflow-providers-amazon==8.2.0
apache-airflow-providers-celery==3.2.1
apache-airflow-providers-cncf-kubernetes==7.3.0
apache-airflow-providers-common-sql==1.5.2
apache-airflow-providers-docker==3.7.1
apache-airflow-providers-elasticsearch==4.5.1
apache-airflow-providers-ftp==3.4.2
apache-airflow-providers-google==10.2.0
apache-airflow-providers-grpc==3.2.1
apache-airflow-providers-hashicorp==3.4.1
apache-airflow-providers-http==4.4.2
apache-airflow-providers-imap==3.2.2
apache-airflow-providers-microsoft-azure==6.1.2
apache-airflow-providers-mysql==5.1.1
apache-airflow-providers-odbc==4.0.0
apache-airflow-providers-postgres==5.5.1
apache-airflow-providers-redis==3.2.1
apache-airflow-providers-sendgrid==3.2.1
apache-airflow-providers-sftp==4.3.1
apache-airflow-providers-slack==7.3.1
apache-airflow-providers-snowflake==4.2.0
apache-airflow-providers-sqlite==3.4.2
apache-airflow-providers-ssh==3.7.1
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
k8s version - 1.23 hosted using EKS
python 3.8
I have upgraded apache-airflow-providers-cncf-kubernetes to confirm that the
bug has not been fixed in newer versions.
### Anything else
I think this issue is caused by the `ApiException` raised by the kubernetes
client's `Watch().stream` not being handled. According to its docs:
```
Note that watching an API resource can expire. The method tries to
resume automatically once from the last result, but if that last result
is too old as well, an `ApiException` exception will be thrown with
``code`` 410. In that case you have to recover yourself, probably
by listing the API resource to obtain the latest state and then
watching from that state on by setting ``resource_version`` to
one returned from listing.
```
This error needs to be handled by Airflow, not by the kubernetes client.
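A rough sketch of the recovery the docs describe (list again, then watch from the fresh `resource_version`) could look like the following. This is not the operator's actual code: `watch_with_resync` and its callback parameters are hypothetical names, and the 410 check assumes the raised exception exposes a `status` attribute, as the kubernetes client's `ApiException` does. The callbacks stand in for the real list and `Watch().stream` calls so the retry logic can be read on its own.

```python
from typing import Any, Callable, Iterable, Iterator


def watch_with_resync(
    list_resource_version: Callable[[], str],
    open_stream: Callable[[str], Iterable[Any]],
    is_terminal: Callable[[Any], bool],
    max_resyncs: int = 5,
) -> Iterator[Any]:
    """Yield watch events, re-listing and re-watching when the watch expires.

    list_resource_version: hypothetical callback that lists the resource and
        returns the resourceVersion of the listing.
    open_stream: hypothetical callback that starts a watch from the given
        resourceVersion and returns an iterable of events.
    is_terminal: returns True for the event that should end the watch.
    """
    resyncs = 0
    resource_version = list_resource_version()
    while True:
        try:
            for event in open_stream(resource_version):
                yield event
                if is_terminal(event):
                    return
            # The stream ended on its own without an error: stop watching.
            return
        except Exception as exc:
            # Only a 410 (resourceVersion too old) is recoverable here;
            # anything else, or too many resyncs, is re-raised unchanged.
            if getattr(exc, "status", None) != 410 or resyncs >= max_resyncs:
                raise
            resyncs += 1
        # Recover as the docs suggest: list again to get a fresh version.
        resource_version = list_resource_version()
```

In the operator, `open_stream` would presumably wrap the existing `Watch().stream(...)` call with `resource_version` passed through, and `list_resource_version` would wrap the matching list call; the final application state would still need to be read from the SparkApplication object itself rather than inferred from the event stream.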
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)