MammutMKII opened a new issue #19384: URL: https://github.com/apache/airflow/issues/19384
### Description

Add an optional retry loop to LivyOperator.poll_for_termination() or LivyHook.get_batch_state() to improve resiliency against temporary errors. The retry counter should reset with successful requests.

### Use case/motivation

1. Using LivyOperator, we run a Spark Streaming job in a cluster behind Knox with LDAP authentication.
2. While the streaming job is running, LivyOperator keeps polling for termination.
3. In our case, the LDAP service might be unavailable for a few of the polling requests per day, resulting in Knox returning an error.
4. LivyOperator marks the task as failed even though the streaming job should still be running, as subsequent polling requests might have revealed.
5. We would like LivyOperator/LivyHook to send a number of retries in order to overcome those brief availability issues.

Workarounds we considered:

- Increase the polling interval to reduce the chance of running into an error. For reference, we are currently using an interval of 10s.
- Use BaseOperator retries to start a new job and only send a notification email for the final failure. But this would start a new job unnecessarily.
- Activate Knox authentication caching to decrease the chance of errors substantially, but it was causing issues not related to Airflow.

### Related issues

No related issues were found.

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
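For illustration, the retry behaviour requested in the issue (tolerate a few failed polling requests in a row, reset the counter on every successful one) could be sketched roughly as below. This is not the existing LivyOperator or LivyHook code. The function `poll_with_retries`, its parameters, and the state names are hypothetical, and `get_state` stands in for whatever call fetches the batch state.

```python
import time
from typing import Callable, Collection


def poll_with_retries(
    get_state: Callable[[], str],
    terminal_states: Collection[str],
    max_consecutive_errors: int = 3,
    polling_interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns a terminal state.

    Hypothetical helper: tolerates up to max_consecutive_errors failed
    requests in a row; any successful request resets the counter.
    """
    consecutive_errors = 0
    while True:
        try:
            state = get_state()
        except Exception:
            consecutive_errors += 1
            if consecutive_errors > max_consecutive_errors:
                raise  # too many failures in a row: give up
        else:
            consecutive_errors = 0  # success resets the retry counter
            if state in terminal_states:
                return state
        sleep(polling_interval)


def make_fake_get_state(responses):
    """Build a stand-in for the real state request (illustrative only)."""
    remaining = iter(responses)

    def fake_get_state() -> str:
        item = next(remaining)
        if isinstance(item, Exception):
            raise item  # simulate a temporary gateway/LDAP error
        return item

    return fake_get_state


# Illustrative state names; the real set of terminal states is an assumption.
TERMINAL = {"success", "dead", "killed", "error"}
```

With `max_consecutive_errors=2`, a sequence such as running, error, error, running, error, success would still end in `success`, because the successful request in the middle resets the counter. Three errors in a row would re-raise the last exception and let the task fail as it does today.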
