tokoko opened a new issue, #24171: URL: https://github.com/apache/airflow/issues/24171
### Description

SparkSubmitHook should track the status of YARN cluster-mode applications with the YARN CLI rather than relying on the logs of the spark-submit process. This would reduce memory usage and make it much easier to make the operator deferrable later on.

### Use case/motivation

While running most of our Spark workloads in YARN cluster mode using SparkSubmitHook, we observed that Celery workers were consistently low on memory. The main cause of the high memory consumption was the spark-submit processes started by SparkSubmitHook, which took about 500 MB of memory even though in YARN cluster mode they do almost none of the actual work. We refactored the hook to kill the spark-submit process right after YARN accepts the application and to track the status with `yarn application -status` calls, similar to how Spark standalone mode is tracked.

Another motivation for the change is to prepare the operator to be made deferrable later on. Polling that depends on an external Java process, which must be kept alive until the operator exits, cannot be made deferrable. Polling through the YARN CLI would fit easily with how deferrable operators work.

### Related issues

_No response_

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
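For illustration, the polling approach described under "Use case/motivation" might look roughly like the sketch below. This is not the SparkSubmitHook implementation. The function names (`extract_application_id`, `parse_yarn_status`, `run_yarn_status`, `wait_for_application`) are hypothetical. The regular expressions assume that the application ID has the form `application_<timestamp>_<sequence>` and that the report printed by `yarn application -status` contains `State : ...` and `Final-State : ...` lines. Both assumptions should be checked against the Hadoop version in use.

```python
import re
import subprocess
import time
from typing import Callable, Optional, Tuple

# Assumption: YARN application IDs look like application_<clusterTimestamp>_<sequence>.
APP_ID_RE = re.compile(r"application_\d+_\d+")
# Assumption: the status report contains "State : X" and "Final-State : Y" lines.
STATE_RE = re.compile(r"^\s*State\s*:\s*(\S+)", re.MULTILINE)
FINAL_STATE_RE = re.compile(r"^\s*Final-State\s*:\s*(\S+)", re.MULTILINE)

# YARN application states after which no further polling is needed.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def extract_application_id(log_line: str) -> Optional[str]:
    """Return the first YARN application ID found in a spark-submit log line."""
    match = APP_ID_RE.search(log_line)
    return match.group(0) if match else None


def parse_yarn_status(report: str) -> Tuple[Optional[str], Optional[str]]:
    """Extract (state, final_state) from the text of a status report."""
    state = STATE_RE.search(report)
    final_state = FINAL_STATE_RE.search(report)
    return (
        state.group(1) if state else None,
        final_state.group(1) if final_state else None,
    )


def run_yarn_status(app_id: str) -> str:
    """Run `yarn application -status <app_id>` and return its stdout."""
    result = subprocess.run(
        ["yarn", "application", "-status", app_id],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def wait_for_application(
    app_id: str,
    fetch_report: Callable[[str], str] = run_yarn_status,
    poll_interval: float = 30.0,
) -> str:
    """Poll until the application reaches a terminal state.

    Returns the final state if the report has one, otherwise the state.
    """
    while True:
        state, final_state = parse_yarn_status(fetch_report(app_id))
        if state in TERMINAL_STATES:
            return final_state or state
        time.sleep(poll_interval)
```

Because each poll is a short-lived CLI call, no long-running Java process has to stay alive between polls. In a deferrable variant, the same loop could presumably live in a trigger and use a non-blocking sleep instead of `time.sleep`.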
