tokoko opened a new issue, #24171:
URL: https://github.com/apache/airflow/issues/24171

   ### Description
   
   SparkSubmitHook should track the status of YARN cluster-mode applications 
with the YARN CLI rather than rely on the spark-submit process logs. This would 
cut excessive memory usage and make it much easier to make the operator 
deferrable later on.
   
   ### Use case/motivation
   
   While running most of our Spark workloads in YARN cluster mode using 
SparkSubmitHook, we observed that Celery workers were consistently low on 
memory. The main driver of the high memory consumption was the spark-submit 
processes started by SparkSubmitHook, which took about 500 MB of memory even 
though in YARN cluster mode they do next to no actual work. We refactored the 
hook to kill the spark-submit process right after YARN accepts the application 
and to track the status with `yarn application -status` calls, similar to how 
Spark standalone mode is tracked.
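
A minimal sketch of what status tracking through `yarn application -status` could look like. This is not the refactored hook itself. The helper names (`parse_application_report`, `get_application_status`) are hypothetical, and the report layout (tab-indented `State : ...` and `Final-State : ...` lines) is an assumption about the CLI output that should be checked against the Hadoop version in use.

```python
import re
import subprocess

# Assumed terminal values of the "State" field in a YARN application report.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def parse_application_report(report: str) -> dict:
    """Extract the State and Final-State fields from the CLI report text."""
    fields = {}
    for line in report.splitlines():
        # Assumed line format: "<whitespace>State : RUNNING"
        match = re.match(r"^\s*(State|Final-State)\s*:\s*(\S+)", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def get_application_status(application_id: str) -> dict:
    """Run the YARN CLI once and return the parsed status fields."""
    result = subprocess.run(
        ["yarn", "application", "-status", application_id],
        capture_output=True,
        text=True,
        check=True,  # raise if the CLI exits with a non-zero code
    )
    return parse_application_report(result.stdout)
```

Each call starts a short-lived CLI process, so nothing has to stay resident on the worker between polls.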
   
   Another motivation for the change is to prepare the operator to be made 
deferrable later on. Polling through an external Java process that must be 
kept alive until the operator exits cannot be made deferrable. Polling with 
the YARN CLI would fit easily with how deferrable operators work.
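
To illustrate why CLI polling suits the deferrable model, here is a rough asyncio sketch that keeps no long-lived process between polls. It deliberately avoids any Airflow classes. `fetch_state` and `wait_for_terminal_state` are hypothetical names, and the `State : ...` report line and the set of terminal states are assumptions about the YARN CLI output.

```python
import asyncio

# Assumed terminal values of the "State" field in a YARN application report.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


async def fetch_state(application_id: str) -> str:
    """Run `yarn application -status` without blocking the event loop."""
    proc = await asyncio.create_subprocess_exec(
        "yarn", "application", "-status", application_id,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    for line in stdout.decode().splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "State":
            return value.strip()
    return "UNKNOWN"  # report did not contain a State line


async def wait_for_terminal_state(application_id, poll_interval=30.0, fetch=fetch_state):
    """Poll until the application reaches a terminal state and return it."""
    while True:
        state = await fetch(application_id)
        if state in TERMINAL_STATES:
            return state
        # Yield control so other coroutines can run between polls.
        await asyncio.sleep(poll_interval)
```

A trigger-style component could await `wait_for_terminal_state(...)` and emit an event with the returned state. The `fetch` parameter exists only so the loop can be exercised without a YARN installation.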
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

