Hi all, I opened a ticket (https://github.com/apache/airflow/issues/24171) a while back, and I just want to make sure it went stale deservedly :)
We used to have an issue with memory consumption on Airflow Celery workers, where tasks were often killed by the OOM killer. Most of our workload was Spark jobs running in YARN cluster mode via SparkSubmitHook. The main driver of the high memory consumption was the spark-submit processes, each of which held about 500 MB even though, in YARN cluster mode, they were doing essentially nothing.

We changed the hook to kill the spark-submit process right after YARN accepts the application, and to track the status with "yarn application -status" calls instead, similar to how Spark standalone mode is tracked today (a rough sketch of the idea is below). With that change, the OOM issues went away.

This seems like an issue that many other users with a similar usage pattern should be hitting, unless they allocate unnecessarily large amounts of memory to their Airflow workers. Has anyone else had a similar experience? Is it worth working on getting our fix into the upstream repo? Or has everyone else already switched to managed Spark services and it's just us? :)

-- Tornike
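P.S. For the curious, here is a minimal standalone sketch of the approach, not our actual patch: the function names, the regexes, and the parsing of the CLI output are simplified for illustration. The idea is to start spark-submit, capture the YARN application id from its output, terminate the local process, and then poll "yarn application -status" until a terminal state.

    import re
    import subprocess
    import time

    # Terminal YARN application states.
    TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}

    def submit_and_detach(spark_submit_cmd):
        """Start spark-submit, wait for the YARN application id to appear
        in its output, then terminate the local process to free its memory."""
        proc = subprocess.Popen(
            spark_submit_cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        app_id = None
        for line in proc.stdout:
            m = re.search(r"(application_\d+_\d+)", line)
            if m:
                app_id = m.group(1)
                break
        # Once YARN has accepted the app, the local spark-submit process is
        # just a ~500 MB bystander in cluster mode, so kill it.
        proc.terminate()
        proc.wait()
        if app_id is None:
            raise RuntimeError("YARN application id not found in spark-submit output")
        return app_id

    def wait_for_completion(app_id, poll_interval=30):
        """Poll `yarn application -status <app_id>` until a terminal state."""
        while True:
            out = subprocess.run(
                ["yarn", "application", "-status", app_id],
                capture_output=True,
                text=True,
                check=True,
            ).stdout
            m = re.search(r"^\s*State\s*:\s*(\w+)", out, re.MULTILINE)
            state = m.group(1) if m else "UNKNOWN"
            if state in TERMINAL_STATES:
                return state
            time.sleep(poll_interval)

Since SparkSubmitHook already extracts the YARN application id from the driver logs (it needs it for on_kill), the change mostly amounts to terminating the subprocess at that point and polling YARN instead of the local process.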
