[PR] Spark: track YARN cluster apps via RM REST API to free JVM [airflow]

via GitHub Sun, 31 May 2026 23:20:30 -0700


nailo2c opened a new pull request, #65991:
URL: https://github.com/apache/airflow/pull/65991


   closes: #24171
   
   # Why
   
   In YARN cluster mode, Airflow keeps a local `spark-submit` JVM alive for 
each running Spark task to monitor completion.
   
   At scale, these long-lived JVMs can consume significant Airflow worker 
memory.
   
   # How
   
   Added an opt-in `yarn_track_via_rm_api` flag that releases the local 
spark-submit JVM after YARN submission and tracks completion via the YARN 
ResourceManager REST API.
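
The tracking step described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `extract_app_id`, `fetch_app`, and `wait_for_app` are hypothetical names, and the sketch assumes the application ID is read from the `spark-submit` log output. The endpoint used is the ResourceManager's Cluster Application API (`/ws/v1/cluster/apps/{appid}`), whose response carries `state` and `finalStatus` fields.

```python
import json
import re
import time
import urllib.request
from typing import Callable, Optional

# Terminal values of the RM "state" field; any other value means "still in progress".
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}

# YARN application IDs look like application_<cluster timestamp>_<sequence>.
APP_ID_RE = re.compile(r"application_\d+_\d+")


def extract_app_id(log_line: str) -> Optional[str]:
    """Return the first YARN application ID found in a log line, or None."""
    match = APP_ID_RE.search(log_line)
    return match.group(0) if match else None


def fetch_app(rm_address: str, app_id: str) -> dict:
    """GET /ws/v1/cluster/apps/{app_id} and return the inner "app" object."""
    url = f"{rm_address.rstrip('/')}/ws/v1/cluster/apps/{app_id}"
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["app"]


def wait_for_app(
    rm_address: str,
    app_id: str,
    poll_interval: float = 5.0,
    fetch: Callable[[str, str], dict] = fetch_app,  # injectable for testing
    sleep: Callable[[float], None] = time.sleep,  # injectable for testing
) -> str:
    """Poll the RM until the app reaches a terminal state; return its finalStatus."""
    while True:
        app = fetch(rm_address, app_id)
        if app["state"] in TERMINAL_STATES:
            return app["finalStatus"]
        sleep(poll_interval)
```

Because only short HTTP requests are made between sleeps, no local JVM needs to stay alive while the application runs. A caller would treat any `finalStatus` other than `SUCCEEDED` as a task failure.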
   
   # What
   
   + Confirmed: the test DAG submitted the Spark job and tracked it via the RM REST API.
   <img width="1907" height="993" alt="rest_airflow_ui" src="https://github.com/user-attachments/assets/08e3466e-d266-443a-846e-7706982c6b01" />
   
   + It works as expected.
   <img width="1911" height="879" alt="rest_hadoop_ui" src="https://github.com/user-attachments/assets/07ebcb36-94f5-4c79-bce2-5511834e148d" />
   
   + Manually verified that the RM REST API can fetch the job status.
   <img width="1471" height="1001" alt="rest_breeze_to_rm_rest" src="https://github.com/user-attachments/assets/fd5a6562-352a-4b8c-8841-ca6ca12baae2" />
   
   + Test DAG
   ```python
   from datetime import datetime
   
   from airflow.models import DAG
   from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
   
   with DAG(
       dag_id="spark_yarn_repro_24171_rest",
       schedule=None,
       start_date=datetime(2026, 1, 1),
       catchup=False,
       tags=["repro", "issue-24171", "rest"],
   ):
       SparkSubmitOperator(
           task_id="spark_pi_yarn_cluster",
           application="/opt/airflow/dev/.issue-24171/spark/examples/jars/spark-examples_2.12-3.5.3.jar",
           java_class="org.apache.spark.examples.SparkPi",
           application_args=["200"],
           conn_id="spark_yarn_rm",
           deploy_mode="cluster",
           name="airflow-pi-cluster-rest",
           conf={
               "spark.executor.instances": "1",
               "spark.executor.memory": "512m",
               "spark.driver.memory": "512m",
           },
           yarn_track_via_rm_api=True,
           status_poll_interval=5,
           verbose=True,
       )
   ```
   
   + Test connection
   ```bash
   airflow connections add spark_yarn_rm \
     --conn-type spark \
     --conn-host yarn \
     --conn-extra '{
       "deploy-mode": "cluster",
       "spark-binary": "spark-submit",
       "yarn_resourcemanager_webapp_address": "http://resourcemanager:8088"
     }'
   ```
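
As a rough illustration of how the connection extras above could be turned into a status URL, the sketch below parses the extra JSON and builds the Cluster Application API path. `rm_app_url` is a hypothetical helper, not part of this PR or of Airflow; the only names taken from the PR are the `yarn_resourcemanager_webapp_address` key and the example address.

```python
import json
from urllib.parse import urlsplit


def rm_app_url(extra_json: str, app_id: str) -> str:
    """Build the RM status URL for app_id from a connection's extra JSON string."""
    extra = json.loads(extra_json)  # raises ValueError on malformed JSON
    address = extra.get("yarn_resourcemanager_webapp_address")
    if not address:
        raise ValueError("yarn_resourcemanager_webapp_address is not set")
    parts = urlsplit(address)
    # Require an explicit scheme and host so the URL is unambiguous.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an http(s) address: {address!r}")
    return f"{address.rstrip('/')}/ws/v1/cluster/apps/{app_id}"
```

Validating the address up front means a misconfigured connection fails at parse time with a clear message rather than during polling.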
   
   <br><br>
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   <!--
   If generative AI tooling has been used in the process of authoring this PR, 
please
   change below checkbox to `[X]` followed by the name of the tool, uncomment 
the "Generated-by".
   -->
   
   - [x] Yes (please specify the tool below)
   Generated-by: Claude Opus 4.7 following [the 
guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
