[ https://issues.apache.org/jira/browse/AIRFLOW-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
t oo updated AIRFLOW-6388:
--------------------------

    Description:
        Spark jobs can often take many minutes (or even hours) to complete.

        The SparkSubmitOperator submits a job to a Spark cluster, then continually polls its status until it detects that the Spark job has ended. This means it could be consuming a 'slot' (i.e. parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count) for hours while it is not 'doing' anything but polling for status.

        https://github.com/apache/airflow/pull/6909#discussion_r361838225 suggested it should move to a poke/reschedule model.

        Another thing to note: in cluster mode, a spark-submit made to a 'full' Spark cluster will sit in the WAITING state on the Spark side until some cores/memory are freed; only then can the driver/app go into RUNNING.

        "This actually means occupy worker and do nothing for n seconds, is it not? It was OK when it was 1 second, but users may set it to even 5 min without realising that it occupies the worker. My comment here is more of a concern rather than an action to do. Should this work by occupying the worker "indefinitely" or can it be something like the sensors with (poke/reschedule)?"

    was:
        Spark jobs can often take many minutes (or even hours) to complete.

        The SparkSubmitOperator submits a job to a Spark cluster, then continually polls its status until it detects that the Spark job has ended. This means it could be consuming a 'slot' (i.e. parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count) for hours while it is not 'doing' anything but polling for status.

        https://github.com/apache/airflow/pull/6909#discussion_r361838225 suggested it should move to a poke/reschedule model.

        "This actually means occupy worker and do nothing for n seconds, is it not? It was OK when it was 1 second, but users may set it to even 5 min without realising that it occupies the worker. My comment here is more of a concern rather than an action to do.
        Should this work by occupying the worker "indefinitely" or can it be something like the sensors with (poke/reschedule)?"


> SparkSubmitOperator polling should not 'consume' a slot
> -------------------------------------------------------
>
>                 Key: AIRFLOW-6388
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6388
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: dependencies, scheduler
>    Affects Versions: 1.10.3
>            Reporter: t oo
>            Priority: Minor
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
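As a rough illustration of the poke/reschedule model suggested above: instead of blocking in-process while the Spark job runs, a sensor-style check returns a boolean per poll, and under Airflow's reschedule mode a False return frees the worker slot until the next poke interval. This is a minimal sketch under stated assumptions — `poke`, `get_status`, and the state names are illustrative, not the actual SparkSubmitOperator/Hook API:

```python
# Hypothetical poke-style status check for a submitted Spark app.
# Names here (poke, get_status, the state strings) are illustrative
# assumptions, not the real Airflow SparkSubmitOperator interface.

TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED", "ERROR"}


def poke(get_status, app_id):
    """Return True once the Spark app reaches a terminal state.

    A sensor running in reschedule mode that returns False is taken
    off the worker and re-queued for the next poke interval, so no
    slot is occupied while the job is still WAITING or RUNNING.
    """
    return get_status(app_id) in TERMINAL_STATES
```

A sensor wrapping such a check with reschedule mode would poll at its poke interval without holding a slot between pokes, which is the behaviour the discussion above asks for.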