I agree with Reynold. We don't need to use a separate pool, which would have the problem you raised about FIFO. We just need to do the planning outside of the scheduler loop. The call site thread sounds like a reasonable place to me.
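
Roughly, the shape I have in mind is something like the simplified sketch below. This is not actual Spark code; PreparedStages, ExecutePreparedJob, prepareStages and EventLoop are made-up stand-ins for the real internals (JobSubmitted, createResultStage, handleJobSubmitted, DAGSchedulerEventProcessLoop), just to show where the expensive work runs.

import java.util.concurrent.LinkedBlockingQueue

// Events handled by the single-threaded loop. ExecutePreparedJob already
// carries the planning output, so handling it is cheap.
case class PreparedStages(description: String)

sealed trait SchedulerEvent
case class ExecutePreparedJob(jobId: Int, stages: PreparedStages) extends SchedulerEvent
case class TaskCompletion(taskId: Long) extends SchedulerEvent

// Stand-in for DAGSchedulerEventProcessLoop: one consumer thread draining a queue.
class EventLoop {
  private val queue = new LinkedBlockingQueue[SchedulerEvent]()

  def post(event: SchedulerEvent): Unit = queue.put(event)

  new Thread(() => {
    while (true) {
      queue.take() match {
        case ExecutePreparedJob(jobId, stages) =>
          println(s"registering job $jobId: ${stages.description}") // cheap bookkeeping only
        case TaskCompletion(taskId) =>
          println(s"task $taskId finished") // no longer stuck behind planning
      }
    }
  }, "event-loop").start()
}

object Scheduler {
  private val loop = new EventLoop

  // Runs on whatever thread calls submitJob (the "call site thread"), so the
  // slow dependency/partition calculation never blocks the event loop.
  def submitJob(jobId: Int): Unit = {
    val stages = prepareStages(jobId)            // expensive planning, off the loop
    loop.post(ExecutePreparedJob(jobId, stages)) // lightweight event carrying the result
  }

  private def prepareStages(jobId: Int): PreparedStages = {
    Thread.sleep(100)                            // stands in for slow stage/partition work
    PreparedStages(s"stages for job $jobId")
  }
}

The same shape works whether the prep runs synchronously at the call site or on a separate thread; the difference is only whether submissions stay FIFO, which is why the call site thread seems like the simpler choice.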
On Mon, Mar 5, 2018 at 12:56 PM, Reynold Xin <r...@databricks.com> wrote:
> Rather than using a separate thread pool, perhaps we can just move the
> prep code to the call site thread?
>
> On Sun, Mar 4, 2018 at 11:15 PM, Ajith shetty <ajith.she...@huawei.com>
> wrote:
>
>> DAGScheduler becomes a bottleneck in a cluster when multiple JobSubmitted
>> events have to be processed, because DAGSchedulerEventProcessLoop is single
>> threaded and it blocks other events in the queue, such as TaskCompletion.
>>
>> The JobSubmitted event is time consuming depending on the nature of the
>> job (for example: calculating parent stage dependencies, shuffle
>> dependencies, partitions) and thus it blocks all other events from being
>> processed.
>>
>> I see multiple JIRAs referring to this behavior:
>>
>> https://issues.apache.org/jira/browse/SPARK-2647
>>
>> https://issues.apache.org/jira/browse/SPARK-4961
>>
>> Similarly, in my cluster some jobs' partition calculation is time
>> consuming (similar to the stack at SPARK-2647), hence it slows down the
>> Spark DAGSchedulerEventProcessLoop. This causes user jobs to slow down
>> even if their tasks finish within seconds, because TaskCompletion events
>> are processed at a slower rate due to the blockage.
>>
>> I think we can split a JobSubmitted event into 2 events:
>>
>> Step 1. JobSubmittedPreparation - runs in a separate thread on job
>> submission; this involves org.apache.spark.scheduler.DAGScheduler#createResultStage
>>
>> Step 2. JobSubmittedExecution - if Step 1 succeeds, fire an event to
>> DAGSchedulerEventProcessLoop and let it process the output of
>> org.apache.spark.scheduler.DAGScheduler#createResultStage
>>
>> One effect of doing this is that job submissions may no longer be FIFO,
>> depending on how much time Step 1 above takes.
>>
>> Does the above solution suffice for the problem described? And are there
>> any other side effects of this solution?
>>
>> Regards
>>
>> Ajith

--
Ryan Blue
Software Engineer
Netflix