I agree with Reynold. We don't need to use a separate pool, which would
have the problem you raised about FIFO. We just need to do the planning
outside of the scheduler loop. The call site thread sounds like a
reasonable place to me.

On Mon, Mar 5, 2018 at 12:56 PM, Reynold Xin <r...@databricks.com> wrote:

> Rather than using a separate thread pool, perhaps we can just move the
> prep code to the call site thread?
>
>
> On Sun, Mar 4, 2018 at 11:15 PM, Ajith shetty <ajith.she...@huawei.com>
> wrote:
>
>> DAGScheduler becomes a bottleneck in cluster when multiple JobSubmitted
>> events has to be processed as DAGSchedulerEventProcessLoop is single
>> threaded and it will block other tasks in queue like TaskCompletion.
>>
>> The JobSubmitted event is time consuming depending on the nature of the
>> job (Example: calculating parent stage dependencies, shuffle dependencies,
>> partitions) and thus it blocks all the events to be processed.
>>
>>
>>
>> I see multiple JIRA referring to this behavior
>>
>> https://issues.apache.org/jira/browse/SPARK-2647
>>
>> https://issues.apache.org/jira/browse/SPARK-4961
>>
>>
>>
>> Similarly in my cluster some jobs partition calculation is time consuming
>> (Similar to stack at SPARK-2647) hence it slows down the spark
>> DAGSchedulerEventProcessLoop which results in user jobs to slowdown, even
>> if its tasks are finished within seconds, as TaskCompletion Events are
>> processed at a slower rate due to blockage.
>>
>>
>>
>> I think we can split a JobSubmitted Event into 2 events
>>
>> Step 1. JobSubmittedPreperation - Runs in separate thread on
>> JobSubmission, this will involve steps org.apache.spark.scheduler.DAG
>> Scheduler#createResultStage
>>
>> Step 2. JobSubmittedExecution - If Step 1 is success, fire an event to
>> DAGSchedulerEventProcessLoop and let it process output of
>> org.apache.spark.scheduler.DAGScheduler#createResultStage
>>
>>
>>
>> I can see the effect of doing this may be that Job Submissions may not be
>> FIFO depending on how much time Step 1 mentioned above is going to consume.
>>
>>
>>
>> Does above solution suffice for the problem described? And is there any
>> other side effect of this solution?
>>
>>
>>
>> Regards
>>
>> Ajith
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Reply via email to