Hi Miguel,

On Sun, Jan 6, 2019 at 11:35 AM Miguel F. S. Vasconcelos
<miguel.vasconce...@usp.br> wrote:
> When an action is performed on an RDD, Spark sends it as a job to the
> DAGScheduler;
> The DAGScheduler computes the execution DAG based on the RDD's lineage, and
> splits the job into stages (using wide dependencies);
> The resulting stages are transformed into a set of tasks, which are sent to
> the TaskScheduler;
> The TaskScheduler sends the set of tasks to the executors, where they run.
>
> Is this flow correct?

Yes, more or less, though that's really just the beginning. Then there is an
endless back-and-forth as executors complete tasks and send info back to the
driver; the driver updates some state and perhaps just launches more tasks in
the existing tasksets, or creates more, or finishes jobs, etc.

> And are the jobs discovered during the application execution and sent
> sequentially to the DAGScheduler?

Yes, there are very specific apis to create a job -- in the user guide, these
are called "actions". Most of these are blocking, e.g. when the user calls
rdd.count(), a job is created, potentially with a very long lineage and many
stages, and only when the entire job is completed does rdd.count() return.
There are a few async versions (e.g. countAsync()), but from what I've seen,
the more common way to submit multiple concurrent jobs is to use the regular
apis from multiple threads.

> Regarding this part, "finds a minimal schedule to run the job", I have not
> found this algorithm for getting the minimal schedule. Can you help me?

I think it's just using narrow dependencies, and re-using existing cached
data & shuffle data whenever possible. That's implicit in the DAGScheduler &
RDD code.

> I'm in doubt whether the scheduling is at task level, job level, or both.
> These scheduling modes, FIFO and FAIR, are for tasks or jobs?

FIFO and FAIR only matter if you've got multiple concurrent jobs. But then
they control how you schedule tasks *within* those jobs.
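To make that concrete, here's a toy, Spark-free simulation of the difference
(this is illustrative only -- the function and data are made up, not Spark's
actual scheduler code): with two concurrent jobs, FIFO drains the
earlier-submitted job's tasks first, while FAIR round-robins task slots
across the jobs.

```python
# Toy model: FIFO vs FAIR ordering of tasks from concurrent jobs.
# All names here are hypothetical, not Spark classes or APIs.
from collections import deque

def launch_order(pending, mode):
    """pending: {job_id: deque of task names}; returns the task launch order."""
    order = []
    rr = deque(sorted(pending))  # round-robin cursor for FAIR mode
    while any(pending.values()):
        if mode == "FIFO":
            # always take from the earliest-submitted job that still has tasks
            job = next(j for j in sorted(pending) if pending[j])
        else:  # FAIR: rotate through jobs, skipping drained ones
            while not pending[rr[0]]:
                rr.rotate(-1)
            job = rr[0]
            rr.rotate(-1)
        order.append(pending[job].popleft())
    return order

jobs = lambda: {1: deque(["a1", "a2", "a3"]), 2: deque(["b1", "b2", "b3"])}
print(launch_order(jobs(), "FIFO"))  # ['a1', 'a2', 'a3', 'b1', 'b2', 'b3']
print(launch_order(jobs(), "FAIR"))  # ['a1', 'b1', 'a2', 'b2', 'a3', 'b3']
```

In real Spark the FAIR mode also involves configurable pools with weights and
minimum shares, which this sketch ignores.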
When a job is submitted, the DAGScheduler will still compute the DAG and the
next taskset to run for that job, even if the entire cluster is busy. But
then, as resources free up, it needs to decide which job to submit tasks
from. Note you might have 2 active jobs with 5 stages ready to submit tasks,
and another 20 stages still waiting for their dependencies to be computed,
further down the pipeline for those jobs.

> Also, as the TaskScheduler is an interface, it is possible to "plug"
> different scheduling algorithms into it, correct?

Yes, though Spark itself only has one implementation (I believe Facebook has
their own, not sure of others?). There is an ExternalClusterManager api to
let users plug in their own.

> But what about the DAGScheduler, is there any interface that allows
> plugging different scheduling algorithms into it?

There is no interface currently.

> In the video "Introduction to AmpLab Spark Internals" it's said that
> pluggable inter-job scheduling is a possible future extension. Does anyone
> know if this has already been addressed?

Don't think so.

> I'm starting a master's degree and I'd really like to contribute to Spark.
> Are there suggestions of issues in the Spark scheduling that could be
> addressed?

There is lots to do, but this is tough to answer; it depends on your
interests in particular. Also, to be honest, the work that needs to be done
often doesn't align well with the kind of work you need to do for a research
project. For example, adding existing features into cluster managers (e.g.
kubernetes), or adding tests & chasing down concurrency bugs, might not
interest a student. OTOH, if you created an entirely different implementation
of the DAGScheduler and did some tests on its properties under various loads
-- that would be really interesting, but it's also unlikely to get accepted,
given the quality of code that normally comes from research projects, and
without finding folks in the community that understand it well and are ready
to maintain it.
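If you do dig into the DAGScheduler, the core idea from the top of this
thread -- splitting a lineage into stages at wide (shuffle) dependencies --
is worth internalizing first. Here's a minimal, Spark-free sketch of it; the
classes and the toy word-count lineage are made up for illustration, not
Spark's real code:

```python
# Toy model of stage splitting: narrow dependencies are pipelined into the
# same stage, every wide (shuffle) dependency starts a new, earlier stage.
from dataclasses import dataclass, field

@dataclass
class Rdd:
    name: str
    deps: list = field(default_factory=list)  # list of (parent Rdd, is_wide)

def split_into_stages(final_rdd):
    """Walk the lineage backwards from the final RDD, cutting at wide deps."""
    stages = []
    def build(rdd):
        stage, stack = [], [rdd]
        while stack:
            r = stack.pop()
            stage.append(r.name)
            for parent, is_wide in r.deps:
                if is_wide:
                    build(parent)          # shuffle boundary: earlier stage
                else:
                    stack.append(parent)   # narrow dep: same stage
        stages.append(stage)               # upstream stages land first
    build(final_rdd)
    return stages

# Toy lineage for a word count: textFile -> flatMap -> map -> reduceByKey.
lines  = Rdd("lines")
words  = Rdd("words",  [(lines, False)])   # flatMap: narrow
pairs  = Rdd("pairs",  [(words, False)])   # map: narrow
counts = Rdd("counts", [(pairs, True)])    # reduceByKey: wide

stages = split_into_stages(counts)
print(stages)  # [['pairs', 'words', 'lines'], ['counts']]
```

The real DAGScheduler additionally skips recomputing stages whose shuffle
output already exists, which is the "minimal schedule" re-use mentioned
above; this sketch leaves that out.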
You can search for jiras with component "Scheduler":
https://issues.apache.org/jira/issues/?jql=component%3Dscheduler%20and%20project%3DSPARK

There was some high-level discussion a while back about a set of changes we
might consider making to the scheduler, particularly for dealing with
failures on large clusters, but this never really picked up steam. It might
be interesting for a research project:
https://issues.apache.org/jira/browse/SPARK-20178