So how do these get created, and are we really handling them correctly?
 What is prompting my questions is that I'm looking at making sure that the
various data structures in the DAGScheduler shrink when appropriate instead
of growing without bounds.  Jobs with no partitions and the "zero split
job" test in the DAGSchedulerSuite really throw a wrench into the works.
 That's because in the DAGScheduler we go part way along in handling this
weird case as though it were a normal job submission: we start initializing
or adding to various data structures, etc., and then we pretty much bail out
in submitMissingTasks when we find out that there actually are no tasks to
be done.  We remove the stage from the set of running stages, but we don't
ever clean up pendingTasks, activeJobs, stageIdToStage, stageToInfos, and
others.  Because no tasks are ever submitted for the stage, there are
never any completion events, nor is the stage aborted -- i.e. the normal
paths to cleanup are never taken.  The end result is that shuffleMap stages
with no partitions (can these even occur?) never complete, and jobs with
no partitions would also seem to persist forever.

In short, RDDs with no partitions do really weird things to the
DAGScheduler.

So, if there is no way to effectively prevent the creation of RDDs with no
partitions, is there any reason why we can't short-circuit their handling
within the DAGScheduler so that data structures are never built or
populated for these weird things, or must we add a bunch of special-case
cleanup code to submitMissingTasks?
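
To make the two options concrete, here is a toy model of the bookkeeping
problem.  It is only a sketch in Python, not the actual DAGScheduler code:
the class, its methods, and the short_circuit flag are all hypothetical, and
it assumes one stage per job.  It only mimics the pattern described above,
where state is registered before we discover that there are no tasks.

```python
class ToyScheduler:
    """Toy model of the bookkeeping described above; not Spark's DAGScheduler."""

    def __init__(self, short_circuit):
        self.short_circuit = short_circuit  # hypothetical switch for the proposed fix
        self.active_jobs = set()
        self.stage_id_to_stage = {}         # stage id -> number of partitions
        self.pending_tasks = {}             # stage id -> set of unfinished task ids
        self.running = set()
        self.next_id = 0

    def submit_job(self, num_partitions):
        if self.short_circuit and num_partitions == 0:
            # Proposed behaviour: nothing to compute, so register nothing.
            return None
        stage_id = self.next_id
        self.next_id += 1
        # State is registered before we know whether any tasks exist.
        self.active_jobs.add(stage_id)
        self.stage_id_to_stage[stage_id] = num_partitions
        self.pending_tasks[stage_id] = set()
        self.running.add(stage_id)
        self._submit_missing_tasks(stage_id)
        return stage_id

    def _submit_missing_tasks(self, stage_id):
        tasks = set(range(self.stage_id_to_stage[stage_id]))
        if not tasks:
            # The bail-out path: only the running set is cleaned up.
            self.running.discard(stage_id)
            return
        self.pending_tasks[stage_id] = tasks

    def task_completed(self, stage_id, task):
        # The normal cleanup path, reachable only through completion events.
        pending = self.pending_tasks[stage_id]
        pending.discard(task)
        if not pending:
            self.active_jobs.discard(stage_id)
            self.stage_id_to_stage.pop(stage_id, None)
            self.pending_tasks.pop(stage_id, None)
            self.running.discard(stage_id)

    def tracked_entries(self):
        return (len(self.active_jobs) + len(self.stage_id_to_stage)
                + len(self.pending_tasks) + len(self.running))
```

In this model, a zero-partition job submitted without the short-circuit
leaves entries behind that nothing ever removes, while the short-circuited
version never creates them.  A real short-circuit would presumably still
have to report the job as finished to whoever is waiting on it, which the
sketch ignores.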
