So how do these get created, and are we really handling them correctly? What prompts my questions is that I'm trying to make sure the various data structures in the DAGScheduler shrink when appropriate instead of growing without bound. Jobs with no partitions, and the "zero split job" test in DAGSchedulerSuite, really throw a wrench into the works. That's because the DAGScheduler goes partway through handling this weird case as though it were a normal job submission: we start initializing or adding to various data structures, and then we pretty much bail out in submitMissingTasks when we find out that there actually are no tasks to be done. We remove the stage from the set of running stages, but we never clean up pendingTasks, activeJobs, stageIdToStage, stageToInfos, and others. No tasks are ever submitted for the stage, so there are never any completion events, nor is the stage aborted; i.e., the normal paths to cleanup are never taken. The end result is that shuffleMap stages with no partitions (can these even occur?) never complete, and jobs with no partitions would also seem to persist forever.
In short, RDDs with no partitions do really weird things to the DAGScheduler. So, if there is no way to effectively prevent the creation of RDDs with no partitions, is there any reason why we can't short-circuit their handling within the DAGScheduler so that data structures are never built or populated for these weird things? Or must we add a bunch of special-case cleanup code to submitMissingTasks?
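
To make the two options concrete, here is a toy model of the bookkeeping problem. It is only a sketch, not the actual DAGScheduler code: ToyScheduler, Job, submit, taskCompleted, trackedEntries and the shortCircuit flag are all names I made up for illustration, and the three collections merely stand in for activeJobs, pendingTasks and the running-stage set. It assumes cleanup happens only when a completion event arrives, which is the behavior described above.

```scala
import scala.collection.mutable

// Toy stand-in for the scheduler's bookkeeping. Hypothetical names throughout;
// this is NOT Spark's real API, just a model of the leak described above.
object ToyScheduler {
  final case class Job(id: Int, numPartitions: Int)

  class Scheduler(shortCircuit: Boolean) {
    val activeJobs = mutable.Set.empty[Int]
    val pendingTasks = mutable.Map.empty[Int, mutable.Set[Int]]
    val running = mutable.Set.empty[Int]

    def submit(job: Job): Unit = {
      if (shortCircuit && job.numPartitions == 0) {
        // Option 1: short-circuit, so nothing is ever registered for the job.
        ()
      } else {
        activeJobs += job.id
        pendingTasks(job.id) = mutable.Set.empty[Int]
        running += job.id
        submitMissingTasks(job)
      }
    }

    private def submitMissingTasks(job: Job): Unit = {
      val tasks = 0 until job.numPartitions
      if (tasks.isEmpty) {
        // Bail out: only the running set is cleaned up, as described above.
        running -= job.id
      } else {
        val pending = pendingTasks(job.id)
        pending ++= tasks
      }
    }

    // The normal cleanup path: it only runs when a task completion arrives.
    def taskCompleted(jobId: Int, partition: Int): Unit = {
      pendingTasks.get(jobId).foreach { pending =>
        pending -= partition
        if (pending.isEmpty) {
          pendingTasks -= jobId
          activeJobs -= jobId
          running -= jobId
        }
      }
    }

    // Total number of entries still held across all three structures.
    def trackedEntries: Int = activeJobs.size + pendingTasks.size + running.size
  }
}
```

Without the short-circuit, a zero-partition job leaves entries behind in activeJobs and pendingTasks that no later event will ever remove; with it, the structures are never touched. A real fix would presumably also have to report the empty job as finished to whoever is waiting on it, which this sketch leaves out.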
