Do all of those thousands of Stages end up being actual Stages that need to
be computed, or are the vast majority of them eventually "skipped" Stages?
If the latter, then there is the potential to modify the DAGScheduler to
avoid much of this behavior:
https://issues.apache.org/jira/browse/SPARK-10193
https://github.com/apache/spark/pull/8427

On Sat, Jan 23, 2016 at 1:40 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:

> I have a recursive algorithm that performs a few jobs on successively
> smaller RDDs, and then a few more jobs on successively larger RDDs as the
> recursion unwinds, resulting in a somewhat deeply-nested (a few dozen
> levels) RDD lineage.
>
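> The shape of the computation is roughly the following (a simplified
> sketch with placeholder names and operations, not my actual code):
>
>   import org.apache.spark.SparkContext._
>   import org.apache.spark.rdd.RDD
>
>   // Each level shuffles down to a smaller RDD, runs a job or two on it,
>   // recurses, then shuffles back up against the original RDD as the
>   // recursion unwinds, adding a few shuffle dependencies per level.
>   def recurse(rdd: RDD[(Long, Double)], depth: Int): RDD[(Long, Double)] = {
>     if (depth == 0) {
>       rdd
>     } else {
>       val smaller = rdd.reduceByKey(_ + _)                    // shuffle on the way down
>                        .filter { case (k, _) => k % 2 == 0 }
>       val levelSize = smaller.count()                         // one of the jobs run per level
>       recurse(smaller, depth - 1)
>         .join(rdd)                                            // shuffle on the way back up
>         .mapValues { case (a, b) => a + b }
>     }
>   }
>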
> I am observing significant delays starting jobs while the
> MapOutputTrackerMaster calculates the sizes of the output statuses for all
> previous shuffles. By the end of my algorithm's execution, the driver
> spends about a minute doing this before each job, during which time my
> entire cluster is sitting idle. This output-status info is the same every
> time it is computed, since no executors have joined or left the cluster.
>
> In this gist
> <https://gist.github.com/ryan-williams/445ef8736a688bd78edb#file-job-108>
> you can see two jobs stalling for almost a minute each between "Starting
> job:" and "Got job"; with larger input datasets my RDD lineages and this
> latency would presumably only grow.
>
> Additionally, the "DAG Visualization" on the job page of the web UI shows
> a huge horizontal-scrolling lineage of thousands of stages, indicating that
> the driver is tracking far more information than would seem necessary.
>
> I'm assuming the short answer is that I need to truncate RDDs' lineage,
> and the only way to do that is by checkpointing them to disk. I've done
> that and it avoids this issue, but it means that I am now serializing my
> entire dataset to disk dozens of times over the course of execution,
> which feels unnecessary and wasteful.
>
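> For reference, the checkpointing I'm doing is roughly this (a minimal
> sketch; the directory and names are placeholders):
>
>   import org.apache.spark.rdd.RDD
>   import org.apache.spark.storage.StorageLevel
>
>   // sc is the SparkContext
>   sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // placeholder path
>
>   def truncate(rdd: RDD[(Long, Double)]): RDD[(Long, Double)] = {
>     rdd.persist(StorageLevel.MEMORY_AND_DISK)  // avoid recomputing for the checkpoint write
>     rdd.checkpoint()                           // mark the RDD for checkpointing
>     rdd.count()                                // force a job; data is written to the checkpoint dir
>     rdd                                        // lineage above this RDD is now truncated
>   }
>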
> Is there a better way to deal with this scenario?
>
> Thanks,
>
> -Ryan
>
