Yeah, they're all skipped. Here's a GIF of scrolling through the DAG viz:
<http://f.cl.ly/items/413l3k363u290U173W00/Screen%20Recording%202016-01-23%20at%2005.08%20PM.gif>

Thanks for the JIRA pointer, I'll keep an eye on that one!

On Sat, Jan 23, 2016 at 4:53 PM Mark Hamstra <m...@clearstorydata.com>
wrote:

> Do all of those thousands of Stages end up being actual Stages that need
> to be computed, or are the vast majority of them eventually "skipped"
> Stages?  If the latter, then there is the potential to modify the
> DAGScheduler to avoid much of this behavior:
> https://issues.apache.org/jira/browse/SPARK-10193
> https://github.com/apache/spark/pull/8427
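>
> For context on the distinction: a "skipped" Stage is one whose shuffle map
> output already exists, so it doesn't need to be recomputed. A minimal
> spark-shell illustration, with made-up data:
>
>   val data = sc.parallelize(1L to 1000000L).map(i => (i % 100, 1L))
>   val shuffled = data.reduceByKey(_ + _)
>   shuffled.count()    // first job computes the shuffle map stage
>   shuffled.collect()  // second job reuses that map output; the web UI
>                       // lists the stage under "Skipped Stages"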
>
> On Sat, Jan 23, 2016 at 1:40 PM, Ryan Williams <
> ryan.blake.willi...@gmail.com> wrote:
>
>> I have a recursive algorithm that performs a few jobs on successively
>> smaller RDDs, and then a few more jobs on successively larger RDDs as the
>> recursion unwinds, resulting in a somewhat deeply-nested (a few dozen
>> levels) RDD lineage.
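>>
>> To make the shape concrete, the recursion looks roughly like this (the
>> names, types, and operations are simplified stand-ins, not my actual code):
>>
>>   import org.apache.spark.rdd.RDD
>>
>>   def process(rdd: RDD[(Long, Double)], depth: Int): RDD[(Long, Double)] = {
>>     if (depth == 0) {
>>       rdd
>>     } else {
>>       // on the way down: a job on a successively smaller RDD
>>       val smaller = rdd.filter { case (k, _) => k % 2 == 0 }.reduceByKey(_ + _)
>>       smaller.count()
>>       val below = process(smaller, depth - 1)
>>       // on the way back up: a job on a successively larger RDD, so the
>>       // final RDD's lineage spans every level of the recursion
>>       val larger = rdd.union(below).reduceByKey(_ + _)
>>       larger.count()
>>       larger
>>     }
>>   }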
>>
>> I am observing significant delays starting jobs while the
>> MapOutputTrackerMaster calculates the sizes of the output statuses for all
>> previous shuffles. By the end of my algorithm's execution, the driver
>> spends about a minute doing this before each job, during which time my
>> entire cluster is sitting idle. This output-status info is the same every
>> time it is computed; no executors have joined or left the cluster.
>>
>> In this gist
>> <https://gist.github.com/ryan-williams/445ef8736a688bd78edb#file-job-108>
>> you can see two jobs stalling for almost a minute each between "Starting
>> job:" and "Got job"; with larger input datasets my RDD lineages and this
>> latency would presumably only grow.
>>
>> Additionally, the "DAG Visualization" on the job page of the web UI shows
>> a huge horizontal-scrolling lineage of thousands of stages, indicating that
>> the driver is tracking far more information than would seem necessary.
>>
>> I'm assuming the short answer is that I need to truncate RDDs' lineage,
>> and the only way to do that is by checkpointing them to disk. I've done
>> that, and it avoids this issue, but it means that I am now serializing my
>> entire dataset to disk dozens of times during the course of execution,
>> which feels unnecessary/wasteful.
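>>
>> For reference, the checkpoint-based workaround looks roughly like this in
>> my code (the directory and variable names are just illustrative):
>>
>>   // once, up front:
>>   sc.setCheckpointDir("hdfs:///tmp/my-app-checkpoints")
>>
>>   // at each level of the recursion:
>>   val smaller = rdd.filter { case (k, _) => k % 2 == 0 }.reduceByKey(_ + _)
>>   smaller.persist()     // avoid recomputing it for the checkpoint write
>>   smaller.checkpoint()  // after the first job that materializes it, Spark
>>                         // writes it out and truncates its lineage here
>>   smaller.count()       // triggers that first job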
>>
>> Is there a better way to deal with this scenario?
>>
>> Thanks,
>>
>> -Ryan
>>
>
>
