I have a recursive algorithm that performs a few jobs on successively
smaller RDDs, and then a few more jobs on successively larger RDDs as the
recursion unwinds, resulting in a somewhat deeply nested RDD lineage (a
few dozen levels).
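
For concreteness, the shape of the recursion is roughly the following (a
minimal sketch: the function name, the per-level operations, the key/value
types, and the depth-based base case are all stand-ins for my actual
logic, but the point is that each level shuffles on the way down and
again on the way up):

    import org.apache.spark.rdd.RDD

    // Sketch only: each level runs a shuffle on a smaller RDD on the way
    // down, then another shuffle on a larger RDD as the recursion
    // unwinds, so the final RDD's lineage nests a few dozen levels deep.
    def process(rdd: RDD[(Long, Double)], depth: Int): RDD[(Long, Double)] = {
      if (depth == 0) {
        rdd
      } else {
        val smaller = rdd.reduceByKey(_ + _)    // job on a smaller RDD
        val inner = process(smaller, depth - 1)
        inner.union(rdd).reduceByKey(_ + _)     // job on a larger RDD
      }
    }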

I am observing significant delays starting jobs while the
MapOutputTrackerMaster calculates the sizes of the output statuses for all
previous shuffles. By the end of my algorithm's execution, the driver
spends about a minute doing this before each job, during which time my
entire cluster is sitting idle. This output-status info is the same every
time it is recomputed; no executors have joined or left the cluster.

In this gist
<https://gist.github.com/ryan-williams/445ef8736a688bd78edb#file-job-108>
you can see two jobs stalling for almost a minute each between "Starting
job:" and "Got job"; with larger input datasets my RDD lineages and this
latency would presumably only grow.

Additionally, the "DAG Visualization" on the job page of the web UI shows a
huge horizontal-scrolling lineage of thousands of stages, indicating that
the driver is tracking far more information than would seem necessary.

I'm assuming the short answer is that I need to truncate RDDs' lineage, and
the only way to do that is by checkpointing them to disk. I've done that
and it avoids this issue, but means that I am now serializing my entire
dataset to disk dozens of times during the course of execution, which feels
unnecessary/wasteful.
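
Concretely, the checkpoint-based workaround looks roughly like this (a
sketch: the checkpoint directory, storage level, and helper name are
placeholders, and `sc` is the SparkContext):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    sc.setCheckpointDir("hdfs:///tmp/ckpt")  // placeholder path

    // Truncate an RDD's lineage by persisting and checkpointing it.
    // checkpoint() only marks the RDD; the actual write to the checkpoint
    // dir happens after the next job that computes it, hence the count()
    // below. Persisting first avoids recomputing the lineage during that
    // write.
    def truncate[T](rdd: RDD[T]): RDD[T] = {
      rdd.persist(StorageLevel.MEMORY_AND_DISK)
      rdd.checkpoint()
      rdd.count()
      rdd
    }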

Is there a better way to deal with this scenario?

Thanks,

-Ryan
