Hi Jeff,
I think I see what you're saying. I was thinking more of a whole Spark
job, where `spark-submit` is run once to completion and then started up
again, rather than a "job" as seen in the Spark UI. I take it there is no
implicit caching of results between `spark-submit` runs.
(In the case
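That matches my understanding: nothing computed in one `spark-submit` run is implicitly reused by the next. If results do need to survive across runs, one option is to persist them to durable storage explicitly. Here is a minimal sketch, assuming a live `SparkContext` named `sc` and a hypothetical output path `/tmp/grouped`:

```scala
// Run 1: compute the grouped RDD and write it to durable storage.
// (`sc` is the SparkContext provided by spark-shell or your application;
//  the path "/tmp/grouped" is just an illustrative placeholder.)
val grouped = sc.parallelize(1 to 10).map(e => (e, e)).groupByKey()
grouped.mapValues(_.toList).saveAsObjectFile("/tmp/grouped")

// Run 2 (a separate spark-submit): read the saved result back instead
// of recomputing the shuffle from scratch.
val reloaded = sc.objectFile[(Int, List[Int])]("/tmp/grouped")
```

The same idea works with any durable format (`saveAsTextFile`, Parquet via DataFrames, etc.); the point is only that cross-run reuse has to be explicit.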
Hi Eric,
If the two jobs share the same parent stages, those stages can be skipped for
the second job.
Here's one simple example:
val rdd1 = sc.parallelize(1 to 10).map(e => (e, e))
val rdd2 = rdd1.groupByKey()
rdd2.map(e => e._1).collect() foreach println
rdd2.map(e => (e._1, e._2.size)).collect() foreach println
Hi,
I'm noticing that a 30-minute job that was initially IO-bound may no longer
be IO-bound during subsequent runs. Is there some kind of between-job caching
that happens in Spark or in Linux that outlives jobs and that might be making
subsequent runs faster? If so, is there a way to avoid the caching in order
to get consistent timings?