Shuffle results are only reused if you reuse the exact same RDD. If you are working with DataFrames that you have not explicitly cached, they will produce new RDDs each time their physical plan is created and evaluated, so you won't get implicit shuffle reuse. This is what https://issues.apache.org/jira/browse/SPARK-11838 is about.
On Mon, Dec 26, 2016 at 5:56 AM, assaf.mendelson <assaf.mendel...@rsa.com> wrote:
> Hi,
>
> Sorry to bother everyone on the holidays, but I have found what may be a bug.
>
> I am doing a "manual" streaming job (see http://stackoverflow.com/questions/41266956/apache-spark-streaming-performance for the specific code) where I essentially read an additional dataframe each time from file, union it with the previous dataframes to create a "window", and then do a double aggregation on the result.
>
> Having looked at the documentation (https://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose, right above the headline), I expected Spark to automatically cache the partial aggregation for each dataframe read and then continue with the aggregations from there. Instead it seems to read each dataframe from file all over again.
>
> Is this a bug? Am I doing something wrong?
>
> Thanks,
> Assaf.
>
> ------------------------------
> View this message in context: Shuffle intermidiate results not being cached
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Shuffle-intermidiate-results-not-being-cached-tp20358.html>
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at Nabble.com.