Hi,

Sorry to bother everyone over the holidays, but I have found what may be a 
bug.

I am doing "manual" streaming (see 
http://stackoverflow.com/questions/41266956/apache-spark-streaming-performance 
for the specific code): on each iteration I read an additional dataframe from 
file, union it with the previously read dataframes to create a "window", and 
then run a double aggregation on the result.
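
Roughly, the loop looks like this (a minimal sketch only; the file paths, the 
"key"/"value" columns and the exact aggregations are made up and are not the 
code from the Stack Overflow question):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object ManualWindow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("manual-window").getOrCreate()

    var window: Option[DataFrame] = None

    for (batch <- 0 until 10) {
      // read one more dataframe from file on every iteration
      val df = spark.read.parquet(s"/data/batch_$batch")

      // union it with everything read so far to build the growing "window"
      window = Some(window.map(_.union(df)).getOrElse(df))

      // double aggregation over the whole window:
      // per-key sums first, then an average over those sums
      val perKey = window.get.groupBy("key").agg(sum("value").as("total"))
      val result = perKey.agg(avg("total").as("avgTotal"))

      result.show()
    }
  }
}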
Having looked at the documentation 
(https://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose,
 the paragraph just above that heading, which notes that Spark automatically 
persists some intermediate data in shuffle operations), I expected Spark to 
automatically cache the partial aggregation for each dataframe read and to 
continue the aggregation from there. Instead it seems to read every dataframe 
from file all over again on each iteration.
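
To make that expectation concrete, this is more or less the caching I assumed 
Spark was doing for me behind the scenes, written out by hand (again just a 
sketch, reusing the made-up spark session and columns from the snippet above):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

var partials: List[DataFrame] = Nil

for (batch <- 0 until 10) {
  // cache this batch's partial aggregation explicitly
  val partial = spark.read.parquet(s"/data/batch_$batch")
    .groupBy("key").agg(sum("value").as("total"))
    .cache()

  partials = partial :: partials

  // combining the cached partials should not touch the raw files again
  val merged = partials.reduce(_ union _)
    .groupBy("key").agg(sum("total").as("total"))
  val result = merged.agg(avg("total").as("avgTotal"))
  result.show()
}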
Is this a bug? Am I doing something wrong?

Thanks.
                Assaf.



