Hello,

One of the main "selling points" of Spark is that, unlike Hadoop MapReduce, which persists the intermediate results of its computation to HDFS (on disk), Spark keeps all of its results in memory. I don't understand this, because in reality, when a Spark stage finishes, [it writes all of its data into shuffle files stored on disk](https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md). How is this an improvement over MapReduce, then?
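For concreteness, here is a minimal word-count sketch (my own illustration, not taken from any particular source) of a job with a single shuffle boundary; as I understand it, the `reduceByKey` step is where the shuffle files get written:

```scala
import org.apache.spark.sql.SparkSession

object ShuffleBoundaryExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-boundary")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    // Stage 1: map-side work is pipelined in memory; its output is
    // written to local shuffle files on disk at the stage boundary.
    val pairs = words.map(w => (w, 1))

    // reduceByKey introduces a shuffle: Stage 2 reads those shuffle files.
    val counts = pairs.reduceByKey(_ + _)

    // Within each stage, intermediate results stay in memory and nothing
    // is written to HDFS between the map and the reduce.
    counts.collect().foreach(println)

    spark.stop()
  }
}
```

So everything inside a stage is pipelined in memory, but the stage boundary itself still hits the disk, which is exactly what confuses me.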

[Image from https://youtu.be/7ooZ4S7Ay6Y]

Thanks!
