Because only stages that end in a shuffle write shuffle files, and most stages are not shuffle stages. Within a stage, narrow transformations (map, filter, etc.) are pipelined record by record in memory, so intermediate results are never persisted to disk the way MapReduce persists them between every map and reduce step.
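
As a rough analogy in plain Python (no Spark required), here is the difference between narrow, pipelined transformations inside one stage and the materialization forced at a shuffle boundary. The function names and data are made up for illustration:

```python
# Analogy only: narrow transformations such as map and filter are
# pipelined lazily, one record at a time, with no intermediate
# collection materialized -- similar to how operators inside a single
# Spark stage hand records to each other in memory.

def narrow_pipeline(records):
    # Each generator pulls one record from the previous one; nothing
    # is written out between the "map" and the "filter".
    mapped = (x * 2 for x in records)        # like rdd.map(lambda x: x * 2)
    filtered = (x for x in mapped if x > 4)  # like rdd.filter(lambda x: x > 4)
    return filtered

# Only at a wide dependency (groupByKey, reduceByKey, joins) must all
# records be grouped by key and repartitioned -- that stage boundary is
# where Spark writes shuffle files, analogous to building this dict.
def shuffle_boundary(pairs):
    buckets = {}
    for k, v in pairs:
        buckets.setdefault(k, []).append(v)
    return buckets

print(list(narrow_pipeline([1, 2, 3, 4])))                   # [6, 8]
print(shuffle_boundary([("a", 1), ("b", 2), ("a", 3)]))      # {'a': [1, 3], 'b': [2]}
```

So a job that is ten map/filter steps long runs as one stage with zero disk writes for intermediates, whereas MapReduce would write to HDFS after each map/reduce round.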
On Sat, Jul 2, 2022, 7:28 AM krexos <kre...@protonmail.com.invalid> wrote:

> Hello,
>
> One of the main "selling points" of Spark is that unlike Hadoop map-reduce,
> which persists intermediate results of its computation to HDFS (disk),
> Spark keeps all its results in memory. I don't understand this, as in
> reality when a Spark stage finishes it writes all of the data into shuffle
> files stored on the disk
> <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
> How then is this an improvement on map-reduce?
>
> Image from https://youtu.be/7ooZ4S7Ay6Y
>
> thanks!