Because only stages that end in a shuffle write shuffle files, and most stages are not shuffle stages. Within a stage, narrow transformations (map, filter, etc.) are pipelined record by record in memory, so intermediate results are never persisted to disk the way MapReduce persists them between every map and reduce step.
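
As a rough analogy in plain Python (no Spark required), here is the difference between narrow, pipelined transformations inside one stage and the materialization forced at a shuffle boundary. The function names and data are made up for illustration:

```python
# Analogy only: narrow transformations such as map and filter are
# pipelined lazily, one record at a time, with no intermediate
# collection materialized -- similar to how operators inside a single
# Spark stage hand records to each other in memory.

def narrow_pipeline(records):
    # Each generator pulls one record from the previous one; nothing
    # is written out between the "map" and the "filter".
    mapped = (x * 2 for x in records)        # like rdd.map(lambda x: x * 2)
    filtered = (x for x in mapped if x > 4)  # like rdd.filter(lambda x: x > 4)
    return filtered

# Only at a wide dependency (groupByKey, reduceByKey, joins) must all
# records be grouped by key and repartitioned -- that stage boundary is
# where Spark writes shuffle files, analogous to building this dict.
def shuffle_boundary(pairs):
    buckets = {}
    for k, v in pairs:
        buckets.setdefault(k, []).append(v)
    return buckets

print(list(narrow_pipeline([1, 2, 3, 4])))                   # [6, 8]
print(shuffle_boundary([("a", 1), ("b", 2), ("a", 3)]))      # {'a': [1, 3], 'b': [2]}
```

So a job that is ten map/filter steps long runs as one stage with zero disk writes for intermediates, whereas MapReduce would write to HDFS after each map/reduce round.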
On Sat, Jul 2, 2022, 7:28 AM krexos <kre...@protonmail.com.invalid> wrote:

> Hello,
>
> One of the main "selling points" of Spark is that unlike Hadoop map-reduce,
> which persists intermediate results of its computation to HDFS (disk),
> Spark keeps all its results in memory. I don't understand this, as in
> reality when a Spark stage finishes it writes all of the data into shuffle
> files stored on the disk
> <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
> How then is this an improvement on map-reduce?
>
> Image from https://youtu.be/7ooZ4S7Ay6Y
>
> thanks!