Isn't Spark the same in this regard? You could execute all of the narrow dependencies of a Spark stage in a single mapper, so you would end up with the same number of mappers + reducers as Spark has stages for the same job, no?
thanks,
krexos

------- Original Message -------
On Saturday, July 2nd, 2022 at 4:45 PM, Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:

> Yes, wide-dependency transformations are the cause of shuffles. However,
> between shuffles there is no write.
>
> On the other hand, in Hadoop MapReduce the output of the mappers goes to
> the local FS of the mappers every single time.
>
> a.
>
> On 2/7/22 16:41, krexos wrote:
>
>> This doesn't add up with what's described in the internals page I
>> included. What you are talking about is shuffle spills at the beginning
>> of the stage. What I am talking about is that at the end of the stage
>> Spark writes all of the stage's results to shuffle files on disk, so we
>> will have the same number of I/O writes as there are stages.
>>
>> thanks,
>> krexos
>>
>> ------- Original Message -------
>> On Saturday, July 2nd, 2022 at 3:34 PM, Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Krexos,
>>>
>>> If I understand correctly, you are asking how Spark is an improvement
>>> over MapReduce when Spark also involves disk I/O.
>>>
>>> Basically, the MapReduce phases write every intermediate result to
>>> disk, so on average a job involves around 6 disk I/O operations,
>>> whereas Spark (assuming it has enough memory to store intermediate
>>> results) on average involves about 3 times less disk I/O, i.e. only
>>> while reading the input data and writing the final data to disk.
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Sat, 2 Jul 2022, 17:58 krexos <kre...@protonmail.com.invalid> wrote:
>>>
>>>> Hello,
>>>>
>>>> One of the main "selling points" of Spark is that, unlike Hadoop
>>>> MapReduce, which persists intermediate results of its computation to
>>>> HDFS (disk), Spark keeps all its results in memory.
>>>> I don't understand this, as in reality when a Spark stage finishes [it writes all of the data into shuffle files stored on the disk](https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md).
>>>> How then is this an improvement on MapReduce?
>>>>
>>>> Image from https://youtu.be/7ooZ4S7Ay6Y
>>>>
>>>> thanks!

> --
> Apostolos N. Papadopoulos, Associate Professor
> Department of Informatics
> Aristotle University of Thessaloniki
> Thessaloniki, GREECE
> tel: ++0030312310991918
> email: papad...@csd.auth.gr
> twitter: @papadopoulos_ap
> web: http://datalab.csd.auth.gr/~apostol