Isn't Spark the same in this regard? You could execute all of the narrow dependencies of a Spark stage in a single mapper, so you would end up with the same number of mappers + reducers as Spark has stages for the same job, no?
thanks,
krexos

------- Original Message -------
On Saturday, July 2nd, 2022 at 4:45 PM, Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:

> Yes, wide-dependency transformations are the cause of shuffles. However,
> between shuffles there is no write.
>
> On the other hand, in Hadoop MapReduce the output of the mappers goes to
> the local FS of the mappers every single time.
>
> a.
>
> On 2/7/22 16:41, krexos wrote:
>
>> This doesn't add up with what's described in the internals page I
>> included. What you are talking about is shuffle spills at the beginning
>> of the stage. What I am talking about is that at the end of the stage
>> Spark writes all of the stage's results to shuffle files on disk, so we
>> will have the same number of I/O writes as there are stages.
>>
>> thanks,
>> krexos
>>
>> ------- Original Message -------
>> On Saturday, July 2nd, 2022 at 3:34 PM, Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Krexos,
>>>
>>> If I understand correctly, you are asking how Spark is an improvement
>>> over MapReduce when Spark also involves disk I/O.
>>>
>>> Basically, the MapReduce phases write every intermediate result to
>>> disk, so on average a job involves around 6 disk I/O operations,
>>> whereas Spark (assuming it has enough memory to store intermediate
>>> results) on average involves about 3 times less disk I/O, i.e. only
>>> while reading the input data and writing the final data to disk.
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Sat, 2 Jul 2022, 17:58 krexos <kre...@protonmail.com.invalid> wrote:
>>>
>>>> Hello,
>>>>
>>>> One of the main "selling points" of Spark is that, unlike Hadoop
>>>> MapReduce, which persists intermediate results of its computation to
>>>> HDFS (disk), Spark keeps all its results in memory.
>>>> I don't understand this, as in reality when a Spark stage finishes [it writes all of the data into shuffle files stored on the disk](https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md).
>>>> How then is this an improvement on MapReduce?
>>>>
>>>> Image from https://youtu.be/7ooZ4S7Ay6Y
>>>>
>>>> thanks!

> --
> Apostolos N. Papadopoulos, Associate Professor
> Department of Informatics
> Aristotle University of Thessaloniki
> Thessaloniki, GREECE
> tel: ++0030312310991918
> email: papad...@csd.auth.gr
> twitter: @papadopoulos_ap
> web: http://datalab.csd.auth.gr/~apostol