Re: How is Spark a memory based solution if it writes data to disk before shuffles?

Gourav Sengupta Tue, 05 Jul 2022 02:30:12 -0700

Hi,

Spark is just one of the technologies out there now; several other
technologies perform at least as well as Spark, and some far outperform it.




Regards,
Gourav

On Sat, Jul 2, 2022 at 7:42 PM Sid <flinkbyhe...@gmail.com> wrote:

> So, as per the discussion, shuffle stage output is also stored on disk and
> not in memory?
>
> On Sat, Jul 2, 2022 at 8:44 PM krexos <kre...@protonmail.com> wrote:
>
>>
>> thanks a lot!
>>
>> ------- Original Message -------
>> On Saturday, July 2nd, 2022 at 6:07 PM, Sean Owen <sro...@gmail.com>
>> wrote:
>>
>> I think that is more accurate, yes. Shuffle files are also local rather
>> than on distributed storage, which is an advantage. MR also had map-only
>> transforms and chained mappers, but they were harder to use. Not impossible,
>> but you could also say Spark just made it easier to do the more efficient
>> thing.
>>
>> On Sat, Jul 2, 2022, 9:34 AM krexos <kre...@protonmail.com.invalid>
>> wrote:
>>
>>>
>>> You said Spark performs IO only when reading data and writing final data
>>> to the disk. I thought by that you meant that it only reads the input files
>>> of the job and writes the output of the whole job to the disk, but in
>>> reality Spark does store intermediate results on disk, just in fewer places
>>> than MR.
>>>
>>> ------- Original Message -------
>>> On Saturday, July 2nd, 2022 at 5:27 PM, Sid <flinkbyhe...@gmail.com>
>>> wrote:
>>>
>>> I have explained the same thing in very layman's terms. Go through it
>>> once.
>>>
>>> On Sat, 2 Jul 2022, 19:45 krexos, <kre...@protonmail.com.invalid> wrote:
>>>
>>>>
>>>> I think I understand where Spark saves IO.
>>>>
>>>> in MR we have map -> reduce -> map -> reduce -> map -> reduce ...
>>>>
>>>> which writes results to disk at the end of each such "arrow",
>>>>
>>>> on the other hand in Spark we have
>>>>
>>>> map -> reduce + map -> reduce + map -> reduce ...
>>>>
>>>> which needs only about half the disk IO
>>>>
>>>> thanks everyone,
>>>> krexos
>>>>
>>>> ------- Original Message -------
>>>> On Saturday, July 2nd, 2022 at 1:35 PM, krexos <kre...@protonmail.com>
>>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> One of the main "selling points" of Spark is that unlike Hadoop
>>>> map-reduce that persists intermediate results of its computation to HDFS
>>>> (disk), Spark keeps all its results in memory. I don't understand this,
>>>> because in reality, when a Spark stage finishes, it writes all of the data
>>>> into shuffle files stored on the disk
>>>> <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
>>>> How then is this an improvement on map-reduce?
>>>>
>>>> Image from https://youtu.be/7ooZ4S7Ay6Y
>>>>
>>>>
>>>> thanks!
>>>>
>>>>
>>>>
>>>
>>
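
The arrow counting in the quoted thread can be made concrete with a small
toy model. This is only a sketch under stated assumptions, not Spark or
Hadoop internals: the function names are hypothetical, every "arrow" is
counted as exactly one disk write, and extra IO such as memory spills,
cached data, or HDFS replication is ignored.

```python
# Toy model (an illustration, not Spark or Hadoop code): count the points
# where a chain of map/reduce rounds materializes data on disk.

def mr_disk_writes(rounds: int) -> int:
    """Chained MapReduce jobs: map -> reduce -> map -> reduce -> ...

    Assumption: each job writes its map output to local disk for the
    shuffle, and writes its reduce output to HDFS for the next job.
    """
    shuffle_writes = rounds  # one per "map -> reduce" arrow
    hdfs_writes = rounds     # one per reduce output, including the final one
    return shuffle_writes + hdfs_writes


def spark_disk_writes(rounds: int) -> int:
    """Spark stages: map -> reduce + map -> reduce + map -> ...

    Assumption: the reduce side of one round and the map side of the next
    run in the same stage, so data hits disk only at shuffle boundaries
    and when the final result is written.
    """
    shuffle_writes = rounds  # one per shuffle boundary
    final_output = 1         # the job's result
    return shuffle_writes + final_output


if __name__ == "__main__":
    for rounds in (1, 3, 10, 100):
        mr = mr_disk_writes(rounds)
        spark = spark_disk_writes(rounds)
        print(f"rounds={rounds}: MR={mr}, Spark={spark}, ratio={mr / spark:.2f}")
```

In this model the ratio is 2n / (n + 1) for n rounds, so it approaches 2 for
long chains, which matches the "about half the IO" estimate, and it is 1 for
a single map/reduce round, where both systems write one shuffle and one
final output.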
