Saisai, this is not the case when I use spark-submit to run 2 jobs, one after another. The shuffle id remains the same.
--
Kannan

On Tue, Mar 24, 2015 at 7:35 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:
> Hi Kannan,
>
> As far as I know, the shuffle id in ShuffleDependency is incremented, so even
> if you run the same job twice, the shuffle dependency as well as the shuffle
> id will be different. The shuffle file name, which is composed of
> (shuffleId + mapId + reduceId), will therefore change, so there should be no
> name conflict even in the same directory.
>
> Thanks
> Jerry
>
>
> 2015-03-25 1:56 GMT+08:00 Kannan Rajah <kra...@maprtech.com>:
>
>> I am working on SPARK-1529. I ran into an issue with my change, where the
>> same shuffle file was being reused across 2 jobs. Please note this only
>> happens when I use a hard-coded location for shuffle files, say "/tmp".
>> It does not happen with the normal code path, which uses DiskBlockManager
>> to pick different directories for each run. So I want to understand how
>> DiskBlockManager guarantees that such a conflict will never happen.
>>
>> Let's say the shuffle block id has a value of shuffle_0_0_0. Then the data
>> file name is shuffle_0_0_0.data and the index file name is
>> shuffle_0_0_0.index. If I run a Spark job twice, one after another, these
>> files get created under different directories because of the hashing logic
>> in DiskBlockManager. But the hash is based on the file name alone, so how
>> can we be sure that there won't ever be a conflict?
>>
>> --
>> Kannan
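For readers following along: the directory selection being discussed hashes only the file name, and then maps that hash onto the executor's local root directories and a fixed number of subdirectories. Below is a minimal Python sketch of that scheme (the real logic is Scala inside Spark's DiskBlockManager; the hash function and helper names here are illustrative, not Spark's actual API). The point the thread turns on is visible in the sketch: the file name alone determines the subdirectory, so uniqueness across runs comes from each run getting freshly created, uniquely named root directories, not from the hash itself.

```python
def non_negative_hash(name: str) -> int:
    # Illustrative stand-in for hashing the block file name and
    # forcing the result non-negative (as Spark does before indexing).
    h = 0
    for ch in name:
        h = (31 * h + ord(ch)) % (2 ** 31)
    return h


def shuffle_file_path(filename: str, local_dirs: list[str],
                      sub_dirs_per_local_dir: int = 64) -> str:
    """Map a shuffle file name to a path, DiskBlockManager-style (sketch).

    The hash picks both a root dir and a subdirectory within it, so the
    same file name always lands in the same place for a given set of
    local_dirs. Conflict avoidance across runs relies on local_dirs
    being unique per application/executor.
    """
    h = non_negative_hash(filename)
    dir_id = h % len(local_dirs)
    sub_dir_id = (h // len(local_dirs)) % sub_dirs_per_local_dir
    return f"{local_dirs[dir_id]}/{sub_dir_id:02x}/{filename}"
```

With a hard-coded root such as "/tmp", two runs of the same job would map shuffle_0_0_0.data to the identical path, reproducing the conflict described above; with per-run root directories, the paths differ even though the hash is the same.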