Which version of Spark are you using? What do you mean when you say that you used a hardcoded location for shuffle files?
If you look at the current DiskBlockManager code, it looks like it will
create a per-application subdirectory in each of the local root
directories. Here's the call to create a subdirectory in each root dir:

https://github.com/apache/spark/blob/c5cc41468e8709d09c09289bb55bc8edc99404b1/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L126

This call to Utils.createDirectory() should result in a fresh
subdirectory being created for just this application (note the use of
random UUIDs, plus the check to ensure that the directory doesn't
already exist):

https://github.com/apache/spark/blob/c5cc41468e8709d09c09289bb55bc8edc99404b1/core/src/main/scala/org/apache/spark/util/Utils.scala#L273

So, although the filenames for shuffle files are not globally unique,
their full paths should be unique thanks to these unique
per-application subdirectories (see the sketches after the quoted
thread for an illustration). Have you observed an instance where this
isn't the case?

- Josh

On Tue, Mar 24, 2015 at 11:04 PM, Kannan Rajah <kra...@maprtech.com> wrote:

> Saisai,
> This is not the case when I use spark-submit to run 2 jobs, one after
> another. The shuffle id remains the same.
>
> --
> Kannan
>
> On Tue, Mar 24, 2015 at 7:35 PM, Saisai Shao <sai.sai.s...@gmail.com>
> wrote:
>
> > Hi Kannan,
> >
> > As far as I know, the shuffle id in ShuffleDependency keeps
> > increasing, so even if you run the same job twice, the shuffle
> > dependency, and therefore the shuffle id, will be different. Since
> > the shuffle file name is composed of (shuffleId + mapId + reduceId),
> > the file name changes as well, so as far as I know there is no name
> > conflict even within the same directory.
> >
> > Thanks
> > Jerry
> >
> > 2015-03-25 1:56 GMT+08:00 Kannan Rajah <kra...@maprtech.com>:
> >
> >> I am working on SPARK-1529. I ran into an issue with my change where
> >> the same shuffle file was being reused across 2 jobs. Note that this
> >> only happens when I use a hardcoded location for shuffle files, say
> >> "/tmp". It does not happen with the normal code path, which uses
> >> DiskBlockManager to pick different directories for each run. So I
> >> want to understand how DiskBlockManager guarantees that such a
> >> conflict will never happen.
> >>
> >> Let's say the shuffle block id has a value of shuffle_0_0_0. Then
> >> the data file name is shuffle_0_0_0.data and the index file name is
> >> shuffle_0_0_0.index. If I run a Spark job twice, one after another,
> >> these files get created under different directories because of the
> >> hashing logic in DiskBlockManager. But the hash is based off the
> >> file name, so how can we be sure that there will never be a
> >> conflict?
> >>
> >> --
> >> Kannan
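
[Appended for illustration] Below is a minimal, self-contained Scala
sketch of the two mechanisms Josh describes: a unique per-application
directory created with a random UUID (modeled loosely on
Utils.createDirectory()), and hash-of-filename placement within that
directory (modeled loosely on DiskBlockManager). The names
createUniqueDir and resolveShuffleFile are hypothetical, and the
constants are made up; the real logic lives at the two links above.

    import java.io.{File, IOException}
    import java.util.UUID

    object ShuffleLayoutSketch {

      // Loosely modeled on Utils.createDirectory(): retry with a fresh
      // random UUID until a directory that did not previously exist is
      // created, so two applications can never share a directory.
      def createUniqueDir(root: String, prefix: String = "spark"): File = {
        var attempts = 0
        var dir: File = null
        while (dir == null) {
          attempts += 1
          if (attempts > 10) {
            throw new IOException(s"Failed to create a unique dir under $root")
          }
          val candidate = new File(root, s"$prefix-${UUID.randomUUID}")
          // mkdirs() returns false if the path already exists, so a UUID
          // collision simply triggers another attempt with a new UUID.
          if (candidate.mkdirs()) dir = candidate
        }
        dir
      }

      // Loosely modeled on DiskBlockManager's placement: the hash of the
      // *file name* picks a subdirectory, but the parent is the unique
      // per-application directory, so the full paths differ across apps
      // even when the file names (and thus the hashes) are identical.
      def resolveShuffleFile(appDir: File, subDirs: Int, name: String): File = {
        val hash = name.hashCode & Int.MaxValue  // non-negative hash
        val subDir = new File(appDir, "%02x".format(hash % subDirs))
        subDir.mkdirs()
        new File(subDir, name)
      }

      def main(args: Array[String]): Unit = {
        // Two "applications" writing the same shuffle file name end up
        // with distinct full paths.
        val app1 = createUniqueDir("/tmp")
        val app2 = createUniqueDir("/tmp")
        println(resolveShuffleFile(app1, 64, "shuffle_0_0_0.data"))
        println(resolveShuffleFile(app2, 64, "shuffle_0_0_0.data"))
        // e.g. /tmp/spark-<uuid1>/1d/shuffle_0_0_0.data
        //      /tmp/spark-<uuid2>/1d/shuffle_0_0_0.data
      }
    }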
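
And a short sketch for the naming point in the quoted exchange: the
shuffle file name is derived from (shuffleId, mapId, reduceId), as in
Spark's ShuffleDataBlockId. Within one application the shuffle id keeps
increasing across jobs (Saisai's point), but a fresh application starts
counting from 0 again, which is consistent with Kannan seeing
shuffle_0_0_0 in two back-to-back spark-submit runs. The helper below
is hypothetical:

    // Hypothetical helper mirroring the shuffle_<shuffleId>_<mapId>_<reduceId>
    // naming scheme discussed above.
    def shuffleDataFileName(shuffleId: Int, mapId: Int, reduceId: Int): String =
      s"shuffle_${shuffleId}_${mapId}_${reduceId}.data"

    shuffleDataFileName(0, 0, 0)  // "shuffle_0_0_0.data"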