Which version of Spark are you using? What do you mean when you say that you used a hardcoded location for shuffle files?
If you look at the current DiskBlockManager code, it looks like it will
create a per-application subdirectory in each of the local root
directories. Here's the call to create a subdirectory in each root dir:

https://github.com/apache/spark/blob/c5cc41468e8709d09c09289bb55bc8edc99404b1/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L126

This call to Utils.createDirectory() should result in a fresh
subdirectory being created for just this application (note the use of
random UUIDs, plus the check to ensure that the directory doesn't
already exist):

https://github.com/apache/spark/blob/c5cc41468e8709d09c09289bb55bc8edc99404b1/core/src/main/scala/org/apache/spark/util/Utils.scala#L273

So, although the filenames for shuffle files are not globally unique,
their full paths should be unique thanks to these unique
per-application subdirectories (see the sketches after the quoted
thread for an illustration). Have you observed an instance where this
isn't the case?

- Josh

On Tue, Mar 24, 2015 at 11:04 PM, Kannan Rajah <kra...@maprtech.com> wrote:

> Saisai,
> This is not the case when I use spark-submit to run 2 jobs, one after
> another. The shuffle id remains the same.
>
> --
> Kannan
>
> On Tue, Mar 24, 2015 at 7:35 PM, Saisai Shao <sai.sai.s...@gmail.com>
> wrote:
>
> > Hi Kannan,
> >
> > As far as I know, the shuffle id in ShuffleDependency keeps
> > increasing, so even if you run the same job twice, the shuffle
> > dependency, and therefore the shuffle id, will be different. Since
> > the shuffle file name is composed of (shuffleId + mapId + reduceId),
> > the file name changes as well, so as far as I know there is no name
> > conflict even within the same directory.
> >
> > Thanks
> > Jerry
> >
> > 2015-03-25 1:56 GMT+08:00 Kannan Rajah <kra...@maprtech.com>:
> >
> >> I am working on SPARK-1529. I ran into an issue with my change where
> >> the same shuffle file was being reused across 2 jobs. Note that this
> >> only happens when I use a hardcoded location for shuffle files, say
> >> "/tmp". It does not happen with the normal code path, which uses
> >> DiskBlockManager to pick different directories for each run. So I
> >> want to understand how DiskBlockManager guarantees that such a
> >> conflict will never happen.
> >>
> >> Let's say the shuffle block id has a value of shuffle_0_0_0. Then
> >> the data file name is shuffle_0_0_0.data and the index file name is
> >> shuffle_0_0_0.index. If I run a Spark job twice, one after another,
> >> these files get created under different directories because of the
> >> hashing logic in DiskBlockManager. But the hash is based off the
> >> file name, so how can we be sure that there will never be a
> >> conflict?
> >>
> >> --
> >> Kannan
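
[Appended for illustration] Below is a minimal, self-contained Scala
sketch of the two mechanisms Josh describes: a unique per-application
directory created with a random UUID (modeled loosely on
Utils.createDirectory()), and hash-of-filename placement within that
directory (modeled loosely on DiskBlockManager). The names
createUniqueDir and resolveShuffleFile are hypothetical, and the
constants are made up; the real logic lives at the two links above.

    import java.io.{File, IOException}
    import java.util.UUID

    object ShuffleLayoutSketch {

      // Loosely modeled on Utils.createDirectory(): retry with a fresh
      // random UUID until a directory that did not previously exist is
      // created, so two applications can never share a directory.
      def createUniqueDir(root: String, prefix: String = "spark"): File = {
        var attempts = 0
        var dir: File = null
        while (dir == null) {
          attempts += 1
          if (attempts > 10) {
            throw new IOException(s"Failed to create a unique dir under $root")
          }
          val candidate = new File(root, s"$prefix-${UUID.randomUUID}")
          // mkdirs() returns false if the path already exists, so a UUID
          // collision simply triggers another attempt with a new UUID.
          if (candidate.mkdirs()) dir = candidate
        }
        dir
      }

      // Loosely modeled on DiskBlockManager's placement: the hash of the
      // *file name* picks a subdirectory, but the parent is the unique
      // per-application directory, so the full paths differ across apps
      // even when the file names (and thus the hashes) are identical.
      def resolveShuffleFile(appDir: File, subDirs: Int, name: String): File = {
        val hash = name.hashCode & Int.MaxValue  // non-negative hash
        val subDir = new File(appDir, "%02x".format(hash % subDirs))
        subDir.mkdirs()
        new File(subDir, name)
      }

      def main(args: Array[String]): Unit = {
        // Two "applications" writing the same shuffle file name end up
        // with distinct full paths.
        val app1 = createUniqueDir("/tmp")
        val app2 = createUniqueDir("/tmp")
        println(resolveShuffleFile(app1, 64, "shuffle_0_0_0.data"))
        println(resolveShuffleFile(app2, 64, "shuffle_0_0_0.data"))
        // e.g. /tmp/spark-<uuid1>/1d/shuffle_0_0_0.data
        //      /tmp/spark-<uuid2>/1d/shuffle_0_0_0.data
      }
    }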
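
And a short sketch for the naming point in the quoted exchange: the
shuffle file name is derived from (shuffleId, mapId, reduceId), as in
Spark's ShuffleDataBlockId. Within one application the shuffle id keeps
increasing across jobs (Saisai's point), but a fresh application starts
counting from 0 again, which is consistent with Kannan seeing
shuffle_0_0_0 in two back-to-back spark-submit runs. The helper below
is hypothetical:

    // Hypothetical helper mirroring the shuffle_<shuffleId>_<mapId>_<reduceId>
    // naming scheme discussed above.
    def shuffleDataFileName(shuffleId: Int, mapId: Int, reduceId: Int): String =
      s"shuffle_${shuffleId}_${mapId}_${reduceId}.data"

    shuffleDataFileName(0, 0, 0)  // "shuffle_0_0_0.data"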