Saisai, this is not the case when I use spark-submit to run 2 jobs, one after another. The shuffle id remains the same.
--
Kannan

On Tue, Mar 24, 2015 at 7:35 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:
> Hi Kannan,
>
> As far as I know, the shuffle id in ShuffleDependency is incremented, so even
> if you run the same job twice, the shuffle dependency as well as the shuffle
> id will be different. The shuffle file name, which is composed of
> (shuffleId + mapId + reduceId), will therefore change, so there should be no
> name conflict even in the same directory.
>
> Thanks
> Jerry
>
>
> 2015-03-25 1:56 GMT+08:00 Kannan Rajah <kra...@maprtech.com>:
>
>> I am working on SPARK-1529. I ran into an issue with my change, where the
>> same shuffle file was being reused across 2 jobs. Please note this only
>> happens when I use a hard-coded location for shuffle files, say "/tmp".
>> It does not happen with the normal code path, which uses DiskBlockManager
>> to pick different directories for each run. So I want to understand how
>> DiskBlockManager guarantees that such a conflict will never happen.
>>
>> Let's say the shuffle block id has a value of shuffle_0_0_0. Then the data
>> file name is shuffle_0_0_0.data and the index file name is
>> shuffle_0_0_0.index. If I run a Spark job twice, one after another, these
>> files get created under different directories because of the hashing logic
>> in DiskBlockManager. But the hash is based on the file name alone, so how
>> can we be sure that there won't ever be a conflict?
>>
>> --
>> Kannan
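For readers following along: the directory selection being discussed hashes only the file name, and then maps that hash onto the executor's local root directories and a fixed number of subdirectories. Below is a minimal Python sketch of that scheme (the real logic is Scala inside Spark's DiskBlockManager; the hash function and helper names here are illustrative, not Spark's actual API). The point the thread turns on is visible in the sketch: the file name alone determines the subdirectory, so uniqueness across runs comes from each run getting freshly created, uniquely named root directories, not from the hash itself.

```python
def non_negative_hash(name: str) -> int:
    # Illustrative stand-in for hashing the block file name and
    # forcing the result non-negative (as Spark does before indexing).
    h = 0
    for ch in name:
        h = (31 * h + ord(ch)) % (2 ** 31)
    return h


def shuffle_file_path(filename: str, local_dirs: list[str],
                      sub_dirs_per_local_dir: int = 64) -> str:
    """Map a shuffle file name to a path, DiskBlockManager-style (sketch).

    The hash picks both a root dir and a subdirectory within it, so the
    same file name always lands in the same place for a given set of
    local_dirs. Conflict avoidance across runs relies on local_dirs
    being unique per application/executor.
    """
    h = non_negative_hash(filename)
    dir_id = h % len(local_dirs)
    sub_dir_id = (h // len(local_dirs)) % sub_dirs_per_local_dir
    return f"{local_dirs[dir_id]}/{sub_dir_id:02x}/{filename}"
```

With a hard-coded root such as "/tmp", two runs of the same job would map shuffle_0_0_0.data to the identical path, reproducing the conflict described above; with per-run root directories, the paths differ even though the hash is the same.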