Re: Understanding shuffle file name conflicts

Cheng Lian Wed, 25 Mar 2015 04:43:43 -0700

Hi Jerry & Josh

It has been a while since the last time I looked into Spark core shufflecode, maybe I’m wrong here. But the shuffle ID is created along withShuffleDependency, which is part of the RDD DAG. So if we submitmultiple jobs over the same RDD DAG, I think the shuffle IDs in thesejobs should duplicate. For example:


|val  dag  =  sc.parallelize(Array(1,2,3)).map(i => i -> i).reduceByKey(_ + _)
dag.collect()
dag.collect()
|

From the debug log output, I did see duplicated shuffle IDs in bothjobs. Something like this:


|# Job 1
15/03/25 19:26:34 DEBUG BlockStoreShuffleFetcher: Fetching outputs for shuffle 
0, reduce 2

# Job 2
15/03/25 19:26:36 DEBUG BlockStoreShuffleFetcher: Fetching outputs for shuffle 
0, reduce 5
|

So it’s also possible that some shuffle output files get reused indifferent jobs. But Kannan, did you submit separate jobs over the sameRDD DAG as I did above? If not, I’d agree with Jerry and Josh.


(Did I miss something here?)

Cheng

On 3/25/15 10:35 AM, Saisai Shao wrote:

Hi Kannan,

As I know the shuffle Id in ShuffleDependency will be increased, so even if
you run the same job twice, the shuffle dependency as well as shuffle id is
different, so the shuffle file name which is combined by
(shuffleId+mapId+reduceId) will be changed, so there's no name conflict
even in the same directory as I know.

Thanks
Jerry


2015-03-25 1:56 GMT+08:00 Kannan Rajah <kra...@maprtech.com>:

I am working on SPARK-1529. I ran into an issue with my change, where the
same shuffle file was being reused across 2 jobs. Please note this only
happens when I use a hard coded location to use for shuffle files, say
"/tmp". It does not happen with normal code path that uses DiskBlockManager
to pick different directories for each run. So I want to understand how
DiskBlockManager guarantees that such a conflict will never happen.

Let's say the shuffle block id has a value of shuffle_0_0_0. So the data
file name is shuffle_0_0_0.data and index file name is shuffle_0_0_0.index.
If I run a spark job twice, one after another, these files get created
under different directories because of the hashing logic in
DiskBlockManager. But the hash is based off the file name, so how are we
sure that there won't be a conflict ever?

--
Kannan

Re: Understanding shuffle file name conflicts

Reply via email to