Optimal Partition Strategy

2014-09-25 Thread Muttineni, Vinay
Hello, a bit of background: I have a dataset with about 200 million records and around 10 columns. The dataset is around 1.5 TB in size and is split across roughly 600 files. When I read this dataset using sparkContext, it creates around 3000 partitions by default if I do not specify the
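(As a rough sketch of what this looks like in practice, assuming a plain sc.textFile read of a hypothetical hdfs:///data/mydataset/* path: the default partition count comes from the input splits, roughly one per HDFS block, and it can be inspected or overridden along these lines.)

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionInspection {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("PartitionInspection")
        val sc = new SparkContext(conf)

        // Hypothetical HDFS path; substitute the real dataset location.
        val path = "hdfs:///data/mydataset/*"

        // With no minPartitions argument, Spark derives the partition count
        // from the input splits (roughly one per HDFS block), which is how
        // ~600 files totalling ~1.5 TB can yield ~3000 partitions.
        val defaultRdd = sc.textFile(path)
        println(s"Default partitions: ${defaultRdd.partitions.length}")

        // Requesting a minimum partition count splits the input more finely.
        val finerRdd = sc.textFile(path, minPartitions = 6000)
        println(s"Requested-minimum partitions: ${finerRdd.partitions.length}")

        // Alternatively, reshuffle an existing RDD to a target partition count.
        val rebalanced = defaultRdd.repartition(1200)
        println(s"After repartition: ${rebalanced.partitions.length}")

        sc.stop()
      }
    }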

Re: Optimal Partition Strategy

2014-09-25 Thread Andrew Ash
Hi Vinay, My guess is that Spark is taking the locality of the files into account and you don't have node-local data on all of your machines. This might be the case if you're reading out of HDFS and your 600 files are skewed so that they live on only about 200 of your 400 machines. A
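(A minimal sketch of how one might check for this and work around it, assuming the same hypothetical hdfs:///data/mydataset/* path; spark.locality.wait and repartition are standard Spark knobs, but the exact values here are illustrative only.)

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalityCheck {
      def main(args: Array[String]): Unit = {
        // Lowering spark.locality.wait lets the scheduler fall back to
        // non-local nodes quickly instead of idling executors that hold
        // no node-local blocks.
        val conf = new SparkConf()
          .setAppName("LocalityCheck")
          .set("spark.locality.wait", "1s")
        val sc = new SparkContext(conf)

        // Hypothetical path; substitute the real 600-file dataset.
        val rdd = sc.textFile("hdfs:///data/mydataset/*")

        // Count how many distinct hosts hold block replicas, to see whether
        // the input data is skewed onto a subset of the cluster.
        val hosts = rdd.partitions
          .flatMap(p => rdd.preferredLocations(p))
          .distinct
        println(s"Input blocks are local to ${hosts.length} distinct hosts")

        // Repartitioning shuffles the data once so that later stages can use
        // every executor, at the cost of that initial shuffle.
        val spread = rdd.repartition(sc.defaultParallelism * 3)
        println(s"Partitions after repartition: ${spread.partitions.length}")

        sc.stop()
      }
    }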