Hello,
A bit of background.
I have a dataset with about 200 million records and around 10 columns. The
dataset is around 1.5 TB in size and is split into around 600 files.
When I read this dataset using sparkContext, it creates around 3000
partitions by default if I do not specify the number of partitions myself.
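The default partition count comes from the input splits that the underlying Hadoop input format computes: roughly one split (and so one Spark partition) per block of each file, with the exact count also depending on the configured split size and any minPartitions hint. A minimal sketch of that arithmetic, using hypothetical file sizes rather than the dataset above, and not Spark's exact algorithm:

```python
import math

# Sketch of how a Hadoop-style input format derives splits: roughly one
# split per block of each file, and Spark makes one partition per split.
# (The real computation also honors minPartitions and split-size settings,
# so treat this as an approximation, not Spark's exact algorithm.)
def estimate_partitions(file_sizes_bytes, split_size_bytes=128 * 1024 * 1024):
    return sum(max(1, math.ceil(size / split_size_bytes))
               for size in file_sizes_bytes)

# Hypothetical example: four 300 MB files with a 128 MB split size
sizes = [300 * 1024 * 1024] * 4
print(estimate_partitions(sizes))  # 3 splits per file -> 12 partitions
```

With many multi-gigabyte files, a count in the low thousands is plausible; the point is that the number falls out of file and block sizes, not the record count.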
Hi Vinay,
What I'm guessing is happening is that Spark is taking the locality of the
files into account, and you don't have node-local data on all your
machines. This might be the case if you're reading out of HDFS and your
600 files are somehow skewed so that they sit on only about 200 of your 400 machines.
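One way to check that guess is to ask HDFS where the blocks actually live (for example via `hdfs fsck <path> -files -blocks -locations`) and count the distinct datanodes holding at least one replica. A minimal sketch, assuming the per-block host lists have already been parsed out of that report (the node names and parsing step are hypothetical):

```python
# Each entry is the list of datanodes holding replicas of one HDFS block,
# as parsed (hypothetically) from an `hdfs fsck ... -blocks -locations` report.
def nodes_with_local_data(block_hosts_lists):
    hosts = set()
    for hosts_for_block in block_hosts_lists:
        hosts.update(hosts_for_block)
    return hosts

# Hypothetical cluster: all blocks replicated across only 3 nodes, which
# would leave every other node with no node-local data to read.
blocks = [["node1", "node2"], ["node2", "node3"], ["node1", "node3"]]
print(sorted(nodes_with_local_data(blocks)))  # ['node1', 'node2', 'node3']
```

If the resulting set is much smaller than your cluster, that would explain Spark scheduling tasks on only a subset of machines for locality reasons.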
A