Would it make sense to read each file in as a separate RDD? That way you would be guaranteed the data is partitioned as you expect.
Possibly you could then repartition each of those RDDs into a single partition and then union them. I think that would achieve what you expect. But it would be easy to accidentally break this (have some operation that causes a shuffle), so I think you're better off just leaving them as separate RDDs.

On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:

> Hi,
>
> I have a set of input files for a Spark program, with each file
> corresponding to a logical data partition. What is the API/mechanism to
> assign each input file (or a set of files) to a Spark partition when
> initializing RDDs?
>
> When I create a Spark RDD pointing to the directory of files, my
> understanding is that it is not guaranteed each input file will be treated
> as a separate partition.
>
> My job semantics require that the data is partitioned, and I want to
> leverage the partitioning that has already been done, rather than
> repartitioning again in the Spark job.
>
> I tried to look this up online but haven't found any pointers so far.
>
> Thanks,
> Pala

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io  W: www.velos.io
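The approach suggested above could be sketched roughly as follows. This is an untested sketch, not a definitive implementation: the file paths and app name are placeholders, and `coalesce(1)` is used (rather than `repartition(1)`) because it collapses an RDD's partitions without triggering a shuffle. Note that `sc.textFile` on a single file may still yield multiple partitions if the file spans several HDFS blocks, which is exactly what the `coalesce(1)` guards against.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PerFileRDDs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("per-file-rdds"))

    // Placeholder list of input files, one per logical data partition.
    val files = Seq("hdfs:///data/part-0001", "hdfs:///data/part-0002")

    // Read each file as its own RDD, so each file's data stays together.
    val perFileRdds: Seq[RDD[String]] = files.map(path => sc.textFile(path))

    // Optionally collapse each RDD to a single partition and union them.
    // coalesce(1) avoids a shuffle, unlike repartition(1).
    // Caveat (as noted above): any later operation that shuffles
    // (e.g. reduceByKey) will destroy this one-file-per-partition layout,
    // so it may be safer to keep working on the separate RDDs instead.
    val unioned: RDD[String] = sc.union(perFileRdds.map(_.coalesce(1)))

    // ... per-partition processing would go here ...

    sc.stop()
  }
}
```

With this layout, `unioned` has one partition per input file, so per-partition operations such as `mapPartitions` see exactly one file's data at a time.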