Would it make sense to read each file in as a separate RDD? That way you
would be guaranteed the data is partitioned as you expect.

Possibly you could then coalesce each of those RDDs into a single
partition (coalesce, unlike repartition, avoids a shuffle) and then union
them. I think that would achieve what you expect. But it would be easy to
accidentally screw this up (have some operation that causes a shuffle), so
I think you're better off just leaving them as separate RDDs.

On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia <
mchett...@rocketfuelinc.com> wrote:

> Hi,
>
> I have a set of input files for a Spark program, with each file
> corresponding to a logical data partition. What is the API/mechanism to
> assign each input file (or a set of files) to a Spark partition, when
> initializing RDDs?
>
> When I create a Spark RDD pointing to the directory of files, my
> understanding is that it's not guaranteed each input file will be
> treated as a separate partition.
>
> My job semantics require that the data is partitioned, and I want to
> leverage the partitioning that has already been done, rather than
> repartitioning again in the Spark job.
>
> I tried looking this up online but haven't found any pointers so far.
>
>
> Thanks
> pala
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io
