Re: Assigning input files to spark partitions

2014-11-17 Thread Pala M Muthaia
Hi Daniel, yes, that should work too. However, is it possible to set things up so that each RDD has exactly one partition, without repartitioning (and thus incurring extra cost)? Is there a mechanism similar to MR where we can ensure each partition is assigned some amount of data by size, by setting some

Re: Assigning input files to spark partitions

2014-11-17 Thread Daniel Siegmann
I'm not aware of any such mechanism. On Mon, Nov 17, 2014 at 2:55 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Hi Daniel, yes, that should work too. However, is it possible to set things up so that each RDD has exactly one partition, without repartitioning (and thus incurring extra cost)?

Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
Would it make sense to read each file in as a separate RDD? This way you would be guaranteed the data is partitioned as you expected. Possibly you could then repartition each of those RDDs into a single partition and then union them. I think that would achieve what you expect. But it would be
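
The approach suggested above (one RDD per file, each reduced to a single partition, then a union) could be sketched roughly as follows in PySpark. This is only an illustration, not code from the thread: the helper name `one_partition_per_file` and the `paths` argument are hypothetical, `sc` is assumed to be an existing SparkContext, and `coalesce(1)` is used here in place of `repartition(1)` on the assumption that avoiding a shuffle is desirable.

```python
def one_partition_per_file(sc, paths):
    """Return one RDD in which each input file should occupy one partition.

    `sc` is assumed to be an existing pyspark SparkContext and `paths`
    a non-empty list of input file paths (both supplied by the caller).
    """
    if not paths:
        raise ValueError("at least one input path is required")

    per_file = [
        # coalesce(1) merges all splits of a file into one partition;
        # with its default shuffle=False it does not trigger a shuffle,
        # unlike repartition(1).
        sc.textFile(path).coalesce(1)
        for path in paths
    ]

    # SparkContext.union concatenates the partitions of its inputs, so the
    # result is expected to have len(paths) partitions, in input order.
    return sc.union(per_file)
```

Calling `getNumPartitions()` on the returned RDD is a cheap way to check that the partition count matches the number of files before relying on the layout.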

Re: Assigning input files to spark partitions

2014-11-13 Thread Rishi Yadav
If your data is in HDFS and you are reading it with textFile, and each file is smaller than the block size, my understanding is that you would always get one partition per file. On Thursday, November 13, 2014, Daniel Siegmann daniel.siegm...@velos.io wrote: Would it make sense to read each file in as a separate

Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
I believe Rishi is correct. I wouldn't rely on that though - all it would take is for one file to exceed the block size and you'd be setting yourself up for pain. Also, if your files are small - small enough to fit in a single record - you could use SparkContext.wholeTextFiles. On Thu, Nov 13,
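
As a rough illustration of the `wholeTextFiles` suggestion: that method returns one (path, content) record per file, so a file can never be split across partitions, although several files may share a partition. The sketch below is not from the thread; the helper name `line_counts_per_file` and the `directory` argument are hypothetical, and `sc` is assumed to be an existing SparkContext.

```python
def line_counts_per_file(sc, directory):
    """Return (file path, line count) pairs, one per input file.

    `sc` is assumed to be an existing pyspark SparkContext; `directory`
    is a caller-supplied path containing small text files.
    """
    # Each record is a (path, full file content) pair, so per-file logic
    # can run inside a single map call without caring about splits.
    files = sc.wholeTextFiles(directory)
    return files.mapValues(lambda content: len(content.splitlines()))
```

Because the whole file is held as one string, this only makes sense when every file fits comfortably in memory as a single record.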

Re: Assigning input files to spark partitions

2014-11-13 Thread Pala M Muthaia
Thanks for the responses, Daniel and Rishi. No, I don't want separate RDDs, because each of these partitions is processed the same way (in my case, each partition corresponds to HBase keys belonging to one region server, and I will do HBase lookups). After that I have aggregations too, hence

Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: No, I don't want separate RDDs, because each of these partitions is processed the same way (in my case, each partition corresponds to HBase keys belonging to one region server, and I will do HBase lookups).

Assigning input files to spark partitions

2014-11-12 Thread Pala M Muthaia
Hi, I have a set of input files for a Spark program, with each file corresponding to a logical data partition. What is the API/mechanism to assign each input file (or a set of files) to a Spark partition when initializing RDDs? When I create a Spark RDD pointing to the directory of files, my