I'm not aware of any such mechanism.

On Mon, Nov 17, 2014 at 2:55 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:

> Hi Daniel,
>
> Yes, that should work too. However, is it possible to set this up so that
> each RDD has exactly one partition, without repartitioning (and thus
> incurring extra cost)? Is there a mechanism similar to MR where we can
> ensure each partition is assigned some amount of data by size, by setting
> some block size parameter?
>
> On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann <daniel.siegm...@velos.io> wrote:
>
>> On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:
>>
>>> No, I don't want separate RDDs, because each of these partitions is
>>> processed the same way (in my case, each partition corresponds to HBase
>>> keys belonging to one region server, and I will do HBase lookups). After
>>> that I have aggregations too, hence all these partitions should be in the
>>> same RDD. The reason to follow the partition structure is to limit
>>> concurrent HBase lookups targeting a single region server.
>>
>> Neither of these is necessarily a barrier to using separate RDDs. You can
>> define the function you want to use and then pass it to multiple map
>> methods. Then you could union all the RDDs to do your aggregations. For
>> example, it might look something like this:
>>
>> val paths: Seq[String] = ... // the paths to the files you want to load
>> def myFunc(t: T) = ... // the function to apply to every record
>> val rdds = paths.map { path =>
>>   sc.textFile(path).map(myFunc)
>> }
>> val completeRdd = sc.union(rdds)
>>
>> Does that make any sense?
>>
>> --
>> Daniel Siegmann, Software Developer
>> Velos
>> Accelerating Machine Learning
>>
>> 54 W 40th St, New York, NY 10018
>> E: daniel.siegm...@velos.io W: www.velos.io
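On the "exactly one partition without repartitioning" question above, one option worth noting is `coalesce(numPartitions, shuffle = false)`: unlike `repartition`, it merges existing partitions locally rather than shuffling data across the cluster. A minimal sketch (the function name `loadAsSinglePartitions` is hypothetical, introduced here for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch: load each path as its own RDD, collapsed to a single partition.
// coalesce(1, shuffle = false) merges the input's partitions without a
// shuffle, so it avoids the cost of a full repartition. Each resulting
// RDD then contributes exactly one partition to the union.
def loadAsSinglePartitions(sc: SparkContext, paths: Seq[String]): RDD[String] = {
  val rdds = paths.map { path =>
    sc.textFile(path).coalesce(1, shuffle = false)
  }
  sc.union(rdds)
}
```

Note that this still funnels each file through one task, so it only makes sense when each input is small enough for a single partition. As for a size-based knob like MR's split size: `sc.textFile(path, minPartitions)` only sets a lower bound on partition count; to control split size directly you would, as far as I know, have to set the Hadoop input-format configuration (e.g. `mapreduce.input.fileinputformat.split.minsize`) on `sc.hadoopConfiguration`, since splits are computed by the underlying `InputFormat`.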