I'm not aware of any such mechanism.
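
For what it's worth, the shape of the union approach from earlier in the thread can be sketched without a cluster. The snippet below is a plain-Scala analogue only (no Spark API involved; myFunc and the sample data are made up): map one function over several per-file collections, then union the results.

```scala
object UnionPattern {
  // One function, defined once, applied uniformly to every partition,
  // mirroring the shape of the Spark example quoted below.
  def myFunc(s: String): Int = s.length

  def main(args: Array[String]): Unit = {
    // Each inner Seq stands in for one per-file RDD (sc.textFile(path)).
    val partitions = Seq(Seq("a", "bb"), Seq("ccc", "dddd"))
    // paths.map { path => sc.textFile(path).map(myFunc) } becomes:
    val mapped = partitions.map(_.map(myFunc))
    // sc.union(rdds) becomes a flatten over the mapped collections.
    val completeRdd = mapped.flatten
    println(completeRdd) // List(1, 2, 3, 4)
  }
}
```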

On Mon, Nov 17, 2014 at 2:55 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:

> Hi Daniel,
>
> Yes, that should work too. However, is it possible to set things up so that
> each RDD has exactly one partition, without repartitioning (and thus
> incurring extra cost)? Is there a mechanism similar to MapReduce, where we
> can ensure each partition is assigned a certain amount of data by size, by
> setting a block size parameter?
>
>
>
> On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann <daniel.siegm...@velos.io> wrote:
>
>> On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:
>>
>>>
>>> No, I don't want separate RDDs, because each of these partitions is
>>> processed the same way (in my case, each partition corresponds to the
>>> HBase keys belonging to one region server, and I will do HBase lookups).
>>> After that I have aggregations too, so all of these partitions should be
>>> in the same RDD. The reason to follow the partition structure is to limit
>>> concurrent HBase lookups targeting a single region server.
>>>
>>
>> Neither of these is necessarily a barrier to using separate RDDs. You can
>> define the function you want to use and then pass it to multiple map
>> methods. Then you could union all the RDDs to do your aggregations. For
>> example, it might look something like this:
>>
>> val paths: Seq[String] = ... // the paths to the files you want to load
>> def myFunc(line: String) = ... // the function to apply to every line
>> val rdds = paths.map { path =>
>>     sc.textFile(path).map(myFunc)
>> }
>> val completeRdd = sc.union(rdds)
>>
>> Does that make any sense?
>>
>> --
>> Daniel Siegmann, Software Developer
>> Velos
>> Accelerating Machine Learning
>>
>> 54 W 40th St, New York, NY 10018
>> E: daniel.siegm...@velos.io W: www.velos.io
>>
>
>


