Re: How can I create an RDD with millions of entries created programmatically

2014-12-09 Thread Daniel Darabos
Ah... I think you're right about the flatMap then :). Or you could use mapPartitions. (I'm not sure if it makes a difference.)

On Mon, Dec 8, 2014 at 10:09 PM, Steve Lewis wrote:
> Looks good, but how do I say that in Java? As far as I can see,
> sc.parallelize (in Java) has only one implementation, which takes a List -
> requiring an in-memory representation. […]
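For concreteness, here is a minimal sketch of the mapPartitions variant in Java, written against the Spark 1.x Java API of the time (where the partition function returns an Iterable; Spark 2.x and later expect an Iterator). The String payload, seed values, and object counts are placeholders, not the poster's real code:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    class MapPartitionsSketch {
        // Expand a tiny seed list into many objects, one partition per seed.
        static JavaRDD<String> generate(JavaSparkContext sc) {
            List<Integer> seeds = Arrays.asList(1, 2, 3, 4, 5);   // small; fits easily in driver memory
            return sc.parallelize(seeds, seeds.size())            // one seed per partition
                     .mapPartitions(seedIter -> {
                         List<String> out = new ArrayList<>();
                         while (seedIter.hasNext()) {
                             int seed = seedIter.next();
                             for (int i = 0; i < 1_000_000; i++) {
                                 out.add("object-" + seed + "-" + i);  // generated on the executor
                             }
                         }
                         return out;  // Iterable on Spark 1.x; return out.iterator() on 2.x+
                     });
        }
    }

Note that buffering a whole partition into a List keeps the driver out of the picture but still costs executor memory, so whether this beats flatMap likely comes down to that trade-off rather than semantics.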

Re: How can I create an RDD with millions of entries created programmatically

2014-12-08 Thread Steve Lewis
Looks good, but how do I say that in Java? As far as I can see, sc.parallelize (in Java) has only one implementation, which takes a List - requiring an in-memory representation.

On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos <daniel.dara...@lynxanalytics.com> wrote:
> Hi,
> I think you have the right idea. […]
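That implementation is JavaSparkContext.parallelize(List, numSlices), but only the small seed list has to live in driver memory; the expensive expansion can happen inside a flatMap on the executors. A hedged sketch against the Spark 1.x Java API (where FlatMapFunction returns an Iterable; Spark 2.x+ expects an Iterator), with a placeholder MyObject standing in for the real domain object:

    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    class FlatMapSketch {
        // Placeholder for the real domain object.
        static class MyObject implements Serializable {
            final long id;
            MyObject(long id) { this.id = id; }
        }

        static JavaRDD<MyObject> generate(JavaSparkContext sc, int numSeeds, int objectsPerSeed) {
            // Only this small seed list is materialized on the driver.
            List<Integer> seeds = new ArrayList<>();
            for (int i = 0; i < numSeeds; i++) {
                seeds.add(i);
            }
            // parallelize(List, numSlices): one partition per seed keeps partitions even.
            return sc.parallelize(seeds, numSeeds)
                     .flatMap(seed -> {
                         List<MyObject> out = new ArrayList<>(objectsPerSeed);
                         long base = seed.longValue() * objectsPerSeed;
                         for (int i = 0; i < objectsPerSeed; i++) {
                             out.add(new MyObject(base + i));
                         }
                         return out;  // Iterable on Spark 1.x; return out.iterator() on 2.x+
                     });
        }
    }

Called as, say, generate(sc, 1000, 100_000), this would yield a hundred million objects while never holding more than a thousand Integers on the driver.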

Re: How can I create an RDD with millions of entries created programmatically

2014-12-08 Thread Daniel Darabos
Hi,

I think you have the right idea. I would not even worry about flatMap.

    val rdd = sc.parallelize(1 to 100, numSlices = 1000).map(x => generateRandomObject(x))

Then when you try to evaluate something on this RDD, it will happen partition-by-partition. So 1000 random objects will be generated […]
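For readers following along in Java: the snippet above is Scala, and a rough Java rendering of the same map-based idea might look like the sketch below. It assumes the Java API of the era; generateRandomObject, MyObject, and the counts are hypothetical stand-ins, and since the Java API has no `1 to N` range, the seed integers themselves do have to be built as a List on the driver.

    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    class MapSketch {
        static JavaRDD<MyObject> generate(JavaSparkContext sc, int count, int numSlices) {
            // Build the seed list explicitly (only the seeds, not the generated objects).
            List<Integer> seeds = new ArrayList<>(count);
            for (int i = 1; i <= count; i++) {
                seeds.add(i);
            }
            // Objects are only created when a partition is actually evaluated.
            return sc.parallelize(seeds, numSlices)
                     .map(x -> generateRandomObject(x));
        }

        // Hypothetical stand-in for the generateRandomObject above.
        static MyObject generateRandomObject(int seed) {
            return new MyObject(seed);
        }

        // Placeholder domain object.
        static class MyObject implements Serializable {
            final int seed;
            MyObject(int seed) { this.seed = seed; }
        }
    }

For example, generate(sc, 1_000_000, 1000) mirrors the Scala line at a million seeds; the list of boxed Integers costs on the order of tens of megabytes on the driver, small next to the generated data, and the generated objects themselves never touch the driver.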

How can I create an RDD with millions of entries created programmatically

2014-12-08 Thread Steve Lewis
I have a function which generates a Java object, and I want to explore failures which only happen when processing large numbers of these objects. The real code reads a many-gigabyte file, but in the test code I can generate similar objects programmatically. I could create a small list, parallelize it […]