`zipWithIndex` is compute-intensive and also breaks Spark's "transformations are lazy" model, so it is probably not appropriate to add this to the public RDD API. If `zipWithIndex` weren't already what I consider to be broken, I'd be much friendlier to building something more on top of it, but I really don't like `zipWithIndex` or this `sample` idea of yours as they stand. Aside from those concerns, the name `sample` doesn't really work, since it is too easy to assume that it means some kind of random sampling.
None of that is to say that you can't make effective use of `zipWithIndex` and your `sample` in your own code; but I'm not a fan of extending the Spark public API in this way.

On Tue, Dec 9, 2014 at 12:38 PM, Ganelin, Ilya <ilya.gane...@capitalone.com> wrote:

> Hi all – a utility that I’ve found useful several times now when working with RDDs is to be able to reason about segments of the RDD.
>
> For example, if I have two large RDDs and I want to combine them in a way that would be intractable in terms of memory or disk storage (e.g. a cartesian) but a piece-wise approach is tractable, there are not many methods that enable this.
>
> Existing relevant methods are:
> takeSample – Which allows me to take a number of random values from an RDD but does not let me process the entire RDD
> randomSplit – Which segments the RDD into an array of smaller RDDs – with this, however, I’ve seen numerous memory issues during execution on larger datasets or when splitting into many RDDs
> forEach or collect – Which require the RDD to fit into memory – which is not possible for larger datasets
>
> What I am proposing is the addition of a simple method to operate on an RDD which would operate equivalently to a substring operation in String:
>
> For an RDD[T]:
>
> def sample(startIdx : Int, endIdx : Int) : RDD[T] = {
>   val zipped = this.zipWithIndex()
>   val sampled = filter(idx => (idx >= startIdx) && (idx < endIdx))
>   sampled.map(_._1)
> }
>
> I would appreciate any feedback – thank you!
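
As an illustration of keeping this in user code rather than in the public RDD API, here is a minimal sketch of the index-range idea as an implicit extension. The names `RddSliceSketch`, `SliceOps`, and `sliceByIndex` are hypothetical and chosen only for this example; none of them is part of Spark. The sketch filters the zipped RDD (the quoted snippet filters the unzipped one) and uses `Long` bounds because `zipWithIndex` produces `Long` indices. It does not avoid the concerns raised in the reply: `zipWithIndex` still has to run a job up front when the RDD has more than one partition.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddSliceSketch {

  // Hypothetical user-side helper; NOT part of the Spark API.
  implicit class SliceOps[T: ClassTag](rdd: RDD[T]) {

    // Keeps the elements whose position falls in [startIdx, endIdx).
    // Calling zipWithIndex() triggers a Spark job immediately when the RDD
    // has more than one partition, because the per-partition element counts
    // are needed to assign indices.
    def sliceByIndex(startIdx: Long, endIdx: Long): RDD[T] =
      rdd.zipWithIndex()
        .filter { case (_, idx) => idx >= startIdx && idx < endIdx }
        .map { case (value, _) => value }
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone demo.
    val conf = new SparkConf().setAppName("rdd-slice-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(1 to 100, numSlices = 4)
      val slice = data.sliceByIndex(10L, 15L)
      println(slice.collect().mkString(", ")) // prints "11, 12, 13, 14, 15"
    } finally {
      sc.stop()
    }
  }
}
```

Each call to `sliceByIndex` re-indexes and scans the whole RDD, so processing an RDD piece by piece this way costs one full pass per piece unless the zipped RDD is cached first.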