Rather than embrace non-lazy transformations and add more of them, I'd prefer that we 1) try to fully characterize the needs that are driving their creation/usage; and 2) design and implement new Spark abstractions that will allow us to meet those needs and eliminate the existing non-lazy transformations. They really mess up things like creation of asynchronous FutureActions, job cancellation, accounting of job resource usage, etc., so I'd rather we seek a way out of the existing hole than make it deeper.
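For a concrete view of the kind of model-breaking at issue (the RangePartitioner case mentioned in the quoted thread below), here is a minimal spark-shell sketch, assuming a live SparkContext named sc. Constructing a RangePartitioner -- which sortByKey() does internally -- samples the parent RDD and launches a job before any action is ever called:

    import org.apache.spark.RangePartitioner

    val pairs = sc.parallelize(1 to 1000000).map(x => (x, x))

    // Constructing the partitioner samples `pairs` to pick range
    // boundaries, which runs a job right here -- no action involved.
    // sortByKey() builds one of these internally, so it too launches
    // a job at call time rather than lazily.
    val part = new RangePartitioner(8, pairs)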
On Mon, Jul 21, 2014 at 10:24 AM, Erik Erlandson <e...@redhat.com> wrote:

> ----- Original Message -----
> > Sure, drop() would be useful, but breaking the "transformations are
> > lazy; only actions launch jobs" model is abhorrent -- which is not to
> > say that we haven't already broken that model for useful operations
> > (cf. RangePartitioner, which is used for sorted RDDs), but rather that
> > each such exception to the model is a significant source of pain that
> > can be hard to work with or work around.
>
> A thought that comes to my mind here is that there are in fact already
> two categories of transform: ones that are truly lazy, and ones that are
> not. A possible option is to embrace that, and commit to documenting the
> two categories as such, with an obvious bias towards favoring lazy
> transforms (to paraphrase Churchill, we're down to haggling over the
> price).
>
> > I really wouldn't like to see another such model-breaking
> > transformation added to the API. On the other hand, being able to
> > write transformations with dependencies on these kinds of "internal"
> > jobs is sometimes very useful, so a significant reworking of Spark's
> > Dependency model that would allow for lazily running such internal
> > jobs and making the results available to subsequent stages may be
> > something worth pursuing.
>
> This seems like a very interesting angle. I don't have much feel for
> what a solution would look like, but it sounds as if it would involve
> caching all operations embodied by RDD transform method code for
> provisional execution. I believe that these levels of invocation are
> currently executed on the master, not the executor nodes.
>
> > On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash <and...@andrewash.com> wrote:
> >
> > > Personally I'd find the method useful -- I've often had a .csv file
> > > with a header row that I want to drop, so I filter it out, which
> > > touches all partitions anyway. I don't have any comments on the
> > > implementation quite yet though.
> > >
> > > On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson <e...@redhat.com> wrote:
> > >
> > > > A few weeks ago I submitted a PR for supporting rdd.drop(n), under
> > > > SPARK-2315:
> > > > https://issues.apache.org/jira/browse/SPARK-2315
> > > >
> > > > Supporting the drop method would make some operations convenient;
> > > > however, it forces computation of >= 1 partition of the parent RDD,
> > > > and so it would behave like a "partial action" that returns an RDD
> > > > as the result.
> > > >
> > > > I wrote up a discussion of these trade-offs here:
> > > >
> > > > http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
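To make the "partial action" trade-off concrete, here is a simplified sketch of why drop(n) can't stay lazy. This is not the SPARK-2315 implementation (which only computes the leading partitions it actually needs); it counts every partition up front for simplicity. The point is that knowing where the first n elements end requires launching a job the moment the method is invoked:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Hypothetical eager drop. The runJob call launches a job when
    // eagerDrop is invoked, before any action on the returned RDD.
    def eagerDrop[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] = {
      // Eagerly count the elements in each partition -- this is a job.
      val counts: Array[Int] =
        rdd.context.runJob(rdd, (it: Iterator[T]) => it.size)
      // cumulative(i) = number of elements before partition i
      val cumulative = counts.scanLeft(0)(_ + _)
      // The remainder stays lazy: each partition skips its share of n.
      rdd.mapPartitionsWithIndex { (idx, iter) =>
        iter.drop(math.max(0, math.min(n - cumulative(idx), counts(idx))))
      }
    }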
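For Andrew's header-row case specifically, there is a common workaround that stays fully lazy, assuming the header is the first record of the first partition (which holds for a single file read with sc.textFile). The input path here is hypothetical:

    val lines = sc.textFile("data.csv")   // hypothetical input path

    // Lazy header skip: nothing runs until an action is called.
    // Drops the first record of partition 0 only, leaving all other
    // partitions untouched.
    val noHeader = lines.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter
    }

Unlike filtering on record content, this touches only the one record it must, and it composes with the rest of the lazy pipeline.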