Re: RFC: Supporting the Scala drop Method for Spark RDDs

Erik Erlandson Mon, 21 Jul 2014 08:54:11 -0700


----- Original Message -----
> I too would like this feature. Erik's post makes sense. However, shouldn't
> the RDD also repartition itself after drop to effectively make use of
> cluster resources?



My thinking is that in most use cases (*) one is dropping a small number of
rows, all of which sit in the first partition, so repartitioning would not be
worth the cost.  The first partition would pass through mostly intact, and the
remaining partitions would be completely unchanged.

(*) or at least most use cases that I've considered.
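
To make the common case above concrete, here is a minimal sketch of my own. It
is not the code in the SPARK-2315 PR, and the names DropSketch, dropInPartition
and dropFromFirstPartition are hypothetical. It assumes all n rows to be
dropped are in partition 0. If n were larger than that partition, this sketch
would silently drop fewer than n rows, and handling that would need the
partition-size bookkeeping that makes drop a "partial action".

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object DropSketch {
  // Per-partition logic: trim the first n rows of partition 0 only,
  // and hand every other partition back untouched.
  def dropInPartition[T](n: Int)(idx: Int, iter: Iterator[T]): Iterator[T] =
    if (idx == 0) iter.drop(n) else iter

  // ASSUMPTION: n is no larger than the size of the first partition.
  // No shuffle or repartition is triggered; partitioning is preserved.
  def dropFromFirstPartition[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] =
    rdd.mapPartitionsWithIndex(
      (idx, iter) => dropInPartition(n)(idx, iter),
      preservesPartitioning = true
    )
}
```

With an existing SparkContext sc, one might call it as
DropSketch.dropFromFirstPartition(sc.textFile(path), 1) to skip a header row,
where path is a placeholder.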


> On Jul 21, 2014 8:58 PM, "Andrew Ash [via Apache Spark Developers List]" <
> ml-node+s1001551n7434...@n3.nabble.com> wrote:
> 
> > Personally I'd find the method useful -- I've often had a .csv file with a
> > header row that I want to drop so filter it out, which touches all
> > partitions anyway.  I don't have any comments on the implementation quite
> > yet though.
> >
> >
> > On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson <[hidden email]
> > <http://user/SendEmail.jtp?type=node&node=7434&i=0>> wrote:
> >
> > > A few weeks ago I submitted a PR for supporting rdd.drop(n), under
> > > SPARK-2315:
> > > https://issues.apache.org/jira/browse/SPARK-2315
> > >
> > > Supporting the drop method would make some operations convenient, however
> > > it forces computation of >= 1 partition of the parent RDD, and so it would
> > > behave like a "partial action" that returns an RDD as the result.
> > >
> > > I wrote up a discussion of these trade-offs here:
> > >
> > >
> > http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
> > >
> >
> >
