I think the worry here is that people often use count() to force execution, and when it is coupled with transformations that have side effects, it is no longer safe to skip running them.
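For example, a pipeline like the following depends on count() actually
executing the map(). This is only a minimal sketch against the Spark 1.x
accumulator API; the object name and app name are made up:

    import org.apache.spark.{SparkConf, SparkContext}

    object CountSideEffect {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("count-side-effect").setMaster("local[*]"))

        // The side effect carried by the map(): count how many records we touch.
        val processed = sc.accumulator(0L)

        val mapped = sc.parallelize(1 to 1000).map { x =>
          processed += 1L // this update is lost if count() skips the map()
          x * 2
        }

        println(mapped.count())   // 1000
        println(processed.value)  // 1000, but only because map() really ran

        sc.stop()
      }
    }

If count() were optimized to return the result without running the map(),
processed.value would stay 0 and any user relying on it would silently break.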
However, maybe we can add a new lazy val .size that doesn't require
recomputation.

On Sat, Mar 28, 2015 at 7:42 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> I definitely see the value in this. However, I think at this point it
> would be an incompatible behavioral change. People often use count in
> Spark to exercise their DAG. Omitting processing steps that were
> previously included would likely mislead many users into thinking their
> pipeline was running faster.
>
> It's possible there might be room for something like a new smartCount API
> or a new argument to count that allows it to avoid unnecessary
> transformations.
>
> -Sandy
>
> On Sat, Mar 28, 2015 at 6:10 AM, Sean Owen <so...@cloudera.com> wrote:
>
> > No, I'm not saying side effects change the count. But not executing
> > the map() function at all certainly has an effect on the side effects
> > of that function: the side effects which should take place never do.
> > I am not sure that is something to be 'fixed'; it's a legitimate
> > question.
> >
> > You can persist an RDD if you do not want to compute it twice.
> >
> > On Sat, Mar 28, 2015 at 1:05 PM, jimfcarroll <jimfcarr...@gmail.com>
> > wrote:
> > > Hi Sean,
> > >
> > > Thanks for the response.
> > >
> > > I can't imagine a case (though my imagination may be somewhat
> > > limited) where even map side effects could change the number of
> > > elements in the resulting map.
> > >
> > > I guess "count" wouldn't officially be an 'action' if it were
> > > implemented this way. At least it wouldn't ALWAYS be one.
> > >
> > > My example was contrived. We're passing RDDs to functions. If that
> > > RDD is an instance of my class, then its count() may take a shortcut.
> > > If I map/zip/zipWithIndex/mapPartition/etc. first then I'm stuck with
> > > a call that literally takes 100s to 1000s of times longer (seconds vs
> > > hours on some of our datasets) and since my custom RDDs are immutable
> > > they cache the count call so a second invocation is the cost of a
> > > method call's overhead.
> > >
> > > I could fix this in Spark if there's any interest in that change.
> > > Otherwise I'll need to overload more RDD methods for my own purposes
> > > (like all of the transformations). Of course, that will be more
> > > difficult because those intermediate classes (like MappedRDD) are
> > > private, so I can't extend them.
> > >
> > > Jim
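To make the lazy .size idea concrete, here is a rough user-side sketch;
SizedRDD is a made-up name, and a real .size would have to live on RDD
itself:

    import org.apache.spark.rdd.RDD

    // Wrap an (immutable) RDD and memoize its count, in the spirit of a
    // lazy val: the first access runs the full DAG once, preserving any
    // upstream side effects; later accesses are just a field read.
    class SizedRDD[T](val rdd: RDD[T]) {
      lazy val size: Long = rdd.count()
    }

    // Usage:
    //   val sized = new SizedRDD(expensiveRdd)
    //   sized.size  // runs one Spark job
    //   sized.size  // cached; no recomputation

This would sidestep the compatibility concern, since count() keeps its
current semantics and the caching is opt-in.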