Hello all, I worked around this for now using a class (that I already had) that inherits from RDD and that all of our custom RDDs inherit from. I did the following:
1) Overloaded all of the transformations (that get used in our app) that don't change the RDD size, wrapping the results in a proxy RDD that intercepts the count() call, returning a cached value or calling an abstract calculateSize() if it doesn't already know the count.

2) Piggybacked a count calculation on all of the actions that we use (aggregate, reduce, fold, foreach), so that as a side effect of calling any of these, the count is calculated and stored if it isn't already known.

The one thing I couldn't do (at least yet) was get zipWithIndex to calculate the count, because its implementation is too opaque inside of the RDD. If anyone wants to see the code I can post it.

Thanks for the responses.

Jim

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-count-tp11298p11311.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
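For anyone curious before I post the real code, a rough sketch of the two ideas above. This is a simplified illustration, not my actual class: the name CountingRDD is made up, calculateSize() stands in for the abstract method I mentioned, and the foreach override uses the Spark 2.x accumulator API to show the piggybacking idea.

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Dependency, OneToOneDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Sketch of a proxy RDD: delegates computation to the parent and caches
// the element count once it is known (CountingRDD is a hypothetical name).
abstract class CountingRDD[T: ClassTag](parent: RDD[T],
                                        initialCount: Option[Long])
    extends RDD[T](parent.sparkContext,
                   Seq(new OneToOneDependency(parent)): Seq[Dependency[_]]) {

  @volatile private var cachedCount: Option[Long] = initialCount

  // Subclasses supply a way to compute the size when it isn't cached yet.
  protected def calculateSize(): Long

  override protected def getPartitions: Array[Partition] = parent.partitions

  override def compute(split: Partition, ctx: TaskContext): Iterator[T] =
    parent.iterator(split, ctx)

  // Idea 1: intercept count() — return the cached value, or compute
  // and remember it.
  override def count(): Long = cachedCount match {
    case Some(n) => n
    case None =>
      val n = calculateSize()
      cachedCount = Some(n)
      n
  }

  // Idea 2: piggyback a count on an action we were going to run anyway,
  // so the size is learned as a side effect at no extra pass.
  override def foreach(f: T => Unit): Unit =
    if (cachedCount.isEmpty) {
      val counter = sparkContext.longAccumulator
      parent.foreach { t => counter.add(1); f(t) }
      cachedCount = Some(counter.value)  // set on the driver after the action
    } else {
      parent.foreach(f)
    }
}
```

A size-preserving transformation like map would then wrap its result in another CountingRDD, passing the cached count through, so the count survives the transformation chain.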