I'd like to get some feedback on an API design issue pertaining to RDDs. 

The design goal of avoiding RDD nesting, which I agree with, leads the
methods that operate on subsets of an RDD (not necessarily partitions) to
use Iterable (or Iterator) as an abstraction. The mapPartitions and groupBy*
families of methods are good examples. The problem with that API choice is
that developers very quickly lose the benefits of the RDD API, independent
of partitioning.
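
For reference, here are the signatures in question, simplified (implicit
ClassTags and optional parameters dropped); note that mapPartitions actually
exposes an Iterator while the groupBy* methods expose Iterables:

    // Simplified from org.apache.spark.rdd.RDD and PairRDDFunctions
    def mapPartitions[U](f: Iterator[T] => Iterator[U]): RDD[U]
    def groupBy[K](f: T => K): RDD[(K, Iterable[T])]
    def groupByKey(): RDD[(K, Iterable[V])]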

Consider two very simple problems that demonstrate the issue. The input is
the same for all: an RDD of integers that has been grouped into odd and
even.
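
For concreteness, here is a minimal sketch of that input (the names, and
the assumption of an in-scope SparkContext sc, are mine):

    import org.apache.spark.rdd.RDD

    // Illustrative setup: integers keyed by parity
    val numbers: RDD[Int] = sc.parallelize(1 to 100000)
    val keyed: RDD[(String, Int)] =
      numbers.keyBy(n => if (n % 2 == 0) "even" else "odd")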

1. Sample the odds at 10% and the evens at 20%. Trivial, since stratified
sampling (sampleByKey) is built into PairRDDFunctions; see the first sketch
below.

2. Sample at 10% if there are more than 1,000 elements in a group and at 20%
otherwise. Suddenly, the problem becomes much harder: the sub-groups are no
longer RDDs, so we can't use the RDD sampling API; see the second sketch
below.
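
Here is a sketch of problem 1, assuming the keyed RDD from above;
stratified sampling is a one-liner:

    // Problem 1: per-key fractions are all sampleByKey needs
    val fractions = Map("odd" -> 0.10, "even" -> 0.20)
    val sampled: RDD[(String, Int)] =
      keyed.sampleByKey(withReplacement = false, fractions = fractions)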
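
And a sketch of problem 2, which shows what I mean: once we group, the
values are plain Iterables, so we're down to hand-rolled Bernoulli sampling
(and groupByKey materializes each group in memory):

    import scala.util.Random

    // Problem 2: the RDD sampling API no longer applies inside a group
    val sampledBySize: RDD[(String, Int)] =
      keyed.groupByKey().flatMap { case (key, values) =>
        val vs = values.toSeq  // forces the whole group into memory
        val fraction = if (vs.size > 1000) 0.10 else 0.20
        // hand-rolled, non-reproducible Bernoulli sampling
        vs.filter(_ => Random.nextDouble() < fraction).map(v => (key, v))
      }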

Note that the only reason the first problem is easy is that stratified
sampling happens to be built into Spark. If it hadn't been, implementing it
with the high-level API abstractions wouldn't have been easy either. As more
and more people use Spark for ever more diverse sets of problems, the
likelihood that the RDD API already provides the high-level abstraction a
given problem needs will diminish.

How do you feel about this? Do you think it is desirable to lose all
high-level RDD API abstractions the moment we group an RDD or call
mapPartitions? Does the goal of avoiding nested RDDs mean there are
absolutely no high-level abstractions we can expose via the Iterables borne
of RDDs?

I'd love your thoughts.

/Sim
http://linkedin.com/in/simeons


