I started a wiki page:
https://cwiki.apache.org/confluence/display/SPARK/Development+Discussions


On Tue, Dec 22, 2015 at 6:27 AM, Tom Graves <tgraves...@yahoo.com> wrote:

> Do we have a summary of all the discussions and what is planned for 2.0
> then?  Perhaps we should put on the wiki for reference.
>
> Tom
>
>
> On Tuesday, December 22, 2015 12:12 AM, Reynold Xin <r...@databricks.com>
> wrote:
>
>
> FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.
>
> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <r...@databricks.com> wrote:
>
> I’m starting a new thread since the other one got intermixed with feature
> requests. Please refrain from making feature requests in this thread. Not
> that we shouldn’t be adding features, but we can always add features in
> 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and
> major releases in Spark, based on discussion with several members of the
> community: a major release should be low overhead and minimally disruptive
> to the Spark community. A major release should not be very different from a
> minor release and should not be gated based on new features. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break
> substantial APIs or perform large re-architecting that would prevent users
> from upgrading. Spark has always had a culture of making changes and
> evolving its architecture incrementally - and I don't think we want to
> change this model. In fact, we’ve released many architectural changes on
> the 1.x line.
>
> If the community likes the above model, then to me it seems reasonable to
> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
> after Spark 1.7. That would be 18 or 21 months after Spark 1.0. A cadence of
> major releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing
> in Spark 2.0, separated into APIs and Operation/Deployment:
>
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> applications can depend on Akka directly (SPARK-5293). We have gotten a lot
> of complaints about user applications being unable to use Akka because of
> Spark’s own dependency on it.
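>
> To make this concrete, here is a rough sketch (class and method names are
> just placeholders, not a proposal) of the kind of user-defined receiver
> that already stays off the Akka path by building on the public Receiver
> API instead of actorStream:
>
>     import org.apache.spark.storage.StorageLevel
>     import org.apache.spark.streaming.receiver.Receiver
>
>     // Illustrative receiver that polls a source on a plain thread rather
>     // than going through the Akka-based actorStream API, leaving the
>     // application free to bring its own Akka version (or none at all).
>     class PollingReceiver(url: String)
>       extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
>
>       def onStart(): Unit = {
>         new Thread("polling-receiver") {
>           override def run(): Unit = {
>             while (!isStopped()) {
>               store(scala.io.Source.fromURL(url).mkString)
>             }
>           }
>         }.start()
>       }
>
>       def onStop(): Unit = { }  // polling thread exits once isStopped() is true
>     }
>
>     // Usage: ssc.receiverStream(new PollingReceiver("http://example.com/feed"))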
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
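>
> For reference, the Scala API already keeps third-party types out of this
> signature; it is only the Java API that leaks Guava. A rough sketch:
>
>     import org.apache.spark.rdd.RDD
>
>     // Scala API: the result type uses scala.Option, so no Guava appears.
>     def leftJoinSketch(left: RDD[(Int, String)], right: RDD[(Int, String)])
>         : RDD[(Int, (String, Option[String]))] =
>       left.leftOuterJoin(right)
>
>     // The Java counterpart, JavaPairRDD.leftOuterJoin, instead exposes
>     // com.google.common.base.Optional in its return type, which is the
>     // Guava leak this item refers to.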
>
> 4. Better class package structure for low-level developer APIs. In
> particular, we have added a number of DeveloperApi classes (mostly various
> listener-related classes) over the years. Some packages include only one or
> two public classes but a lot of private classes. A better structure would
> isolate the public classes into a small number of public packages, and keep
> those public packages largely free of private classes.
>
> 5. Consolidate the task metrics and accumulator APIs. Despite some subtle
> differences, the two are very similar but currently have completely
> different code paths.
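>
> To illustrate the two code paths as they stand today (a rough sketch
> against the 1.x API; names and numbers are illustrative):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>     import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
>
>     val sc = new SparkContext(
>       new SparkConf().setAppName("metrics-sketch").setMaster("local[*]"))
>
>     // Path 1: built-in task metrics only surface through listener events.
>     sc.addSparkListener(new SparkListener {
>       override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>         println(s"task run time: ${taskEnd.taskMetrics.executorRunTime} ms")
>       }
>     })
>
>     // Path 2: user-defined counters go through accumulators, a separate
>     // mechanism even though both aggregate per-task values.
>     val recordCount = sc.accumulator(0L, "records")
>     sc.parallelize(1 to 100).foreach(_ => recordCount += 1L)
>     println(s"records counted: ${recordCount.value}")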
>
> 6. Possibly make Catalyst, Dataset, and DataFrame more general by moving
> them to other package(s). They are already used beyond SQL, e.g. in ML
> pipelines, and will also be used by streaming.
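>
> A small illustration of the "already used beyond SQL" point: ML pipeline
> stages consume and produce DataFrames even though the type lives under the
> sql package (sketch only):
>
>     import org.apache.spark.ml.feature.Tokenizer
>     import org.apache.spark.sql.DataFrame
>
>     // An ML pipeline stage: no SQL involved, yet the input and output
>     // types come from org.apache.spark.sql.
>     val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
>
>     def tokenize(df: DataFrame): DataFrame = tokenizer.transform(df)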
>
>
> Operation/Deployment
>
> 1. Make Scala 2.11 the default build. We should still support Scala 2.10,
> but it has reached end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an enormous
> assembly jar in order to run Spark.
>
