+1. On a related note, I think making it lightweight will ensure that we stay on the current release schedule and don't unnecessarily delay 2.0 to wait for new features / big architectural changes.
In terms of fixes to 1.x, I think our current policy of back-porting fixes to older releases would still apply. I don't think developing new features on both 1.x and 2.x makes a lot of sense, as we would like users to switch to 2.x.

Shivaram

On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis <kos...@cloudera.com> wrote:
> +1 on a lightweight 2.0
>
> What is the thinking around the 1.x line after Spark 2.0 is released? If not
> terminated, how will we determine what goes into each major version line?
> Will 1.x only be for stability fixes?
>
> Thanks,
> Kostas
>
> On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>
>> I also feel the same as Reynold. I agree we should minimize API breaks and
>> focus on fixing things around the edges that were mistakes (e.g. exposing
>> Guava and Akka) rather than any overhaul that could fragment the community.
>> Ideally a major release is a lightweight process we can do every couple of
>> years, with minimal impact for users.
>>
>> - Patrick
>>
>> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>>>
>>> > For this reason, I would *not* propose doing major releases to break
>>> > substantial APIs or perform large re-architecting that prevent users from
>>> > upgrading. Spark has always had a culture of evolving architecture
>>> > incrementally and making changes - and I don't think we want to change
>>> > this model.
>>>
>>> +1 for this. The Python community went through a lot of turmoil over the
>>> Python 2 -> Python 3 transition because the upgrade process was too painful
>>> for too long. The Spark community will benefit greatly from our explicitly
>>> looking to avoid a similar situation.
>>>
>>> > 3. Assembly-free distribution of Spark: don’t require building an
>>> > enormous assembly jar in order to run Spark.
>>>
>>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>>> distribution means.
>>>
>>> Nick
>>>
>>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>> I’m starting a new thread since the other one got intermixed with
>>>> feature requests. Please refrain from making feature requests in this
>>>> thread. Not that we shouldn’t be adding features, but we can always add
>>>> features in 1.7, 2.1, 2.2, ...
>>>>
>>>> First - I want to propose a premise for how to think about Spark 2.0 and
>>>> major releases in Spark, based on discussion with several members of the
>>>> community: a major release should be low overhead and minimally disruptive
>>>> to the Spark community. A major release should not be very different from
>>>> a minor release and should not be gated on new features. The main purpose
>>>> of a major release is an opportunity to fix things that are broken in the
>>>> current API and remove certain deprecated APIs (examples follow).
>>>>
>>>> For this reason, I would *not* propose doing major releases to break
>>>> substantial APIs or perform large re-architecting that prevent users from
>>>> upgrading. Spark has always had a culture of evolving architecture
>>>> incrementally and making changes - and I don't think we want to change
>>>> this model. In fact, we’ve released many architectural changes on the 1.x
>>>> line.
>>>>
>>>> If the community likes the above model, then to me it seems reasonable
>>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>>>> immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0.
>>>> A cadence of major releases every 2 years seems doable within the above
>>>> model.
>>>>
>>>> Under this model, here is a list of example things I would propose doing
>>>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>>>
>>>>
>>>> APIs
>>>>
>>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>>>> Spark 1.x.
>>>>
>>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>>>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>>>> about user applications being unable to use Akka due to Spark’s dependency
>>>> on Akka.
>>>>
>>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>>>
>>>> 4. Better class package structure for low-level developer APIs. In
>>>> particular, we have some DeveloperApi (mostly various listener-related
>>>> classes) added over the years. Some packages include only one or two
>>>> public classes but a lot of private classes. A better structure is to have
>>>> public classes isolated to a few public packages, and these public
>>>> packages should have minimal private classes for low-level developer APIs.
>>>>
>>>> 5. Consolidate the task metric and accumulator APIs. Although they have
>>>> some subtle differences, these two are very similar but have completely
>>>> different code paths.
>>>>
>>>> 6. Possibly make Catalyst, Dataset, and DataFrame more general by moving
>>>> them to other package(s). They are already used beyond SQL, e.g. in ML
>>>> pipelines, and will be used by streaming also.
>>>>
>>>>
>>>> Operation/Deployment
>>>>
>>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>>>> but it has reached end-of-life.
>>>>
>>>> 2. Remove Hadoop 1 support.
>>>>
>>>> 3. Assembly-free distribution of Spark: don’t require building an
>>>> enormous assembly jar in order to run Spark.
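
To illustrate the Guava item above (API change 3): in Spark 1.x, Java API methods such as JavaPairRDD.leftOuterJoin return Guava's com.google.common.base.Optional, so user code has to compile against whatever Guava version Spark bundles. The sketch below is not from the thread; the class, method, and variable names are hypothetical, and it only shows where the Guava type surfaces in a user's code.

    // Hypothetical user code compiled against Spark 1.x's Java API.
    // The point: leftOuterJoin exposes Guava's Optional in its return type,
    // coupling the user's build to the Guava version that Spark ships.
    import com.google.common.base.Optional;        // Guava type leaked by the API
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    public class GuavaOptionalLeak {
        // users: (userId -> name), orders: (userId -> orderCount); both assumed to exist.
        static JavaPairRDD<String, Tuple2<String, Optional<Integer>>> joinUsersWithOrders(
                JavaPairRDD<String, String> users,
                JavaPairRDD<String, Integer> orders) {
            // The Optional<Integer> in the result is com.google.common.base.Optional,
            // not java.util.Optional; this is the coupling Spark 2.0 would remove.
            return users.leftOuterJoin(orders);
        }
    }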