Agreed, it makes sense.

Regards
JB

On 11/11/2015 01:28 AM, Reynold Xin wrote:
Echoing Shivaram here. I don't think it makes a lot of sense to add more
features to the 1.x line. We should still do critical bug fixes though.


On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman
<shiva...@eecs.berkeley.edu> wrote:

    +1

    On a related note, I think making it lightweight will ensure that we
    stay on the current release schedule and don't unnecessarily delay 2.0
    to wait for new features / big architectural changes.

    In terms of fixes to 1.x, I think our current policy of back-porting
    fixes to older releases would still apply. I don't think developing
    new features on both 1.x and 2.x makes a lot of sense as we would like
    users to switch to 2.x.

    Shivaram

    On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis
    <kos...@cloudera.com> wrote:
     > +1 on a lightweight 2.0
     >
     > What is the thinking around the 1.x line after Spark 2.0 is released?
     > If not terminated, how will we determine what goes into each major
     > version line? Will 1.x only be for stability fixes?
     >
     > Thanks,
     > Kostas
     >
     > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell
     > <pwend...@gmail.com> wrote:
     >>
     >> I also feel the same as Reynold. I agree we should minimize API
     >> breaks and focus on fixing things around the edges that were mistakes
     >> (e.g. exposing Guava and Akka) rather than any overhaul that could
     >> fragment the community. Ideally a major release is a lightweight
     >> process we can do every couple of years, with minimal impact for
     >> users.
     >>
     >> - Patrick
     >>
     >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
     >> <nicholas.cham...@gmail.com> wrote:
     >>>
     >>> > For this reason, I would *not* propose doing major releases to
     >>> > break substantial APIs or perform large re-architecting that
     >>> > prevents users from upgrading. Spark has always had a culture of
     >>> > evolving architecture incrementally and making changes - and I
     >>> > don't think we want to change this model.
     >>>
     >>> +1 for this. The Python community went through a lot of turmoil over
     >>> the Python 2 -> Python 3 transition because the upgrade process was
     >>> too painful for too long. The Spark community will benefit greatly
     >>> from our explicitly looking to avoid a similar situation.
     >>>
     >>> > 3. Assembly-free distribution of Spark: don’t require building an
     >>> > enormous assembly jar in order to run Spark.
     >>>
     >>> Could you elaborate a bit on this? I'm not sure what an
     >>> assembly-free distribution means.
     >>>
     >>> Nick
     >>>
     >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin
     >>> <r...@databricks.com> wrote:
     >>>>
     >>>> I’m starting a new thread since the other one got intermixed with
     >>>> feature requests. Please refrain from making feature requests in
     >>>> this thread. Not that we shouldn’t be adding features, but we can
     >>>> always add features in 1.7, 2.1, 2.2, ...
     >>>>
     >>>> First - I want to propose a premise for how to think about Spark
     >>>> 2.0 and major releases in Spark, based on discussion with several
     >>>> members of the community: a major release should be low overhead
     >>>> and minimally disruptive to the Spark community. A major release
     >>>> should not be very different from a minor release and should not be
     >>>> gated based on new features. The main purpose of a major release is
     >>>> an opportunity to fix things that are broken in the current API and
     >>>> remove certain deprecated APIs (examples follow).
     >>>>
     >>>> For this reason, I would *not* propose doing major releases to
     >>>> break substantial APIs or perform large re-architecting that
     >>>> prevents users from upgrading. Spark has always had a culture of
     >>>> evolving architecture incrementally and making changes - and I
     >>>> don't think we want to change this model. In fact, we’ve released
     >>>> many architectural changes on the 1.x line.
     >>>>
     >>>> If the community likes the above model, then to me it seems
     >>>> reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark
     >>>> 1.7) or immediately after Spark 1.7. It will be 18 or 21 months
     >>>> since Spark 1.0. A cadence of major releases every 2 years seems
     >>>> doable within the above model.
     >>>>
     >>>> Under this model, here is a list of example things I would propose
     >>>> doing in Spark 2.0, separated into APIs and Operation/Deployment:
     >>>>
     >>>>
     >>>> APIs
     >>>>
     >>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated
     >>>> in Spark 1.x.
     >>>>
     >>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
     >>>> applications can use Akka (SPARK-5293). We have gotten a lot of
     >>>> complaints about user applications being unable to use Akka due to
     >>>> Spark’s dependency on Akka.
     >>>>
     >>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
     >>>>
     >>>> 4. Better class package structure for low-level developer APIs. In
     >>>> particular, we have some DeveloperApi classes (mostly various
     >>>> listener-related classes) added over the years. Some packages
     >>>> include only one or two public classes but a lot of private
     >>>> classes. A better structure is to have public classes isolated to a
     >>>> few public packages, and these public packages should have minimal
     >>>> private classes for low-level developer APIs.
     >>>>
     >>>> 5. Consolidate the task metric and accumulator APIs. Although they
     >>>> have some subtle differences, the two are very similar but have
     >>>> completely different code paths.
     >>>>
     >>>> 6. Possibly make Catalyst, Dataset, and DataFrame more general by
     >>>> moving them to other package(s). They are already used beyond SQL,
     >>>> e.g. in ML pipelines, and will also be used by streaming.
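
To make item 3 above concrete: in the 1.x Java API, Guava's Optional appears
directly in public signatures such as JavaPairRDD.leftOuterJoin, so user
applications get tied to the Guava version Spark ships. A minimal sketch (the
class and variable names below are invented for illustration):

    // Illustrative sketch only: where com.google.common.base.Optional surfaces
    // in the Spark 1.x public Java API. Class and variable names are invented.
    import com.google.common.base.Optional;        // Guava type in Spark's signature
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    public class GuavaInPublicApi {
      static void countMatches(JavaPairRDD<String, Integer> users,
                               JavaPairRDD<String, String> emails) {
        // The Optional in this return type is Guava's, so a Guava version
        // conflict in the application can surface through Spark's own types.
        JavaPairRDD<String, Tuple2<Integer, Optional<String>>> joined =
            users.leftOuterJoin(emails);
        long withEmail = joined.filter(kv -> kv._2()._2().isPresent()).count();
        System.out.println("users with an email: " + withEmail);
      }
    }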
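
Similarly, for item 5, the user-facing half of the overlap looks roughly like
the sketch below (names are illustrative); comparable per-task counts also
flow through the internal TaskMetrics classes on a separate code path, which
is what a consolidation would unify:

    // Illustrative sketch only: the user-facing accumulator path from item 5.
    import java.util.Arrays;
    import org.apache.spark.Accumulator;
    import org.apache.spark.api.java.JavaSparkContext;

    public class AccumulatorSketch {
      static void countElements(JavaSparkContext sc) {
        // Defined on the driver; each task sends back its partial updates.
        Accumulator<Integer> processed = sc.accumulator(0);

        // Every task increments the accumulator for each element it sees.
        sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
          .foreach(x -> processed.add(1));

        // Only the driver can read the merged value.
        System.out.println("elements processed: " + processed.value());
      }
    }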
     >>>>
     >>>>
     >>>> Operation/Deployment
     >>>>
     >>>> 1. Scala 2.11 as the default build. We should still support Scala
     >>>> 2.10, but it has reached end-of-life.
     >>>>
     >>>> 2. Remove Hadoop 1 support.
     >>>>
     >>>> 3. Assembly-free distribution of Spark: don’t require building an
     >>>> enormous assembly jar in order to run Spark.
     >>>>
     >>
     >



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

