I also thought the idea was to drop 2.10. Do we want to cross-build for 3 Scala versions?

On Nov 25, 2015 3:54 AM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
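For reference, cross-building one code base against multiple Scala versions is typically declared along these lines in sbt; this is a generic sketch with illustrative version numbers, not Spark's actual build configuration (Spark's release builds are Maven-based):

    // build.sbt -- generic sketch, not Spark's real build definition
    name := "example-app"                         // placeholder project name

    // One artifact is produced per entry when running `sbt +package`:
    crossScalaVersions := Seq("2.10.6", "2.11.7", "2.12.0-M3")

    // Default version used by a plain `sbt compile`:
    scalaVersion := "2.11.7"

Each additional entry in crossScalaVersions is another build, test, and publish row to maintain, which is the cost being weighed in this part of the thread.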
> I see. My concern is / was that cluster operators will be reluctant to upgrade to 2.0, meaning that developers using those clusters need to stay on 1.x, and, if they want to move to DataFrames, essentially need to port their app twice.

> I misunderstood and thought part of the proposal was to drop support for 2.10, though. If your broad point is that there aren't changes in 2.0 that will make it less palatable to cluster administrators than releases in the 1.x line, then yes, 2.0 as the next release sounds fine to me.

> -Sandy

> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

>> What are the other breaking changes in 2.0 though? Note that we're not removing Scala 2.10, we're just making the default build be against Scala 2.11 instead of 2.10. There seem to be very few changes that people would worry about. If people are going to update their apps, I think it's better to make the other small changes in 2.0 at the same time than to update once for Dataset and another time for 2.0.

>> BTW just refer to Reynold's original post for the other proposed API changes.

>> Matei

>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

>> I think that Kostas' logic still holds. The majority of Spark users, and likely an even vaster majority of people running large jobs, are still on RDDs and on the cusp of upgrading to DataFrames. Users will probably want to upgrade to the stable version of the Dataset / DataFrame API so they don't need to do so twice. Requiring that they absorb all the other ways that Spark breaks compatibility in the move to 2.0 makes it much more difficult for them to make this transition.

>> Using the same set of APIs also means that it will be easier to backport critical fixes to the 1.x line.

>> It's not clear to me that avoiding breakage of an experimental API in the 1.x line outweighs these issues.

>> -Sandy

>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <r...@databricks.com> wrote:

>>> I actually think the next one (after 1.6) should be Spark 2.0. The reason is that I already know we have to break some part of the DataFrame/Dataset API as part of the Dataset design (e.g. DataFrame.map should return Dataset rather than RDD). In that case, I'd rather break this sooner (in one release) than later (in two releases), so the damage is smaller.

>>> I don't think whether we call Dataset/DataFrame experimental or not matters too much for 2.0. We can still call Dataset experimental in 2.0 and then mark them as stable in 2.1. Despite being "experimental", there have been no breaking changes to DataFrame from 1.3 to 1.6.

>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug fixing. We're on the same page now.

>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <kos...@cloudera.com> wrote:

>>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in z releases. The Dataset API is experimental and so we might be changing the APIs before we declare it stable. This is why I think it is important to first stabilize the Dataset API with a Spark 1.7 release before moving to Spark 2.0. This will benefit users that would like to use the new Dataset APIs but can't move to Spark 2.0 because of the backwards-incompatible changes, like removal of deprecated APIs, the move to Scala 2.11, etc.

>>>>> Kostas
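For context on the API break Reynold mentions above (DataFrame.map returning Dataset rather than RDD in 2.0), a rough before/after sketch of the signatures involved; this is a simplification for illustration, not the actual Spark source:

    // Sketch only: simplified signatures inferred from the discussion above.

    // Spark 1.x -- map on a DataFrame drops back down to an RDD:
    //   class DataFrame {
    //     def map[R: ClassTag](f: Row => R): RDD[R]
    //   }

    // Proposed for Spark 2.x -- map stays inside the Dataset API:
    //   class Dataset[T] {
    //     def map[U: Encoder](f: T => U): Dataset[U]
    //   }

    // The same call site therefore changes its result type:
    //   val names = df.map(_.getString(0))
    // from RDD[String] in 1.x to Dataset[String] in 2.x, which is the
    // source-level break being discussed.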
>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

>>>>>> Why does stabilization of those two features require a 1.7 release instead of 1.6.1?

>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <kos...@cloudera.com> wrote:

>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes, we can talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0. I'd like to propose we have one more 1.x release after Spark 1.6. This will allow us to stabilize a few of the new features that were added in 1.6:

>>>>>>> 1) the experimental Datasets API
>>>>>>> 2) the new unified memory manager

>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition, but there will be users that won't be able to seamlessly upgrade given what we have discussed as in scope for 2.0. For these users, having a 1.x release with these new features/APIs stabilized will be very beneficial. This might make Spark 1.7 a lighter release, but that is not necessarily a bad thing.

>>>>>>> Any thoughts on this timeline?

>>>>>>> Kostas Sakellis

>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

>>>>>>>> Agree, more features/APIs/optimizations need to be added in DF/DS.

>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to provide to developers; maybe the fundamental APIs are enough, like ShuffledRDD etc. But PairRDDFunctions is probably not in this category, as we can do the same thing easily with DF/DS, with even better performance.

>>>>>>>> From: Mark Hamstra [mailto:m...@clearstorydata.com]
>>>>>>>> Sent: Friday, November 13, 2015 11:23 AM
>>>>>>>> To: Stephen Boesch
>>>>>>>> Cc: dev@spark.apache.org
>>>>>>>> Subject: Re: A proposal for Spark 2.0

>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that argues for retaining the RDD API but not as the first thing presented to new Spark developers: "Here's how to use groupBy with DataFrames.... Until the optimizer is more fully developed, that won't always get you the best performance that can be obtained. In these particular circumstances, ..., you may want to use the low-level RDD API while setting preservesPartitioning to true. Like this...."

>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <java...@gmail.com> wrote:

>>>>>>>> My understanding is that RDDs presently have more support for complete control of partitioning, which is a key consideration at scale. While partitioning control is still piecemeal in DF/DS, it would seem premature to make RDDs a second-tier approach to Spark development.

>>>>>>>> An example is the use of groupBy when we know that the source relation (/RDD) is already partitioned on the grouping expressions. AFAIK Spark SQL still does not allow that knowledge to be applied by the optimizer, so a full shuffle will be performed. However, in the native RDD API we can use preservesPartitioning=true.
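A minimal RDD-level sketch of the pattern Stephen and Mark are describing, with made-up names and partition counts; the point is that a pair RDD already carrying a partitioner on the grouping key lets groupByKey reuse it instead of shuffling again, and preservesPartitioning keeps that partitioner alive through later per-partition maps:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // Hypothetical (key, event) data; names and numbers are illustrative only.
    def groupEvents(events: RDD[(Long, String)]): RDD[(Long, Int)] = {
      // Shuffle once, up front, to co-locate records by key.
      val partitioned = events.partitionBy(new HashPartitioner(200)).cache()

      // `partitioned` already has a partitioner on the grouping key, so
      // groupByKey reuses it and does not trigger another full shuffle.
      val grouped = partitioned.groupByKey()

      // preservesPartitioning = true tells Spark the keys are unchanged,
      // so the existing partitioner survives this transformation as well.
      grouped.mapPartitions(
        iter => iter.map { case (k, vs) => (k, vs.size) },
        preservesPartitioning = true)
    }

As of this thread, the DataFrame groupBy path cannot exploit that pre-existing partitioning, which is the gap being pointed out.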
>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <m...@clearstorydata.com>:

>>>>>>>> The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. The RDD API came well before DataFrames and DataSets, so programming guides, introductory how-to articles and the like have, to this point, also tended to emphasize RDDs -- or at least to deal with them early. What I'm thinking is that with 2.0 maybe we should overhaul all the documentation to de-emphasize and reposition RDDs. In this scheme, DataFrames and DataSets would be introduced and fully addressed before RDDs. They would be presented as the normal/default/standard way to do things in Spark. RDDs, in contrast, would be presented later as a kind of lower-level, closer-to-the-metal API that can be used in atypical, more specialized contexts where DataFrames or DataSets don't fully fit.

>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

>>>>>>>> I am not sure what the best practice is for this specific problem, but it's really worth thinking about in 2.0, as it is a painful issue for lots of users.

>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or internal API only)? Lots of its functionality overlaps with DataFrame or DataSet.

>>>>>>>> Hao

>>>>>>>> From: Kostas Sakellis [mailto:kos...@cloudera.com]
>>>>>>>> Sent: Friday, November 13, 2015 5:27 AM
>>>>>>>> To: Nicholas Chammas
>>>>>>>> Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; Reynold Xin
>>>>>>>> Subject: Re: A proposal for Spark 2.0

>>>>>>>> I know we want to keep breaking changes to a minimum, but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allow any Spark transitive dependencies to leak into user code. For backwards compatibility we can have a whitelist if we want, but it'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction.

>>>>>>>> Kostas
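The settings Kostas refers to already exist as opt-in flags on 1.x (off by default); a minimal sketch of turning them on today, assuming his proposal is essentially to flip these defaults in 2.0:

    import org.apache.spark.SparkConf

    // Sketch: prefer classes from the user's jars over Spark's own
    // transitive dependencies, on both the driver and the executors.
    val conf = new SparkConf()
      .setAppName("classpath-isolation-example")   // illustrative name
      .set("spark.driver.userClassPathFirst", "true")
      .set("spark.executor.userClassPathFirst", "true")

The same flags can also be passed to spark-submit with --conf; with them enabled, an application has to declare every dependency it actually uses rather than relying on whatever leaks in from Spark's classpath.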
>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

>>>>>>>> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems to be somewhat confusing.

>>>>>>>> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

>>>>>>>> On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn't have to wait for everything that we want to deprecate to be replaced all at once.

>>>>>>>> Nick

>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <alexander.ula...@hpe.com> wrote:

>>>>>>>> Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time, I would be happy to have that feature.

>>>>>>>> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems to be somewhat confusing.

>>>>>>>> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

>>>>>>>> Best regards, Alexander

>>>>>>>> From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
>>>>>>>> Sent: Thursday, November 12, 2015 7:28 AM
>>>>>>>> To: wi...@qq.com
>>>>>>>> Cc: dev@spark.apache.org
>>>>>>>> Subject: Re: A proposal for Spark 2.0

>>>>>>>> Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?

>>>>>>>> Best,

>>>>>>>> --
>>>>>>>> Nan Zhu
>>>>>>>> http://codingcat.me

>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:

>>>>>>>> Who has ideas about machine learning? Spark is missing some features for machine learning, for example, the parameter server.

>>>>>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:

>>>>>>>> I like the idea of popping out Tachyon to an optional component too, to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.

>>>>>>>> Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on it? I imagine we don't need many code changes, just maybe some REPL stuff.

>>>>>>>> Needless to say, I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.

>>>>>>>> Matei
>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:

>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:

>>>>>>>>> to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).

>>>>>>>> Agree with this stance. Generally, a major release might also be a time to replace some big old API or implementation with a new one, but I don't see obvious candidates.

>>>>>>>> I wouldn't mind turning attention to 2.x sooner than later, unless there's a fairly good reason to continue adding features in 1.x to a 1.7 release. The scope as of 1.6 is already pretty darned big.

>>>>>>>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has been end-of-life.

>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be quite stable, and 2.10 will have been EOL for a while. I'd propose dropping 2.10. Otherwise it's supported for 2 more years.

>>>>>>>>> 2. Remove Hadoop 1 support.

>>>>>>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were sort of 'alpha' and 'beta' releases) and even <2.6.

>>>>>>>> I'm sure we'll think of a number of other small things -- shading a bunch of stuff? reviewing and updating dependencies in light of simpler, more recent dependencies to support from Hadoop etc?

>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>>>>>>>> Pop out any Docker stuff to another repo?
>>>>>>>> Continue that same effort for EC2?
>>>>>>>> Farming out some of the "external" integrations to another repo (? controversial)

>>>>>>>> See also anything marked version "2+" in JIRA.