I also thought the idea was to drop 2.10. Do we want to cross-build for 3 Scala versions?

On Nov 25, 2015 3:54 AM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
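For reference, cross-building one code base against multiple Scala versions is typically declared along these lines in sbt; this is a generic sketch with illustrative version numbers, not Spark's actual build configuration (Spark's release builds are Maven-based):

    // build.sbt -- generic sketch, not Spark's real build definition
    name := "example-app"                         // placeholder project name

    // One artifact is produced per entry when running `sbt +package`:
    crossScalaVersions := Seq("2.10.6", "2.11.7", "2.12.0-M3")

    // Default version used by a plain `sbt compile`:
    scalaVersion := "2.11.7"

Each additional entry in crossScalaVersions is another build, test, and publish row to maintain, which is the cost being weighed in this part of the thread.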
> I see. My concern is / was that cluster operators will be reluctant to upgrade to 2.0, meaning that developers using those clusters need to stay on 1.x, and, if they want to move to DataFrames, essentially need to port their app twice.

> I misunderstood and thought part of the proposal was to drop support for 2.10, though. If your broad point is that there aren't changes in 2.0 that will make it less palatable to cluster administrators than releases in the 1.x line, then yes, 2.0 as the next release sounds fine to me.

> -Sandy

> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

>> What are the other breaking changes in 2.0 though? Note that we're not removing Scala 2.10, we're just making the default build be against Scala 2.11 instead of 2.10. There seem to be very few changes that people would worry about. If people are going to update their apps, I think it's better to make the other small changes in 2.0 at the same time than to update once for Dataset and another time for 2.0.

>> BTW just refer to Reynold's original post for the other proposed API changes.

>> Matei

>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

>> I think that Kostas' logic still holds. The majority of Spark users, and likely an even vaster majority of people running large jobs, are still on RDDs and on the cusp of upgrading to DataFrames. Users will probably want to upgrade to the stable version of the Dataset / DataFrame API so they don't need to do so twice. Requiring that they absorb all the other ways that Spark breaks compatibility in the move to 2.0 makes it much more difficult for them to make this transition.

>> Using the same set of APIs also means that it will be easier to backport critical fixes to the 1.x line.

>> It's not clear to me that avoiding breakage of an experimental API in the 1.x line outweighs these issues.

>> -Sandy

>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <r...@databricks.com> wrote:

>>> I actually think the next one (after 1.6) should be Spark 2.0. The reason is that I already know we have to break some part of the DataFrame/Dataset API as part of the Dataset design (e.g. DataFrame.map should return Dataset rather than RDD). In that case, I'd rather break this sooner (in one release) than later (in two releases), so the damage is smaller.

>>> I don't think whether we call Dataset/DataFrame experimental or not matters too much for 2.0. We can still call Dataset experimental in 2.0 and then mark them as stable in 2.1. Despite being "experimental", there have been no breaking changes to DataFrame from 1.3 to 1.6.

>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug fixing. We're on the same page now.

>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <kos...@cloudera.com> wrote:

>>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in z releases. The Dataset API is experimental and so we might be changing the APIs before we declare it stable. This is why I think it is important to first stabilize the Dataset API with a Spark 1.7 release before moving to Spark 2.0. This will benefit users that would like to use the new Dataset APIs but can't move to Spark 2.0 because of the backwards-incompatible changes, like removal of deprecated APIs, the move to Scala 2.11, etc.

>>>>> Kostas
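For context on the API break Reynold mentions above (DataFrame.map returning Dataset rather than RDD in 2.0), a rough before/after sketch of the signatures involved; this is a simplification for illustration, not the actual Spark source:

    // Sketch only: simplified signatures inferred from the discussion above.

    // Spark 1.x -- map on a DataFrame drops back down to an RDD:
    //   class DataFrame {
    //     def map[R: ClassTag](f: Row => R): RDD[R]
    //   }

    // Proposed for Spark 2.x -- map stays inside the Dataset API:
    //   class Dataset[T] {
    //     def map[U: Encoder](f: T => U): Dataset[U]
    //   }

    // The same call site therefore changes its result type:
    //   val names = df.map(_.getString(0))
    // from RDD[String] in 1.x to Dataset[String] in 2.x, which is the
    // source-level break being discussed.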
>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

>>>>>> Why does stabilization of those two features require a 1.7 release instead of 1.6.1?

>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <kos...@cloudera.com> wrote:

>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes, we can talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0. I'd like to propose we have one more 1.x release after Spark 1.6. This will allow us to stabilize a few of the new features that were added in 1.6:

>>>>>>> 1) the experimental Datasets API
>>>>>>> 2) the new unified memory manager

>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition, but there will be users that won't be able to seamlessly upgrade given what we have discussed as in scope for 2.0. For these users, having a 1.x release with these new features/APIs stabilized will be very beneficial. This might make Spark 1.7 a lighter release, but that is not necessarily a bad thing.

>>>>>>> Any thoughts on this timeline?

>>>>>>> Kostas Sakellis

>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

>>>>>>>> Agree, more features/APIs/optimizations need to be added in DF/DS.

>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to provide to developers; maybe the fundamental APIs are enough, like ShuffledRDD etc. But PairRDDFunctions is probably not in this category, as we can do the same thing easily with DF/DS, with even better performance.

>>>>>>>> From: Mark Hamstra [mailto:m...@clearstorydata.com]
>>>>>>>> Sent: Friday, November 13, 2015 11:23 AM
>>>>>>>> To: Stephen Boesch
>>>>>>>> Cc: dev@spark.apache.org
>>>>>>>> Subject: Re: A proposal for Spark 2.0

>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that argues for retaining the RDD API but not as the first thing presented to new Spark developers: "Here's how to use groupBy with DataFrames.... Until the optimizer is more fully developed, that won't always get you the best performance that can be obtained. In these particular circumstances, ..., you may want to use the low-level RDD API while setting preservesPartitioning to true. Like this...."

>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <java...@gmail.com> wrote:

>>>>>>>> My understanding is that RDDs presently have more support for complete control of partitioning, which is a key consideration at scale. While partitioning control is still piecemeal in DF/DS, it would seem premature to make RDDs a second-tier approach to Spark development.

>>>>>>>> An example is the use of groupBy when we know that the source relation (/RDD) is already partitioned on the grouping expressions. AFAIK Spark SQL still does not allow that knowledge to be applied by the optimizer, so a full shuffle will be performed. However, in the native RDD API we can use preservesPartitioning=true.
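A minimal RDD-level sketch of the pattern Stephen and Mark are describing, with made-up names and partition counts; the point is that a pair RDD already carrying a partitioner on the grouping key lets groupByKey reuse it instead of shuffling again, and preservesPartitioning keeps that partitioner alive through later per-partition maps:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // Hypothetical (key, event) data; names and numbers are illustrative only.
    def groupEvents(events: RDD[(Long, String)]): RDD[(Long, Int)] = {
      // Shuffle once, up front, to co-locate records by key.
      val partitioned = events.partitionBy(new HashPartitioner(200)).cache()

      // `partitioned` already has a partitioner on the grouping key, so
      // groupByKey reuses it and does not trigger another full shuffle.
      val grouped = partitioned.groupByKey()

      // preservesPartitioning = true tells Spark the keys are unchanged,
      // so the existing partitioner survives this transformation as well.
      grouped.mapPartitions(
        iter => iter.map { case (k, vs) => (k, vs.size) },
        preservesPartitioning = true)
    }

As of this thread, the DataFrame groupBy path cannot exploit that pre-existing partitioning, which is the gap being pointed out.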
>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <m...@clearstorydata.com>:

>>>>>>>> The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. The RDD API came well before DataFrames and DataSets, so programming guides, introductory how-to articles and the like have, to this point, also tended to emphasize RDDs -- or at least to deal with them early. What I'm thinking is that with 2.0 maybe we should overhaul all the documentation to de-emphasize and reposition RDDs. In this scheme, DataFrames and DataSets would be introduced and fully addressed before RDDs. They would be presented as the normal/default/standard way to do things in Spark. RDDs, in contrast, would be presented later as a kind of lower-level, closer-to-the-metal API that can be used in atypical, more specialized contexts where DataFrames or DataSets don't fully fit.

>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

>>>>>>>> I am not sure what the best practice is for this specific problem, but it's really worth thinking about in 2.0, as it is a painful issue for lots of users.

>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or internal API only)? Lots of its functionality overlaps with DataFrame or DataSet.

>>>>>>>> Hao

>>>>>>>> From: Kostas Sakellis [mailto:kos...@cloudera.com]
>>>>>>>> Sent: Friday, November 13, 2015 5:27 AM
>>>>>>>> To: Nicholas Chammas
>>>>>>>> Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; Reynold Xin
>>>>>>>> Subject: Re: A proposal for Spark 2.0

>>>>>>>> I know we want to keep breaking changes to a minimum, but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allow any Spark transitive dependencies to leak into user code. For backwards compatibility we can have a whitelist if we want, but it'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction.

>>>>>>>> Kostas
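The settings Kostas refers to already exist as opt-in flags on 1.x (off by default); a minimal sketch of turning them on today, assuming his proposal is essentially to flip these defaults in 2.0:

    import org.apache.spark.SparkConf

    // Sketch: prefer classes from the user's jars over Spark's own
    // transitive dependencies, on both the driver and the executors.
    val conf = new SparkConf()
      .setAppName("classpath-isolation-example")   // illustrative name
      .set("spark.driver.userClassPathFirst", "true")
      .set("spark.executor.userClassPathFirst", "true")

The same flags can also be passed to spark-submit with --conf; with them enabled, an application has to declare every dependency it actually uses rather than relying on whatever leaks in from Spark's classpath.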
>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

>>>>>>>> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems to be somewhat confusing.

>>>>>>>> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

>>>>>>>> On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn't have to wait for everything that we want to deprecate to be replaced all at once.

>>>>>>>> Nick

>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <alexander.ula...@hpe.com> wrote:

>>>>>>>> Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time, I would be happy to have that feature.

>>>>>>>> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems to be somewhat confusing.

>>>>>>>> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

>>>>>>>> Best regards, Alexander

>>>>>>>> From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
>>>>>>>> Sent: Thursday, November 12, 2015 7:28 AM
>>>>>>>> To: wi...@qq.com
>>>>>>>> Cc: dev@spark.apache.org
>>>>>>>> Subject: Re: A proposal for Spark 2.0

>>>>>>>> Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?

>>>>>>>> Best,

>>>>>>>> --
>>>>>>>> Nan Zhu
>>>>>>>> http://codingcat.me

>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:

>>>>>>>> Who has ideas about machine learning? Spark is missing some features for machine learning, for example, the parameter server.

>>>>>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:

>>>>>>>> I like the idea of popping out Tachyon to an optional component too, to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.

>>>>>>>> Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on it? I imagine we don't need many code changes, just maybe some REPL stuff.

>>>>>>>> Needless to say, I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.

>>>>>>>> Matei
>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:

>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:

>>>>>>>>> to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).

>>>>>>>> Agree with this stance. Generally, a major release might also be a time to replace some big old API or implementation with a new one, but I don't see obvious candidates.

>>>>>>>> I wouldn't mind turning attention to 2.x sooner than later, unless there's a fairly good reason to continue adding features in 1.x to a 1.7 release. The scope as of 1.6 is already pretty darned big.

>>>>>>>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has been end-of-life.

>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be quite stable, and 2.10 will have been EOL for a while. I'd propose dropping 2.10. Otherwise it's supported for 2 more years.

>>>>>>>>> 2. Remove Hadoop 1 support.

>>>>>>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were sort of 'alpha' and 'beta' releases) and even <2.6.

>>>>>>>> I'm sure we'll think of a number of other small things -- shading a bunch of stuff? reviewing and updating dependencies in light of simpler, more recent dependencies to support from Hadoop etc?

>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>>>>>>>> Pop out any Docker stuff to another repo?
>>>>>>>> Continue that same effort for EC2?
>>>>>>>> Farming out some of the "external" integrations to another repo (? controversial)

>>>>>>>> See also anything marked version "2+" in JIRA.