Ah, got it; by "stabilize" you meant changing the API, not just bug fixing. We're on the same page now.
On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <kos...@cloudera.com> wrote:

> A 1.6.x release will only fix bugs - we typically don't change APIs in z
> releases. The Dataset API is experimental, and so we might be changing the
> APIs before we declare it stable. This is why I think it is important to
> first stabilize the Dataset API with a Spark 1.7 release before moving to
> Spark 2.0. This will benefit users who would like to use the new Dataset
> APIs but can't move to Spark 2.0 because of the backwards-incompatible
> changes, like removal of deprecated APIs, Scala 2.11, etc.
>
> Kostas
>
> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> Why does stabilization of those two features require a 1.7 release
>> instead of 1.6.1?
>>
>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <kos...@cloudera.com> wrote:
>>
>>> We have veered off the topic of Spark 2.0 a little bit here - yes, we
>>> can talk about RDDs vs. DS/DF more, but let's refocus on Spark 2.0. I'd
>>> like to propose we have one more 1.x release after Spark 1.6. This will
>>> allow us to stabilize a few of the new features that were added in 1.6:
>>>
>>> 1) the experimental Datasets API
>>> 2) the new unified memory manager
>>>
>>> I understand our goal for Spark 2.0 is to offer an easy transition, but
>>> there will be users who won't be able to seamlessly upgrade given what
>>> we have discussed as in scope for 2.0. For these users, having a 1.x
>>> release with these new features/APIs stabilized will be very beneficial.
>>> This might make Spark 1.7 a lighter release, but that is not necessarily
>>> a bad thing.
>>>
>>> Any thoughts on this timeline?
>>>
>>> Kostas Sakellis
>>>
>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>>
>>>> Agreed, more features/APIs/optimizations need to be added in DF/DS.
>>>>
>>>> I mean, we need to think about what kind of RDD APIs we have to provide
>>>> to developers; maybe the fundamental API is enough, like ShuffledRDD,
>>>> etc. But PairRDDFunctions is probably not in this category, as we can
>>>> do the same thing easily with DF/DS, with even better performance.
>>>>
>>>> *From:* Mark Hamstra [mailto:m...@clearstorydata.com]
>>>> *Sent:* Friday, November 13, 2015 11:23 AM
>>>> *To:* Stephen Boesch
>>>> *Cc:* dev@spark.apache.org
>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>
>>>> Hmmm... to me, that seems like precisely the kind of thing that argues
>>>> for retaining the RDD API, but not as the first thing presented to new
>>>> Spark developers: "Here's how to use groupBy with DataFrames.... Until
>>>> the optimizer is more fully developed, that won't always get you the
>>>> best performance that can be obtained. In these particular
>>>> circumstances, ..., you may want to use the low-level RDD API while
>>>> setting preservesPartitioning to true. Like this...."
>>>>
>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <java...@gmail.com> wrote:
>>>>
>>>> My understanding is that RDDs presently have more support for complete
>>>> control of partitioning, which is a key consideration at scale. While
>>>> partitioning control is still piecemeal in DF/DS, it would seem
>>>> premature to make RDDs a second-tier approach to Spark development.
>>>>
>>>> An example is the use of groupBy when we know that the source relation
>>>> (/RDD) is already partitioned on the grouping expressions. AFAIK Spark
>>>> SQL still does not allow that knowledge to be applied to the optimizer,
>>>> so a full shuffle will be performed. However, with the native RDD API
>>>> we can use preservesPartitioning=true.
>>>>
>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <m...@clearstorydata.com>:
>>>>
>>>> The place of the RDD API in 2.0 is also something I've been wondering
>>>> about. I think it may be going too far to deprecate it, but changing
>>>> emphasis is something that we might consider. The RDD API came well
>>>> before DataFrames and Datasets, so programming guides, introductory
>>>> how-to articles and the like have, to this point, also tended to
>>>> emphasize RDDs -- or at least to deal with them early. What I'm
>>>> thinking is that with 2.0 maybe we should overhaul all the
>>>> documentation to de-emphasize and reposition RDDs. In this scheme,
>>>> DataFrames and Datasets would be introduced and fully addressed before
>>>> RDDs. They would be presented as the normal/default/standard way to do
>>>> things in Spark. RDDs, in contrast, would be presented later as a kind
>>>> of lower-level, closer-to-the-metal API that can be used in atypical,
>>>> more specialized contexts where DataFrames or Datasets don't fully fit.
>>>>
>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>>>
>>>> I am not sure what the best practice for this specific problem is, but
>>>> it's really worth thinking about in 2.0, as it is a painful issue for
>>>> lots of users.
>>>>
>>>> By the way, is it also an opportunity to deprecate the RDD API (or the
>>>> internal API only?), as lots of its functionality overlaps with
>>>> DataFrames and Datasets?
>>>>
>>>> Hao
>>>>
>>>> *From:* Kostas Sakellis [mailto:kos...@cloudera.com]
>>>> *Sent:* Friday, November 13, 2015 5:27 AM
>>>> *To:* Nicholas Chammas
>>>> *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org;
>>>> Reynold Xin
>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>
>>>> I know we want to keep breaking changes to a minimum, but I'm hoping
>>>> that with Spark 2.0 we can also look at better classpath isolation for
>>>> user programs. I propose we build on
>>>> spark.{driver|executor}.userClassPathFirst, setting it to true by
>>>> default, and not allow any Spark transitive dependencies to leak into
>>>> user code. For backwards compatibility we can have a whitelist if we
>>>> want, but it'd be good if we start requiring user apps to explicitly
>>>> pull in all their dependencies. From what I can tell, Hadoop 3 is also
>>>> moving in this direction.
>>>>
>>>> Kostas
>>>>
>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>
>>>> With regards to machine learning, it would be great to move useful
>>>> features from MLlib to ML and deprecate the former. The current
>>>> structure of two separate machine learning packages seems to be
>>>> somewhat confusing.
>>>>
>>>> With regards to GraphX, it would be great to deprecate the use of RDDs
>>>> in GraphX and switch to DataFrames. This will allow GraphX to evolve
>>>> with Tungsten.
>>>>
>>>> On that note of deprecating stuff, it might be good to deprecate some
>>>> things in 2.0 without removing or replacing them immediately. That way
>>>> 2.0 doesn't have to wait for everything that we want to deprecate to be
>>>> replaced all at once.
>>>>
>>>> Nick
>>>>
>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <alexander.ula...@hpe.com> wrote:
>>>>
>>>> Parameter Server is a new feature and thus does not match the goal of
>>>> 2.0, which is “to fix things that are broken in the current API and
>>>> remove certain deprecated APIs”. At the same time, I would be happy to
>>>> have that feature.
>>>>
>>>> With regards to machine learning, it would be great to move useful
>>>> features from MLlib to ML and deprecate the former. The current
>>>> structure of two separate machine learning packages seems to be
>>>> somewhat confusing.
>>>>
>>>> With regards to GraphX, it would be great to deprecate the use of RDDs
>>>> in GraphX and switch to DataFrames. This will allow GraphX to evolve
>>>> with Tungsten.
>>>>
>>>> Best regards, Alexander
>>>>
>>>> *From:* Nan Zhu [mailto:zhunanmcg...@gmail.com]
>>>> *Sent:* Thursday, November 12, 2015 7:28 AM
>>>> *To:* wi...@qq.com
>>>> *Cc:* dev@spark.apache.org
>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>
>>>> Being specific to Parameter Server, I think the current agreement is
>>>> that PS shall exist as a third-party library instead of a component of
>>>> the core code base, isn't it?
>>>>
>>>> Best,
>>>>
>>>> --
>>>> Nan Zhu
>>>> http://codingcat.me
>>>>
>>>> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>>>>
>>>> Does anyone have ideas about machine learning? Spark is missing some
>>>> features for machine learning, for example, the parameter server.
>>>>
>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>
>>>> I like the idea of popping Tachyon out into an optional component too,
>>>> to reduce the number of dependencies. In the future, it might even be
>>>> useful to do this for Hadoop, but it requires too many API changes to
>>>> be worth doing now.
>>>>
>>>> Regarding Scala 2.12, we should definitely support it eventually, but I
>>>> don't think we need to block 2.0 on that because it can be added later
>>>> too. Has anyone investigated what it would take to run on it? I imagine
>>>> we don't need many code changes, just maybe some REPL stuff.
>>>>
>>>> Needless to say, I'm all for the idea of making "major" releases as
>>>> undisruptive as possible in the model Reynold proposed. Keeping
>>>> everyone working with the same set of releases is super important.
>>>>
>>>> Matei
>>>>
>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>> to the Spark community. A major release should not be very different
>>>> from a minor release and should not be gated based on new features. The
>>>> main purpose of a major release is an opportunity to fix things that
>>>> are broken in the current API and remove certain deprecated APIs
>>>> (examples follow).
>>>>
>>>> Agree with this stance. Generally, a major release might also be a time
>>>> to replace some big old API or implementation with a new one, but I
>>>> don't see obvious candidates.
>>>>
>>>> I wouldn't mind turning attention to 2.x sooner rather than later,
>>>> unless there's a fairly good reason to continue adding features in 1.x
>>>> to a 1.7 release. The scope as of 1.6 is already pretty darned big.
>>>>
>>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>>>> but it has reached end-of-life.
>>>>
>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>>>> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>>>> dropping 2.10. Otherwise it's supported for 2 more years.
>>>>
>>>> 2. Remove Hadoop 1 support.
>>>>
>>>> I'd go further and drop support for <2.2 for sure (2.0 and 2.1 were
>>>> sort of 'alpha' and 'beta' releases), and even <2.6.
>>>>
>>>> I'm sure we'll think of a number of other small things -- shading a
>>>> bunch of stuff? reviewing and updating dependencies in light of the
>>>> simpler, more recent dependencies we'd need to support from Hadoop, etc.?
>>>>
>>>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>>>> Pop out any Docker stuff into another repo?
>>>> Continue that same effort for EC2?
>>>> Farm out some of the "external" integrations to another repo
>>>> (? controversial)
>>>>
>>>> See also anything marked version "2+" in JIRA.
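P.S. For anyone skimming the groupBy / preservesPartitioning subthread above, here is a
minimal sketch of the RDD-level pattern Stephen and Mark are describing (the object name,
sample data, and partition count are made up for illustration): partition once by the
grouping key, keep the partitioner attached through per-partition transformations, and a
later by-key operation then runs without a second shuffle.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object GroupByWithoutReshuffle {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("preserves-partitioning-sketch")
        val sc = new SparkContext(conf)

        // Made-up (key, value) pairs standing in for data that an upstream job
        // would, in practice, already have partitioned by key.
        val events = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))
          .partitionBy(new HashPartitioner(8)) // the one shuffle; establishes the partitioner
          .cache()

        // A per-partition transformation that does not change keys. Passing
        // preservesPartitioning = true keeps the HashPartitioner attached to the result.
        val scaled = events.mapPartitions(
          iter => iter.map { case (k, v) => (k, v * 2) },
          preservesPartitioning = true)

        // Because 'scaled' still carries the partitioner, groupByKey reuses it and does
        // not shuffle again; dropping the flag above would discard the partitioner and
        // reintroduce a full shuffle here.
        val grouped = scaled.groupByKey()
        grouped.mapValues(_.sum).collect().foreach(println)

        sc.stop()
      }
    }

As Stephen notes above, the equivalent DataFrame groupBy currently can't take advantage
of that existing partitioning, which is the optimizer gap under discussion.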
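And on the classpath isolation point: spark.driver.userClassPathFirst and
spark.executor.userClassPathFirst already exist as opt-in (experimental) settings, so the
proposal above is about flipping the default and tightening what leaks through rather than
adding new API. A rough sketch of how a user application opts in today (the app name is
made up):

    import org.apache.spark.{SparkConf, SparkContext}

    object UserClassPathFirstSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("classpath-isolation-sketch")
          // Prefer classes from the user's jars over Spark's own jars (and
          // Spark's transitive dependencies) when loading classes in executors.
          .set("spark.executor.userClassPathFirst", "true")

        // The driver-side flag has to be in place before the driver JVM starts,
        // so it is normally passed via spark-submit or spark-defaults.conf, e.g.:
        //   spark-submit --conf spark.driver.userClassPathFirst=true \
        //                --conf spark.executor.userClassPathFirst=true ...
        val sc = new SparkContext(conf)

        // ... application code that ships its own versions of libraries
        // (Guava, Jackson, ...) without colliding with the copies Spark uses ...

        sc.stop()
      }
    }

Making that behavior the default, as Kostas suggests, would mean user jars always win and
Spark's transitive dependencies stay hidden from application code unless explicitly
whitelisted.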