Ah, got it; by "stabilize" you meant changing the API, not just bug fixing. We're on the same page now.
On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <kos...@cloudera.com> wrote:

> A 1.6.x release will only fix bugs - we typically don't change APIs in z
> releases. The Dataset API is experimental, and so we might be changing the
> APIs before we declare it stable. This is why I think it is important to
> first stabilize the Dataset API with a Spark 1.7 release before moving to
> Spark 2.0. This will benefit users who would like to use the new Dataset
> APIs but can't move to Spark 2.0 because of the backwards-incompatible
> changes, like removal of deprecated APIs, Scala 2.11, etc.
>
> Kostas
>
> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> Why does stabilization of those two features require a 1.7 release
>> instead of 1.6.1?
>>
>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <kos...@cloudera.com> wrote:
>>
>>> We have veered off the topic of Spark 2.0 a little bit here - yes, we
>>> can talk about RDDs vs. DS/DF more, but let's refocus on Spark 2.0. I'd
>>> like to propose we have one more 1.x release after Spark 1.6. This will
>>> allow us to stabilize a few of the new features that were added in 1.6:
>>>
>>> 1) the experimental Datasets API
>>> 2) the new unified memory manager
>>>
>>> I understand our goal for Spark 2.0 is to offer an easy transition, but
>>> there will be users who won't be able to seamlessly upgrade given what
>>> we have discussed as in scope for 2.0. For these users, having a 1.x
>>> release with these new features/APIs stabilized will be very beneficial.
>>> This might make Spark 1.7 a lighter release, but that is not necessarily
>>> a bad thing.
>>>
>>> Any thoughts on this timeline?
>>>
>>> Kostas Sakellis
>>>
>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>>
>>>> Agreed, more features/APIs/optimizations need to be added in DF/DS.
>>>>
>>>> I mean, we need to think about what kind of RDD APIs we have to provide
>>>> to developers; maybe the fundamental API is enough, like ShuffledRDD,
>>>> etc. But PairRDDFunctions is probably not in this category, as we can
>>>> do the same thing easily with DF/DS, with even better performance.
>>>>
>>>> *From:* Mark Hamstra [mailto:m...@clearstorydata.com]
>>>> *Sent:* Friday, November 13, 2015 11:23 AM
>>>> *To:* Stephen Boesch
>>>> *Cc:* dev@spark.apache.org
>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>
>>>> Hmmm... to me, that seems like precisely the kind of thing that argues
>>>> for retaining the RDD API, but not as the first thing presented to new
>>>> Spark developers: "Here's how to use groupBy with DataFrames.... Until
>>>> the optimizer is more fully developed, that won't always get you the
>>>> best performance that can be obtained. In these particular
>>>> circumstances, ..., you may want to use the low-level RDD API while
>>>> setting preservesPartitioning to true. Like this...."
>>>>
>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <java...@gmail.com> wrote:
>>>>
>>>> My understanding is that RDDs presently have more support for complete
>>>> control of partitioning, which is a key consideration at scale. While
>>>> partitioning control is still piecemeal in DF/DS, it would seem
>>>> premature to make RDDs a second-tier approach to Spark development.
>>>>
>>>> An example is the use of groupBy when we know that the source relation
>>>> (/RDD) is already partitioned on the grouping expressions. AFAIK Spark
>>>> SQL still does not allow that knowledge to be applied to the optimizer,
>>>> so a full shuffle will be performed. However, with the native RDD API
>>>> we can use preservesPartitioning=true.
>>>>
>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <m...@clearstorydata.com>:
>>>>
>>>> The place of the RDD API in 2.0 is also something I've been wondering
>>>> about. I think it may be going too far to deprecate it, but changing
>>>> emphasis is something that we might consider. The RDD API came well
>>>> before DataFrames and Datasets, so programming guides, introductory
>>>> how-to articles and the like have, to this point, also tended to
>>>> emphasize RDDs -- or at least to deal with them early. What I'm
>>>> thinking is that with 2.0 maybe we should overhaul all the
>>>> documentation to de-emphasize and reposition RDDs. In this scheme,
>>>> DataFrames and Datasets would be introduced and fully addressed before
>>>> RDDs. They would be presented as the normal/default/standard way to do
>>>> things in Spark. RDDs, in contrast, would be presented later as a kind
>>>> of lower-level, closer-to-the-metal API that can be used in atypical,
>>>> more specialized contexts where DataFrames or Datasets don't fully fit.
>>>>
>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>>>
>>>> I am not sure what the best practice for this specific problem is, but
>>>> it's really worth thinking about in 2.0, as it is a painful issue for
>>>> lots of users.
>>>>
>>>> By the way, is it also an opportunity to deprecate the RDD API (or the
>>>> internal API only?), as lots of its functionality overlaps with
>>>> DataFrames and Datasets?
>>>>
>>>> Hao
>>>>
>>>> *From:* Kostas Sakellis [mailto:kos...@cloudera.com]
>>>> *Sent:* Friday, November 13, 2015 5:27 AM
>>>> *To:* Nicholas Chammas
>>>> *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org;
>>>> Reynold Xin
>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>
>>>> I know we want to keep breaking changes to a minimum, but I'm hoping
>>>> that with Spark 2.0 we can also look at better classpath isolation for
>>>> user programs. I propose we build on
>>>> spark.{driver|executor}.userClassPathFirst, setting it to true by
>>>> default, and not allow any Spark transitive dependencies to leak into
>>>> user code. For backwards compatibility we can have a whitelist if we
>>>> want, but it'd be good if we start requiring user apps to explicitly
>>>> pull in all their dependencies. From what I can tell, Hadoop 3 is also
>>>> moving in this direction.
>>>>
>>>> Kostas
>>>>
>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>
>>>> With regards to machine learning, it would be great to move useful
>>>> features from MLlib to ML and deprecate the former. The current
>>>> structure of two separate machine learning packages seems to be
>>>> somewhat confusing.
>>>>
>>>> With regards to GraphX, it would be great to deprecate the use of RDDs
>>>> in GraphX and switch to DataFrames. This will allow GraphX to evolve
>>>> with Tungsten.
>>>>
>>>> On that note of deprecating stuff, it might be good to deprecate some
>>>> things in 2.0 without removing or replacing them immediately. That way
>>>> 2.0 doesn't have to wait for everything that we want to deprecate to be
>>>> replaced all at once.
>>>>
>>>> Nick
>>>>
>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <alexander.ula...@hpe.com> wrote:
>>>>
>>>> Parameter Server is a new feature and thus does not match the goal of
>>>> 2.0, which is “to fix things that are broken in the current API and
>>>> remove certain deprecated APIs”. At the same time, I would be happy to
>>>> have that feature.
>>>>
>>>> With regards to machine learning, it would be great to move useful
>>>> features from MLlib to ML and deprecate the former. The current
>>>> structure of two separate machine learning packages seems to be
>>>> somewhat confusing.
>>>>
>>>> With regards to GraphX, it would be great to deprecate the use of RDDs
>>>> in GraphX and switch to DataFrames. This will allow GraphX to evolve
>>>> with Tungsten.
>>>>
>>>> Best regards, Alexander
>>>>
>>>> *From:* Nan Zhu [mailto:zhunanmcg...@gmail.com]
>>>> *Sent:* Thursday, November 12, 2015 7:28 AM
>>>> *To:* wi...@qq.com
>>>> *Cc:* dev@spark.apache.org
>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>
>>>> Being specific to Parameter Server, I think the current agreement is
>>>> that PS shall exist as a third-party library instead of a component of
>>>> the core code base, isn't it?
>>>>
>>>> Best,
>>>>
>>>> --
>>>> Nan Zhu
>>>> http://codingcat.me
>>>>
>>>> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>>>>
>>>> Does anyone have ideas about machine learning? Spark is missing some
>>>> features for machine learning, for example, the parameter server.
>>>>
>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>
>>>> I like the idea of popping Tachyon out into an optional component too,
>>>> to reduce the number of dependencies. In the future, it might even be
>>>> useful to do this for Hadoop, but it requires too many API changes to
>>>> be worth doing now.
>>>>
>>>> Regarding Scala 2.12, we should definitely support it eventually, but I
>>>> don't think we need to block 2.0 on that because it can be added later
>>>> too. Has anyone investigated what it would take to run on it? I imagine
>>>> we don't need many code changes, just maybe some REPL stuff.
>>>>
>>>> Needless to say, I'm all for the idea of making "major" releases as
>>>> undisruptive as possible in the model Reynold proposed. Keeping
>>>> everyone working with the same set of releases is super important.
>>>>
>>>> Matei
>>>>
>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>> to the Spark community. A major release should not be very different
>>>> from a minor release and should not be gated based on new features. The
>>>> main purpose of a major release is an opportunity to fix things that
>>>> are broken in the current API and remove certain deprecated APIs
>>>> (examples follow).
>>>>
>>>> Agree with this stance. Generally, a major release might also be a time
>>>> to replace some big old API or implementation with a new one, but I
>>>> don't see obvious candidates.
>>>>
>>>> I wouldn't mind turning attention to 2.x sooner rather than later,
>>>> unless there's a fairly good reason to continue adding features in 1.x
>>>> to a 1.7 release. The scope as of 1.6 is already pretty darned big.
>>>>
>>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>>>> but it has reached end-of-life.
>>>>
>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>>>> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>>>> dropping 2.10. Otherwise it's supported for 2 more years.
>>>>
>>>> 2. Remove Hadoop 1 support.
>>>>
>>>> I'd go further and drop support for <2.2 for sure (2.0 and 2.1 were
>>>> sort of 'alpha' and 'beta' releases), and even <2.6.
>>>>
>>>> I'm sure we'll think of a number of other small things -- shading a
>>>> bunch of stuff? reviewing and updating dependencies in light of the
>>>> simpler, more recent dependencies we'd need to support from Hadoop, etc.?
>>>>
>>>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>>>> Pop out any Docker stuff into another repo?
>>>> Continue that same effort for EC2?
>>>> Farm out some of the "external" integrations to another repo
>>>> (? controversial)
>>>>
>>>> See also anything marked version "2+" in JIRA.
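P.S. For anyone skimming the groupBy / preservesPartitioning subthread above, here is a
minimal sketch of the RDD-level pattern Stephen and Mark are describing (the object name,
sample data, and partition count are made up for illustration): partition once by the
grouping key, keep the partitioner attached through per-partition transformations, and a
later by-key operation then runs without a second shuffle.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object GroupByWithoutReshuffle {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("preserves-partitioning-sketch")
        val sc = new SparkContext(conf)

        // Made-up (key, value) pairs standing in for data that an upstream job
        // would, in practice, already have partitioned by key.
        val events = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))
          .partitionBy(new HashPartitioner(8)) // the one shuffle; establishes the partitioner
          .cache()

        // A per-partition transformation that does not change keys. Passing
        // preservesPartitioning = true keeps the HashPartitioner attached to the result.
        val scaled = events.mapPartitions(
          iter => iter.map { case (k, v) => (k, v * 2) },
          preservesPartitioning = true)

        // Because 'scaled' still carries the partitioner, groupByKey reuses it and does
        // not shuffle again; dropping the flag above would discard the partitioner and
        // reintroduce a full shuffle here.
        val grouped = scaled.groupByKey()
        grouped.mapValues(_.sum).collect().foreach(println)

        sc.stop()
      }
    }

As Stephen notes above, the equivalent DataFrame groupBy currently can't take advantage
of that existing partitioning, which is the optimizer gap under discussion.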
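And on the classpath isolation point: spark.driver.userClassPathFirst and
spark.executor.userClassPathFirst already exist as opt-in (experimental) settings, so the
proposal above is about flipping the default and tightening what leaks through rather than
adding new API. A rough sketch of how a user application opts in today (the app name is
made up):

    import org.apache.spark.{SparkConf, SparkContext}

    object UserClassPathFirstSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("classpath-isolation-sketch")
          // Prefer classes from the user's jars over Spark's own jars (and
          // Spark's transitive dependencies) when loading classes in executors.
          .set("spark.executor.userClassPathFirst", "true")

        // The driver-side flag has to be in place before the driver JVM starts,
        // so it is normally passed via spark-submit or spark-defaults.conf, e.g.:
        //   spark-submit --conf spark.driver.userClassPathFirst=true \
        //                --conf spark.executor.userClassPathFirst=true ...
        val sc = new SparkContext(conf)

        // ... application code that ships its own versions of libraries
        // (Guava, Jackson, ...) without colliding with the copies Spark uses ...

        sc.stop()
      }
    }

Making that behavior the default, as Kostas suggests, would mean user jars always win and
Spark's transitive dependencies stay hidden from application code unless explicitly
whitelisted.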