Re: A proposal for Spark 2.0
How about the Hive dependency? We use the ThriftServer, serdes, and even the parser/execution logic from Hive. What is the plan for this part?
Re: A proposal for Spark 2.0
Yeah, I'd also favor maintaining docs with strictly temporary relevance on JIRA when possible. The wiki is like this weird backwater I only rarely visit. Don't we typically do this kind of stuff with an umbrella issue on JIRA? Tom, wouldn't that work well for you?

Nick

On Wed, Dec 23, 2015 at 5:06 AM Sean Owen wrote:

> I think this will be hard to maintain; we already have JIRA as the de facto central place to store discussions and prioritize work, and the 2.x stuff is already a JIRA. The wiki doesn't really hurt, it just probably will never be looked at again. Let's point people in all cases to JIRA.

On Tue, Dec 22, 2015 at 11:52 PM, Reynold Xin wrote:

> I started a wiki page: https://cwiki.apache.org/confluence/display/SPARK/Development+Discussions

On Tue, Dec 22, 2015 at 6:27 AM, Tom Graves wrote:

> Do we have a summary of all the discussions and what is planned for 2.0, then? Perhaps we should put it on the wiki for reference.
>
> Tom

On Tuesday, December 22, 2015 12:12 AM, Reynold Xin wrote:

> FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.

On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin wrote:

> I'm starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature requests in this thread. Not that we shouldn't be adding features, but we can always add features in 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break substantial APIs or perform large re-architecting that prevents users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we've released many architectural changes on the 1.x line.
>
> If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment:
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.
>
> 2. Remove Akka from Spark's API dependency (in streaming), so user applications can use Akka (SPARK-5293). We have gotten a lot of complaints about user applications being unable to use Akka due to Spark's dependency on Akka.
>
> 3. Remove Guava from Spark's public API (JavaRDD Optional).
>
> 4. Better class package structure for low-level developer APIs. In particular, we have some DeveloperApi (mostly various listener-related classes) added over the years. Some packages include only one or two public classes but a lot of private classes. A better structure is to have public classes isolated to a few public packages, and these public packages should have minimal private classes for low-level developer APIs.
>
> 5. Consolidate the task metric and accumulator APIs. Although they have some subtle differences, these two are very similar but have completely different code paths.
>
> 6. Possibly make Catalyst, Dataset, and DataFrame more general by moving them to other package(s). They are already used beyond SQL, e.g. in ML pipelines, and will be used by streaming also.
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has reached end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don't require building an enormous assembly jar in order to run Spark.
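Point 3 above refers to Java API methods (e.g. the value side of a left outer join on a JavaPairRDD) whose signatures expose Guava's com.google.common.base.Optional, which forces a particular Guava version onto every user application. A minimal sketch of what a Spark-owned replacement could look like - all class and method names here are illustrative, not the actual API that shipped:

```scala
// Sketch: a Spark-owned Optional that could replace Guava's
// com.google.common.base.Optional in public Java-facing signatures.
// "JOptional" and its methods are hypothetical names for illustration.
final class JOptional[T](private val value: Option[T]) {
  def isPresent: Boolean = value.isDefined
  def get(): T =
    value.getOrElse(throw new NoSuchElementException("value is absent"))
  def or(default: T): T = value.getOrElse(default)
  override def toString: String =
    value.fold("Optional.absent()")(v => s"Optional.of($v)")
}

object JOptional {
  def of[T](v: T): JOptional[T] = {
    require(v != null, "of() does not accept null")
    new JOptional(Some(v))
  }
  def absent[T](): JOptional[T] = new JOptional[T](None)
  // Bridge from Scala's Option, e.g. when wrapping a left-outer-join result
  // for the Java API.
  def fromOption[T](o: Option[T]): JOptional[T] = new JOptional(o)
}

// What a Java-facing outer-join value could look like:
val matched   = JOptional.fromOption(Some("payload"))
val unmatched = JOptional.fromOption(None: Option[String])
println(matched.isPresent)       // true
println(unmatched.or("default")) // default
```

The key design point is that the class lives in a Spark-controlled package, so its binary signature never changes when the user's Guava version does.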
Re: A proposal for Spark 2.0
I think this will be hard to maintain; we already have JIRA as the de facto central place to store discussions and prioritize work, and the 2.x stuff is already a JIRA. The wiki doesn't really hurt, just probably will never be looked at again. Let's point people in all cases to JIRA.

On Tue, Dec 22, 2015 at 11:52 PM, Reynold Xin wrote:

> I started a wiki page: https://cwiki.apache.org/confluence/display/SPARK/Development+Discussions

[...]
Re: A proposal for Spark 2.0
I started a wiki page: https://cwiki.apache.org/confluence/display/SPARK/Development+Discussions

On Tue, Dec 22, 2015 at 6:27 AM, Tom Graves wrote:

> Do we have a summary of all the discussions and what is planned for 2.0 then? Perhaps we should put on the wiki for reference.
>
> Tom

[...]
Re: A proposal for Spark 2.0
Do we have a summary of all the discussions and what is planned for 2.0 then? Perhaps we should put on the wiki for reference.

Tom

On Tuesday, December 22, 2015 12:12 AM, Reynold Xin wrote:

> FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.

[...]
Re: A proposal for Spark 2.0
Thanks for your quick response. OK, I will start a new thread with my thoughts.

Thanks,
Allen

At 2015-12-22 15:19:49, "Reynold Xin" wrote:

> I'm not sure if we need special API support for GPUs. You can already use GPUs on individual executor nodes to build your own applications. [...] If you want to discuss more about GPUs, please start a new thread.

[...]
Re: A proposal for Spark 2.0
I'm not sure if we need special API support for GPUs. You can already use GPUs on individual executor nodes to build your own applications. If we want to leverage GPUs out of the box, I don't think the solution is to provide GPU-specific APIs. Rather, we should just switch the underlying execution to GPUs when it is more optimal. Anyway, I don't want to distract from this topic. If you want to discuss more about GPUs, please start a new thread.

On Mon, Dec 21, 2015 at 11:18 PM, Allen Zhang wrote:

> plus dev
>
> On 2015-12-22 15:15:59, "Allen Zhang" wrote:
>
> Hi Reynold,
>
> Any new API support for GPU computing in our new 2.0 version?
>
> -Allen

[...]
Re: A proposal for Spark 2.0
plus dev

On 2015-12-22 15:15:59, "Allen Zhang" wrote:

> Hi Reynold,
>
> Any new API support for GPU computing in our new 2.0 version?
>
> -Allen

On 2015-12-22 14:12:50, "Reynold Xin" wrote:

> FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.

[...]
Re: A proposal for Spark 2.0
FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.

On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin wrote:

> I'm starting a new thread since the other one got intermixed with feature requests. [...]

[...]
Re: A proposal for Spark 2.0
Hi Kostas With regards to your *second* point. I believe that requiring from the user apps to explicitly declare their dependencies is the most clear API approach when it comes to classpath and classloading. However what about the following API: *SparkContext.addJar(String pathToJar)* . *Is this going to change or affected in someway?* Currently i use spark 1.5.2 in a Java application and i have built a utility class that finds the correct path of a Dependency (myPathOfTheJarDependency=Something like SparkUtils.getJarFullPathFromClass (EsSparkSQL.class, "^elasticsearch-hadoop-2.2.0-beta1.*\\.jar$");), Which is not something beatiful but i can live with. Then i use *javaSparkContext.addJar(myPathOfTheJarDependency)* ; after i have initiated the javaSparkContext. In that way i do not require my SparkCluster to have configuration on the classpath of my application and i explicitly define the dependencies during runtime of my app after each time i initiate a sparkContext. I would be happy and i believe many other users also if i could could continue having the same or similar approach with regards to dependencies Regards 2015-12-08 23:40 GMT+02:00 Kostas Sakellis : > I'd also like to make it a requirement that Spark 2.0 have a stable > dataframe and dataset API - we should not leave these APIs experimental in > the 2.0 release. We already know of at least one breaking change we need to > make to dataframes, now's the time to make any other changes we need to > stabilize these APIs. Anything we can do to make us feel more comfortable > about the dataset and dataframe APIs before the 2.0 release? > > I've also been thinking that in Spark 2.0, we might want to consider > strict classpath isolation for user applications. Hadoop 3 is moving in > this direction. We could, for instance, run all user applications in their > own classloader that only inherits very specific classes from Spark (ie. > public APIs). 
This will require user apps to explicitly declare their > dependencies as there won't be any accidental class leaking anymore. We do > something like this for *userClasspathFirst option but it is not as strict > as what I described. This is a breaking change but I think it will help > with eliminating weird classpath incompatibility issues between user > applications and Spark system dependencies. > > Thoughts? > > Kostas > > > On Fri, Dec 4, 2015 at 3:28 AM, Sean Owen wrote: > >> To be clear-er, I don't think it's clear yet whether a 1.7 release >> should exist or not. I could see both making sense. It's also not >> really necessary to decide now, well before a 1.6 is even out in the >> field. Deleting the version lost information, and I would not have >> done that given my reply. Reynold maybe I can take this up with you >> offline. >> >> On Thu, Dec 3, 2015 at 6:03 PM, Mark Hamstra >> wrote: >> > Reynold's post fromNov. 25: >> > >> >> I don't think we should drop support for Scala 2.10, or make it harder >> in >> >> terms of operations for people to upgrade. >> >> >> >> If there are further objections, I'm going to bump remove the 1.7 >> version >> >> and retarget things to 2.0 on JIRA. >> > >> > >> > On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen wrote: >> >> >> >> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I >> >> think that's premature. If there's a 1.7.0 then we've lost info about >> >> what it would contain. It's trivial at any later point to merge the >> >> versions. And, since things change and there's not a pressing need to >> >> decide one way or the other, it seems fine to at least collect this >> >> info like we have things like "1.4.3" that may never be released. I'd >> >> like to add it back? >> >> >> >> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen wrote: >> >> > Maintaining both a 1.7 and 2.0 is too much work for the project, >> which >> >> > is over-stretched now. 
This means that after 1.6 it's just small >> >> > maintenance releases in 1.x and no substantial features or evolution. >> >> > This means that the "in progress" APIs in 1.x that will stay that >> way, >> >> > unless one updates to 2.x. It's not unreasonable, but means the >> update >> >> > to the 2.x line isn't going to be that optional for users. >> >> > >> >> > Scala 2.10 is already EOL right? Supporting it in 2.x means >> supporting >> >> > it for a couple years, note. 2.10 is still used today, but that's the >> >> > point of the current stable 1.x release in general: if you want to >> >> > stick to current dependencies, stick to the current release. Although >> >> > I think that's the right way to think about support across major >> >> > versions in general, I can see that 2.x is more of a required update >> >> > for those following the project's fixes and releases. Hence may >> indeed >> >> > be important to just keep supporting 2.10. >> >> > >> >> > I can't see supporting 2.12 at the same time (right?). Is that a >> >> > concern? it will be long since GA by the time 2.x is first released. >> >> > >>
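The jar-locating helper described above (the poster's SparkUtils.getJarFullPathFromClass) can be sketched in a few lines. This is an assumption about how such a helper might work — using the JVM's ProtectionDomain rather than scanning the classpath with a regex:

```scala
// Hedged sketch: one way a helper like the getJarFullPathFromClass utility
// mentioned above could work. For classes loaded from a jar, the JVM records
// the jar's location in the class's ProtectionDomain.
def jarPathOfClass(cls: Class[_]): Option[String] =
  Option(cls.getProtectionDomain.getCodeSource)
    .map(_.getLocation.toURI.getPath)

// The resulting path can then be handed to sparkContext.addJar(...).
// Note: JDK bootstrap classes have no code source, so this returns None
// for e.g. classOf[String].
```

If strict classpath isolation lands in 2.0, a pattern like this keeps working as long as addJar (or its successor) still accepts a local jar path at runtime.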
Re: A proposal for Spark 2.0
I'd also like to make it a requirement that Spark 2.0 have a stable dataframe and dataset API - we should not leave these APIs experimental in the 2.0 release. We already know of at least one breaking change we need to make to dataframes, now's the time to make any other changes we need to stabilize these APIs. Anything we can do to make us feel more comfortable about the dataset and dataframe APIs before the 2.0 release? I've also been thinking that in Spark 2.0, we might want to consider strict classpath isolation for user applications. Hadoop 3 is moving in this direction. We could, for instance, run all user applications in their own classloader that only inherits very specific classes from Spark (ie. public APIs). This will require user apps to explicitly declare their dependencies as there won't be any accidental class leaking anymore. We do something like this for *userClasspathFirst option but it is not as strict as what I described. This is a breaking change but I think it will help with eliminating weird classpath incompatibility issues between user applications and Spark system dependencies. Thoughts? Kostas On Fri, Dec 4, 2015 at 3:28 AM, Sean Owen wrote: > To be clear-er, I don't think it's clear yet whether a 1.7 release > should exist or not. I could see both making sense. It's also not > really necessary to decide now, well before a 1.6 is even out in the > field. Deleting the version lost information, and I would not have > done that given my reply. Reynold maybe I can take this up with you > offline. > > On Thu, Dec 3, 2015 at 6:03 PM, Mark Hamstra > wrote: > > Reynold's post fromNov. 25: > > > >> I don't think we should drop support for Scala 2.10, or make it harder > in > >> terms of operations for people to upgrade. > >> > >> If there are further objections, I'm going to bump remove the 1.7 > version > >> and retarget things to 2.0 on JIRA. 
> > > > > > On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen wrote: > >> > >> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I > >> think that's premature. If there's a 1.7.0 then we've lost info about > >> what it would contain. It's trivial at any later point to merge the > >> versions. And, since things change and there's not a pressing need to > >> decide one way or the other, it seems fine to at least collect this > >> info like we have things like "1.4.3" that may never be released. I'd > >> like to add it back? > >> > >> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen wrote: > >> > Maintaining both a 1.7 and 2.0 is too much work for the project, which > >> > is over-stretched now. This means that after 1.6 it's just small > >> > maintenance releases in 1.x and no substantial features or evolution. > >> > This means that the "in progress" APIs in 1.x that will stay that way, > >> > unless one updates to 2.x. It's not unreasonable, but means the update > >> > to the 2.x line isn't going to be that optional for users. > >> > > >> > Scala 2.10 is already EOL right? Supporting it in 2.x means supporting > >> > it for a couple years, note. 2.10 is still used today, but that's the > >> > point of the current stable 1.x release in general: if you want to > >> > stick to current dependencies, stick to the current release. Although > >> > I think that's the right way to think about support across major > >> > versions in general, I can see that 2.x is more of a required update > >> > for those following the project's fixes and releases. Hence may indeed > >> > be important to just keep supporting 2.10. > >> > > >> > I can't see supporting 2.12 at the same time (right?). Is that a > >> > concern? it will be long since GA by the time 2.x is first released. > >> > > >> > There's another fairly coherent worldview where development continues > >> > in 1.7 and focuses on finishing the loose ends and lots of bug fixing. 
> >> > 2.0 is delayed somewhat into next year, and by that time supporting > >> > 2.11+2.12 and Java 8 looks more feasible and more in tune with > >> > currently deployed versions. > >> > > >> > I can't say I have a strong view but I personally hadn't imagined 2.x > >> > would start now. > >> > > >> > > >> > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin > >> > wrote: > >> >> I don't think we should drop support for Scala 2.10, or make it > harder > >> >> in > >> >> terms of operations for people to upgrade. > >> >> > >> >> If there are further objections, I'm going to bump remove the 1.7 > >> >> version > >> >> and retarget things to 2.0 on JIRA. > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >
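Kostas's strict-isolation idea can be sketched concretely. The class below is a minimal illustration of the mechanism, not a proposal for Spark's actual implementation: user code runs in a child classloader whose parent delegation is limited to the JDK plus an allow-list of "public API" package prefixes, so nothing else can leak across accidentally.

```scala
import java.net.{URL, URLClassLoader}

// Minimal sketch of strict classpath isolation (illustrative, not Spark code).
// Only JDK classes and allow-listed packages are delegated to the host
// classloader; everything else must be found on the child's own URLs.
class IsolatingClassLoader(
    jars: Array[URL],
    host: ClassLoader,
    allowedPrefixes: Seq[String])
  extends URLClassLoader(jars, null: ClassLoader) { // null parent = bootstrap only

  override protected def loadClass(name: String, resolve: Boolean): Class[_] =
    if (name.startsWith("java.") || allowedPrefixes.exists(name.startsWith)) {
      val c = host.loadClass(name) // delegated: part of the "public API"
      if (resolve) resolveClass(c)
      c
    } else {
      super.loadClass(name, resolve) // isolated: found on the child's URLs or not at all
    }
}
```

With an empty allow-list, even a class like `scala.Predef` is invisible to user code unless the user ships it — exactly the "no accidental class leaking" behavior described above; `spark.*.userClassPathFirst` only reorders lookup, it does not hide the host classpath like this.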
Re: A proposal for Spark 2.0
To be clear-er, I don't think it's clear yet whether a 1.7 release should exist or not. I could see both making sense. It's also not really necessary to decide now, well before a 1.6 is even out in the field. Deleting the version lost information, and I would not have done that given my reply. Reynold maybe I can take this up with you offline. On Thu, Dec 3, 2015 at 6:03 PM, Mark Hamstra wrote: > Reynold's post from Nov. 25: > >> I don't think we should drop support for Scala 2.10, or make it harder in >> terms of operations for people to upgrade. >> >> If there are further objections, I'm going to bump remove the 1.7 version >> and retarget things to 2.0 on JIRA. > > > On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen wrote: >> >> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I >> think that's premature. If there's a 1.7.0 then we've lost info about >> what it would contain. It's trivial at any later point to merge the >> versions. And, since things change and there's not a pressing need to >> decide one way or the other, it seems fine to at least collect this >> info like we have things like "1.4.3" that may never be released. I'd >> like to add it back? >> >> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen wrote: >> > Maintaining both a 1.7 and 2.0 is too much work for the project, which >> > is over-stretched now. This means that after 1.6 it's just small >> > maintenance releases in 1.x and no substantial features or evolution. >> > This means that the "in progress" APIs in 1.x that will stay that way, >> > unless one updates to 2.x. It's not unreasonable, but means the update >> > to the 2.x line isn't going to be that optional for users. >> > >> > Scala 2.10 is already EOL right? Supporting it in 2.x means supporting >> > it for a couple years, note. 2.10 is still used today, but that's the >> > point of the current stable 1.x release in general: if you want to >> > stick to current dependencies, stick to the current release. 
Although >> > I think that's the right way to think about support across major >> > versions in general, I can see that 2.x is more of a required update >> > for those following the project's fixes and releases. Hence may indeed >> > be important to just keep supporting 2.10. >> > >> > I can't see supporting 2.12 at the same time (right?). Is that a >> > concern? it will be long since GA by the time 2.x is first released. >> > >> > There's another fairly coherent worldview where development continues >> > in 1.7 and focuses on finishing the loose ends and lots of bug fixing. >> > 2.0 is delayed somewhat into next year, and by that time supporting >> > 2.11+2.12 and Java 8 looks more feasible and more in tune with >> > currently deployed versions. >> > >> > I can't say I have a strong view but I personally hadn't imagined 2.x >> > would start now. >> > >> > >> > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin >> > wrote: >> >> I don't think we should drop support for Scala 2.10, or make it harder >> >> in >> >> terms of operations for people to upgrade. >> >> >> >> If there are further objections, I'm going to bump remove the 1.7 >> >> version >> >> and retarget things to 2.0 on JIRA.
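The Scala-version question Sean raises is largely a build-matrix question. A hypothetical sbt fragment (illustrative only — Spark's actual build is Maven-based) shows what supporting several Scala lines at once looks like, and why each extra version has a carrying cost:

```scala
// Hypothetical sbt fragment -- illustrative only, not Spark's actual build.
scalaVersion := "2.11.7"                      // proposed default build for 2.0
crossScalaVersions := Seq("2.10.6", "2.11.7") // a "2.12.x" entry could be added once GA
// `sbt +package` then builds one artifact per listed Scala version; every
// extra entry in this list is another axis of the compile/test/release
// matrix, which is the maintenance cost of supporting 2.10 alongside 2.11.
```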
Re: A proposal for Spark 2.0
;> > >>>>>>> A 1.6.x release will only fix bugs - we typically don't change > APIs in > >>>>>>> z releases. The Dataset API is experimental and so we might be > changing the > >>>>>>> APIs before we declare it stable. This is why I think it is > important to > >>>>>>> first stabilize the Dataset API with a Spark 1.7 release before > moving to > >>>>>>> Spark 2.0. This will benefit users that would like to use the new > Dataset > >>>>>>> APIs but can't move to Spark 2.0 because of the backwards > incompatible > >>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc. > >>>>>>> > >>>>>>> Kostas > >>>>>>> > >>>>>>> > >>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra > >>>>>>> wrote: > >>>>>>>> > >>>>>>>> Why does stabilization of those two features require a 1.7 release > >>>>>>>> instead of 1.6.1? > >>>>>>>> > >>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis > >>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - > yes we > >>>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. > I'd like to > >>>>>>>>> propose we have one more 1.x release after Spark 1.6. This will > allow us to > >>>>>>>>> stabilize a few of the new features that were added in 1.6: > >>>>>>>>> > >>>>>>>>> 1) the experimental Datasets API > >>>>>>>>> 2) the new unified memory manager. > >>>>>>>>> > >>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy > transition > >>>>>>>>> but there will be users that won't be able to seamlessly upgrade > given what > >>>>>>>>> we have discussed as in scope for 2.0. For these users, having a > 1.x release > >>>>>>>>> with these new features/APIs stabilized will be very beneficial. > This might > >>>>>>>>> make Spark 1.7 a lighter release but that is not necessarily a > bad thing. > >>>>>>>>> > >>>>>>>>> Any thoughts on this timeline? 
> >>>>>>>>> > >>>>>>>>> Kostas Sakellis > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao > > >>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>> Agree, more features/apis/optimization need to be added in > DF/DS. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to > >>>>>>>>>> provide to developer, maybe the fundamental API is enough, > like, the > >>>>>>>>>> ShuffledRDD etc.. But PairRDDFunctions probably not in this > category, as we > >>>>>>>>>> can do the same thing easily with DF/DS, even better > performance. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> From: Mark Hamstra [mailto:m...@clearstorydata.com] > >>>>>>>>>> Sent: Friday, November 13, 2015 11:23 AM > >>>>>>>>>> To: Stephen Boesch > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Cc: dev@spark.apache.org > >>>>>>>>>> Subject: Re: A proposal for Spark 2.0 > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that > >>>>>>>>>> argues for retaining the RDD API but not
Re: A proposal for Spark 2.0
tabricks.com)> > > > >>>> wrote: > > > >>>>> > > > >>>>> I actually think the next one (after 1.6) should be Spark 2.0. The > > > >>>>> reason is that I already know we have to break some part of the > > > >>>>> DataFrame/Dataset API as part of the Dataset design. (e.g. > > > >>>>> DataFrame.map > > > >>>>> should return Dataset rather than RDD). In that case, I'd rather > > > >>>>> break this > > > >>>>> sooner (in one release) than later (in two releases). so the damage > > > >>>>> is > > > >>>>> smaller. > > > >>>>> > > > >>>>> I don't think whether we call Dataset/DataFrame experimental or not > > > >>>>> matters too much for 2.0. We can still call Dataset experimental in > > > >>>>> 2.0 and > > > >>>>> then mark them as stable in 2.1. Despite being "experimental", > > > >>>>> there has > > > >>>>> been no breaking changes to DataFrame from 1.3 to 1.6. > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra > > > >>>>> mailto:m...@clearstorydata.com)> > > > >>>>> wrote: > > > >>>>>> > > > >>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug > > > >>>>>> fixing. We're on the same page now. > > > >>>>>> > > > >>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis > > > >>>>>> mailto:kos...@cloudera.com)> > > > >>>>>> wrote: > > > >>>>>>> > > > >>>>>>> A 1.6.x release will only fix bugs - we typically don't change > > > >>>>>>> APIs in > > > >>>>>>> z releases. The Dataset API is experimental and so we might be > > > >>>>>>> changing the > > > >>>>>>> APIs before we declare it stable. This is why I think it is > > > >>>>>>> important to > > > >>>>>>> first stabilize the Dataset API with a Spark 1.7 release before > > > >>>>>>> moving to > > > >>>>>>> Spark 2.0. This will benefit users that would like to use the new > > > >>>>>>> Dataset > > > >>>>>>> APIs but can't move to Spark 2.0 because of the backwards > > > >>>>>>> incompatible > > > >>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc. 
> > > >>>>>>> > > > >>>>>>> Kostas > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra > > > >>>>>>> mailto:m...@clearstorydata.com)> wrote: > > > >>>>>>>> > > > >>>>>>>> Why does stabilization of those two features require a 1.7 > > > >>>>>>>> release > > > >>>>>>>> instead of 1.6.1? > > > >>>>>>>> > > > >>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis > > > >>>>>>>> mailto:kos...@cloudera.com)> wrote: > > > >>>>>>>>> > > > >>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - > > > >>>>>>>>> yes we > > > >>>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark > > > >>>>>>>>> 2.0. I'd like to > > > >>>>>>>>> propose we have one more 1.x release after Spark 1.6. This will > > > >>>>>>>>> allow us to > > > >>>>>>>>> stabilize a few of the new features that were added in 1.6: > > > >>>>>>>>> > > > >>>>>>>>> 1) the experimental Datasets API > > > >>>>>>>>&g
Re: A proposal for Spark 2.0
king changes in 2.0 though? Note that we're >> not >> >>>> removing Scala 2.10, we're just making the default build be against >> Scala >> >>>> 2.11 instead of 2.10. There seem to be very few changes that people >> would >> >>>> worry about. If people are going to update their apps, I think it's >> better >> >>>> to make the other small changes in 2.0 at the same time than to >> update once >> >>>> for Dataset and another time for 2.0. >> >>>> >> >>>> BTW just refer to Reynold's original post for the other proposed API >> >>>> changes. >> >>>> >> >>>> Matei >> >>>> >> >>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza >> wrote: >> >>>> >> >>>> I think that Kostas' logic still holds. The majority of Spark >> users, and >> >>>> likely an even vaster majority of people running vaster jobs, are >> still on >> >>>> RDDs and on the cusp of upgrading to DataFrames. Users will >> probably want >> >>>> to upgrade to the stable version of the Dataset / DataFrame API so >> they >> >>>> don't need to do so twice. Requiring that they absorb all the other >> ways >> >>>> that Spark breaks compatibility in the move to 2.0 makes it much more >> >>>> difficult for them to make this transition. >> >>>> >> >>>> Using the same set of APIs also means that it will be easier to >> backport >> >>>> critical fixes to the 1.x line. >> >>>> >> >>>> It's not clear to me that avoiding breakage of an experimental API >> in the >> >>>> 1.x line outweighs these issues. >> >>>> >> >>>> -Sandy >> >>>> >> >>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin >> >>>> wrote: >> >>>>> >> >>>>> I actually think the next one (after 1.6) should be Spark 2.0. The >> >>>>> reason is that I already know we have to break some part of the >> >>>>> DataFrame/Dataset API as part of the Dataset design. (e.g. >> DataFrame.map >> >>>>> should return Dataset rather than RDD). In that case, I'd rather >> break this >> >>>>> sooner (in one release) than later (in two releases). so the damage >> is >> >>>>> smaller. 
>> >>>>> >> >>>>> I don't think whether we call Dataset/DataFrame experimental or not >> >>>>> matters too much for 2.0. We can still call Dataset experimental in >> 2.0 and >> >>>>> then mark them as stable in 2.1. Despite being "experimental", >> there has >> >>>>> been no breaking changes to DataFrame from 1.3 to 1.6. >> >>>>> >> >>>>> >> >>>>> >> >>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra < >> m...@clearstorydata.com> >> >>>>> wrote: >> >>>>>> >> >>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug >> >>>>>> fixing. We're on the same page now. >> >>>>>> >> >>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis < >> kos...@cloudera.com> >> >>>>>> wrote: >> >>>>>>> >> >>>>>>> A 1.6.x release will only fix bugs - we typically don't change >> APIs in >> >>>>>>> z releases. The Dataset API is experimental and so we might be >> changing the >> >>>>>>> APIs before we declare it stable. This is why I think it is >> important to >> >>>>>>> first stabilize the Dataset API with a Spark 1.7 release before >> moving to >> >>>>>>> Spark 2.0. This will benefit users that would like to use the new >> Dataset >> >>>>>>> APIs but can't move to Spark 2.0 because of the backwards >> incompatible >> >>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc. >> >>>>>>> >> >>>>>>> Kostas >> >>>>>>> >> >>>>>>> >> >>>>&g
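The breaking change Reynold refers to is essentially a signature change. A self-contained toy (plain Scala, no Spark dependency; `DSet` and `RddLike` are stand-in names, not Spark classes) shows why returning a Dataset keeps a pipeline typed, where the 1.x signature drops you out of the DataFrame/Dataset world:

```scala
// Toy stand-ins, not Spark classes. In 1.x, DataFrame.map returns RDD[R],
// leaving the DataFrame/Dataset world; the 2.0 design keeps map inside it.
final case class RddLike[T](rows: Seq[T])   // stand-in for RDD[T]
final case class DSet[T](rows: Seq[T]) {    // stand-in for Dataset[T]
  // 1.x-style: map escapes to the RDD world
  def mapOld[R](f: T => R): RddLike[R] = RddLike(rows.map(f))
  // 2.0-style: map stays a Dataset, so later relational ops still apply
  def map[U](f: T => U): DSet[U] = DSet(rows.map(f))
  def filter(p: T => Boolean): DSet[T] = DSet(rows.filter(p))
}
```

With the 2.0-style signature, `ds.map(f).filter(p)` stays inside the Dataset abstraction; with the 1.x-style one, the `.filter` lands on an RDD and anything downstream is invisible to the query optimizer — which is why the change is worth making once, in a major release.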
Re: A proposal for Spark 2.0
gt; On Nov 24, 2015, at 12:27 PM, Sandy Ryza > wrote: > >>>> > >>>> I think that Kostas' logic still holds. The majority of Spark users, > and > >>>> likely an even vaster majority of people running vaster jobs, are > still on > >>>> RDDs and on the cusp of upgrading to DataFrames. Users will probably > want > >>>> to upgrade to the stable version of the Dataset / DataFrame API so > they > >>>> don't need to do so twice. Requiring that they absorb all the other > ways > >>>> that Spark breaks compatibility in the move to 2.0 makes it much more > >>>> difficult for them to make this transition. > >>>> > >>>> Using the same set of APIs also means that it will be easier to > backport > >>>> critical fixes to the 1.x line. > >>>> > >>>> It's not clear to me that avoiding breakage of an experimental API in > the > >>>> 1.x line outweighs these issues. > >>>> > >>>> -Sandy > >>>> > >>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin > >>>> wrote: > >>>>> > >>>>> I actually think the next one (after 1.6) should be Spark 2.0. The > >>>>> reason is that I already know we have to break some part of the > >>>>> DataFrame/Dataset API as part of the Dataset design. (e.g. > DataFrame.map > >>>>> should return Dataset rather than RDD). In that case, I'd rather > break this > >>>>> sooner (in one release) than later (in two releases). so the damage > is > >>>>> smaller. > >>>>> > >>>>> I don't think whether we call Dataset/DataFrame experimental or not > >>>>> matters too much for 2.0. We can still call Dataset experimental in > 2.0 and > >>>>> then mark them as stable in 2.1. Despite being "experimental", there > has > >>>>> been no breaking changes to DataFrame from 1.3 to 1.6. > >>>>> > >>>>> > >>>>> > >>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra < > m...@clearstorydata.com> > >>>>> wrote: > >>>>>> > >>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug > >>>>>> fixing. We're on the same page now. 
> >>>>>> > >>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis < > kos...@cloudera.com> > >>>>>> wrote: > >>>>>>> > >>>>>>> A 1.6.x release will only fix bugs - we typically don't change > APIs in > >>>>>>> z releases. The Dataset API is experimental and so we might be > changing the > >>>>>>> APIs before we declare it stable. This is why I think it is > important to > >>>>>>> first stabilize the Dataset API with a Spark 1.7 release before > moving to > >>>>>>> Spark 2.0. This will benefit users that would like to use the new > Dataset > >>>>>>> APIs but can't move to Spark 2.0 because of the backwards > incompatible > >>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc. > >>>>>>> > >>>>>>> Kostas > >>>>>>> > >>>>>>> > >>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra > >>>>>>> wrote: > >>>>>>>> > >>>>>>>> Why does stabilization of those two features require a 1.7 release > >>>>>>>> instead of 1.6.1? > >>>>>>>> > >>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis > >>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - > yes we > >>>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. > I'd like to > >>>>>>>>> propose we have one more 1.x release after Spark 1.6. This will > allow us to > >>>>>>>>> stabilize a few of the new features that were added in 1.6: > >>>>>>>>> > >>>>>>>>> 1) the experimental Datasets API > >>>>>>>>> 2)
Re: A proposal for Spark 2.0
Pardon for tacking on one more message to this thread, but I'm reminded of one more issue when building the RC today: Scala 2.10 does not in general try to work with Java 8, and indeed I can never fully compile it with Java 8 on Ubuntu or OS X, due to scalac assertion errors. 2.11 is the first that's supposed to work with Java 8. This may be a good reason to drop 2.10 by the time this comes up. On Thu, Nov 26, 2015 at 8:59 PM, Koert Kuipers wrote: > I also thought the idea was to drop 2.10. Do we want to cross build for 3 > scala versions? >
Re: A proposal for Spark 2.0
. >>>> >>>> It's not clear to me that avoiding breakage of an experimental API in the >>>> 1.x line outweighs these issues. >>>> >>>> -Sandy >>>> >>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin >>>> wrote: >>>>> >>>>> I actually think the next one (after 1.6) should be Spark 2.0. The >>>>> reason is that I already know we have to break some part of the >>>>> DataFrame/Dataset API as part of the Dataset design. (e.g. DataFrame.map >>>>> should return Dataset rather than RDD). In that case, I'd rather break >>>>> this >>>>> sooner (in one release) than later (in two releases). so the damage is >>>>> smaller. >>>>> >>>>> I don't think whether we call Dataset/DataFrame experimental or not >>>>> matters too much for 2.0. We can still call Dataset experimental in 2.0 >>>>> and >>>>> then mark them as stable in 2.1. Despite being "experimental", there has >>>>> been no breaking changes to DataFrame from 1.3 to 1.6. >>>>> >>>>> >>>>> >>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra >>>>> wrote: >>>>>> >>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug >>>>>> fixing. We're on the same page now. >>>>>> >>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis >>>>>> wrote: >>>>>>> >>>>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in >>>>>>> z releases. The Dataset API is experimental and so we might be changing >>>>>>> the >>>>>>> APIs before we declare it stable. This is why I think it is important to >>>>>>> first stabilize the Dataset API with a Spark 1.7 release before moving >>>>>>> to >>>>>>> Spark 2.0. This will benefit users that would like to use the new >>>>>>> Dataset >>>>>>> APIs but can't move to Spark 2.0 because of the backwards incompatible >>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc. 
>>>>>>> >>>>>>> Kostas >>>>>>> >>>>>>> >>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra >>>>>>> wrote: >>>>>>>> >>>>>>>> Why does stabilization of those two features require a 1.7 release >>>>>>>> instead of 1.6.1? >>>>>>>> >>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes we >>>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd >>>>>>>>> like to >>>>>>>>> propose we have one more 1.x release after Spark 1.6. This will allow >>>>>>>>> us to >>>>>>>>> stabilize a few of the new features that were added in 1.6: >>>>>>>>> >>>>>>>>> 1) the experimental Datasets API >>>>>>>>> 2) the new unified memory manager. >>>>>>>>> >>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition >>>>>>>>> but there will be users that won't be able to seamlessly upgrade >>>>>>>>> given what >>>>>>>>> we have discussed as in scope for 2.0. For these users, having a 1.x >>>>>>>>> release >>>>>>>>> with these new features/APIs stabilized will be very beneficial. This >>>>>>>>> might >>>>>>>>> make Spark 1.7 a lighter release but that is not necessarily a bad >>>>>>>>> thing. >>>>>>>>> >>>>>>>>> Any thoughts on this timeline? >>>>>>>>> >>>>>>>>> Kostas Sakellis >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Nov 12, 2015 at
Re: A proposal for Spark 2.0
, Nov 13, 2015 at 12:26 PM, Mark Hamstra < >>>>>> m...@clearstorydata.com> wrote: >>>>>> >>>>>>> Why does stabilization of those two features require a 1.7 release >>>>>>> instead of 1.6.1? >>>>>>> >>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis < >>>>>>> kos...@cloudera.com> wrote: >>>>>>> >>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes >>>>>>>> we can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd >>>>>>>> like to propose we have one more 1.x release after Spark 1.6. This will >>>>>>>> allow us to stabilize a few of the new features that were added in 1.6: >>>>>>>> >>>>>>>> 1) the experimental Datasets API >>>>>>>> 2) the new unified memory manager. >>>>>>>> >>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition >>>>>>>> but there will be users that won't be able to seamlessly upgrade given >>>>>>>> what >>>>>>>> we have discussed as in scope for 2.0. For these users, having a 1.x >>>>>>>> release with these new features/APIs stabilized will be very >>>>>>>> beneficial. >>>>>>>> This might make Spark 1.7 a lighter release but that is not >>>>>>>> necessarily a >>>>>>>> bad thing. >>>>>>>> >>>>>>>> Any thoughts on this timeline? >>>>>>>> >>>>>>>> Kostas Sakellis >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Agree, more features/apis/optimization need to be added in DF/DS. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to >>>>>>>>> provide to developer, maybe the fundamental API is enough, like, the >>>>>>>>> ShuffledRDD etc.. But PairRDDFunctions probably not in this >>>>>>>>> category, as >>>>>>>>> we can do the same thing easily with DF/DS, even better performance. 
>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> *From:* Mark Hamstra [mailto:m...@clearstorydata.com] >>>>>>>>> *Sent:* Friday, November 13, 2015 11:23 AM >>>>>>>>> *To:* Stephen Boesch >>>>>>>>> >>>>>>>>> *Cc:* dev@spark.apache.org >>>>>>>>> *Subject:* Re: A proposal for Spark 2.0 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that >>>>>>>>> argues for retaining the RDD API but not as the first thing presented >>>>>>>>> to >>>>>>>>> new Spark developers: "Here's how to use groupBy with DataFrames >>>>>>>>> Until >>>>>>>>> the optimizer is more fully developed, that won't always get you the >>>>>>>>> best >>>>>>>>> performance that can be obtained. In these particular circumstances, >>>>>>>>> ..., >>>>>>>>> you may want to use the low-level RDD API while setting >>>>>>>>> preservesPartitioning to true. Like this" >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> My understanding is that the RDD's presently have more support >>>>>>>>> for complete control of partitioning which is a key consideration at >>>>>>>>> scale. While partitioning control is still piecemeal in DF/DS it >
Re: A proposal for Spark 2.0
> On 25 Nov 2015, at 08:54, Sandy Ryza wrote:
>
> I see. My concern is / was that cluster operators will be reluctant to upgrade to 2.0, meaning that developers using those clusters need to stay on 1.x, and, if they want to move to DataFrames, essentially need to port their app twice.
>
> I misunderstood and thought part of the proposal was to drop support for 2.10 though. If your broad point is that there aren't changes in 2.0 that will make it less palatable to cluster administrators than releases in the 1.x line, then yes, 2.0 as the next release sounds fine to me.
>
> -Sandy

Mixing Spark versions in a YARN cluster with compatible Hadoop native libs isn't so hard: users just deploy them separately. But:
- mixing Scala versions is going to be tricky unless the jobs people submit are configured with the different paths
- the history server will need to be the latest Spark version being executed in the cluster
Re: A proposal for Spark 2.0
What are the other breaking changes in 2.0 though? Note that we're not removing Scala 2.10, we're just making the default build be against Scala 2.11 instead of 2.10. There seem to be very few changes that people would worry about. If people are going to update their apps, I think it's better to make the other small changes in 2.0 at the same time than to update once for Dataset and another time for 2.0.

BTW just refer to Reynold's original post for the other proposed API changes.

Matei

> On Nov 24, 2015, at 12:27 PM, Sandy Ryza wrote:
>
> I think that Kostas' logic still holds. [...]
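For reference, the build switch Matei alludes to looked roughly like this in that era's "Building Spark" documentation (commands and the `-Dscala-2.10` property are quoted from memory of the 1.6/2.0-preview docs — verify against your actual checkout before relying on them):

```shell
# Assumed 1.6/2.0-era commands: rewrite the POMs for Scala 2.10,
# then build with the matching Maven property.
./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Dscala-2.10 -DskipTests clean package
```

With 2.0 the default simply flips: the stock build targets Scala 2.11, and 2.10 becomes the opt-in variant.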
Re: A proposal for Spark 2.0
I think that Kostas' logic still holds. The majority of Spark users, and likely an even vaster majority of people running vaster jobs, are still on RDDs and on the cusp of upgrading to DataFrames. Users will probably want to upgrade to the stable version of the Dataset / DataFrame API so they don't need to do so twice. Requiring that they absorb all the other ways that Spark breaks compatibility in the move to 2.0 makes it much more difficult for them to make this transition.

Using the same set of APIs also means that it will be easier to backport critical fixes to the 1.x line.

It's not clear to me that avoiding breakage of an experimental API in the 1.x line outweighs these issues.

-Sandy

On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin wrote:
> I actually think the next one (after 1.6) should be Spark 2.0. [...]
Re: A proposal for Spark 2.0
I actually think the next one (after 1.6) should be Spark 2.0. The reason is that I already know we have to break some part of the DataFrame/Dataset API as part of the Dataset design (e.g. DataFrame.map should return Dataset rather than RDD). In that case, I'd rather break this sooner (in one release) than later (in two releases), so the damage is smaller.

I don't think whether we call Dataset/DataFrame experimental or not matters too much for 2.0. We can still call Dataset experimental in 2.0 and then mark them as stable in 2.1. Despite being "experimental", there have been no breaking changes to DataFrame from 1.3 to 1.6.

On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra wrote:
> Ah, got it; by "stabilize" you meant changing the API, not just bug fixing. [...]
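The API change Reynold describes — map on a DataFrame returning a Dataset rather than dropping down to an RDD — can be sketched with a toy model in plain Python (stand-in classes, not the real Spark API):

```python
# Toy stand-ins for Spark's high-level and low-level collection types.
class RDD:
    """What 1.x's DataFrame.map returned: the caller falls out of the
    optimized, typed API and into the low-level one."""
    def __init__(self, rows):
        self.rows = list(rows)

class Dataset:
    """2.0-style: map returns another Dataset, so transformation chains
    stay inside one API that the optimizer can see end to end."""
    def __init__(self, rows):
        self.rows = list(rows)

    def map(self, f):
        return Dataset(f(r) for r in self.rows)   # Dataset, not RDD

    def filter(self, p):
        return Dataset(r for r in self.rows if p(r))

# Chaining works because every step yields the same high-level type.
ds = Dataset([1, 2, 3]).map(lambda x: x * 10).filter(lambda x: x > 10)
assert isinstance(ds, Dataset)
assert ds.rows == [20, 30]
```

The point of breaking this in one release is visible in the sketch: changing map's return type changes what every downstream call in a user's chain compiles against, so doing it twice (once in 1.x, again in 2.0) would break the same code twice.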
Re: A proposal for Spark 2.0
Ah, got it; by "stabilize" you meant changing the API, not just bug fixing. We're on the same page now.

On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis wrote:
> A 1.6.x release will only fix bugs - we typically don't change APIs in z releases. [...]
Re: A proposal for Spark 2.0
A 1.6.x release will only fix bugs - we typically don't change APIs in z releases. The Dataset API is experimental and so we might be changing the APIs before we declare it stable. This is why I think it is important to first stabilize the Dataset API with a Spark 1.7 release before moving to Spark 2.0. This will benefit users that would like to use the new Dataset APIs but can't move to Spark 2.0 because of the backwards incompatible changes, like removal of deprecated APIs, Scala 2.11 etc. Kostas On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra wrote: > Why does stabilization of those two features require a 1.7 release instead > of 1.6.1? > > On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis > wrote: > >> We have veered off the topic of Spark 2.0 a little bit here - yes we can >> talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like to >> propose we have one more 1.x release after Spark 1.6. This will allow us to >> stabilize a few of the new features that were added in 1.6: >> >> 1) the experimental Datasets API >> 2) the new unified memory manager. >> >> I understand our goal for Spark 2.0 is to offer an easy transition but >> there will be users that won't be able to seamlessly upgrade given what we >> have discussed as in scope for 2.0. For these users, having a 1.x release >> with these new features/APIs stabilized will be very beneficial. This might >> make Spark 1.7 a lighter release but that is not necessarily a bad thing. >> >> Any thoughts on this timeline? >> >> Kostas Sakellis >> >> >> >> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao wrote: >> >>> Agree, more features/apis/optimization need to be added in DF/DS. >>> >>> >>> >>> I mean, we need to think about what kind of RDD APIs we have to provide >>> to developer, maybe the fundamental API is enough, like, the ShuffledRDD >>> etc.. But PairRDDFunctions probably not in this category, as we can do the >>> same thing easily with DF/DS, even better performance. 
>>> >>> >>> >>> *From:* Mark Hamstra [mailto:m...@clearstorydata.com] >>> *Sent:* Friday, November 13, 2015 11:23 AM >>> *To:* Stephen Boesch >>> >>> *Cc:* dev@spark.apache.org >>> *Subject:* Re: A proposal for Spark 2.0 >>> >>> >>> >>> Hmmm... to me, that seems like precisely the kind of thing that argues >>> for retaining the RDD API but not as the first thing presented to new Spark >>> developers: "Here's how to use groupBy with DataFrames Until the >>> optimizer is more fully developed, that won't always get you the best >>> performance that can be obtained. In these particular circumstances, ..., >>> you may want to use the low-level RDD API while setting >>> preservesPartitioning to true. Like this" >>> >>> >>> >>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch >>> wrote: >>> >>> My understanding is that the RDD's presently have more support for >>> complete control of partitioning which is a key consideration at scale. >>> While partitioning control is still piecemeal in DF/DS it would seem >>> premature to make RDD's a second-tier approach to spark dev. >>> >>> >>> >>> An example is the use of groupBy when we know that the source relation >>> (/RDD) is already partitioned on the grouping expressions. AFAIK the spark >>> sql still does not allow that knowledge to be applied to the optimizer - so >>> a full shuffle will be performed. However in the native RDD we can use >>> preservesPartitioning=true. >>> >>> >>> >>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra : >>> >>> The place of the RDD API in 2.0 is also something I've been wondering >>> about. I think it may be going too far to deprecate it, but changing >>> emphasis is something that we might consider. The RDD API came well before >>> DataFrames and DataSets, so programming guides, introductory how-to >>> articles and the like have, to this point, also tended to emphasize RDDs -- >>> or at least to deal with them early. 
What I'm thinking is that with 2.0 >>> maybe we should overhaul all the documentation to de-emphasize and >>> reposition RDDs. In this scheme, DataFrames and DataSets would be >>> introduced and fully addressed before RDDs. They would be presented as the >>> normal/default/standard way to do things in Spark. RDDs, in contrast, >>> would be presented later as a kind of lower-level, closer-to-the-metal API >>> that can be used in atypical, more specialized contexts where DataFrames or >>> DataSets don't fully fit.
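Kostas's stabilization argument can be made concrete with a short sketch of the 1.6-era experimental Dataset API. This is a hypothetical minimal program (the class and value names are illustrative; `toDS()` and the typed `filter` are as they shipped in 1.6, but exact experimental signatures were still subject to change, which is precisely why a stabilizing 1.7 is proposed):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical example record type.
case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dataset-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // toDS() arrived with the experimental Dataset API in 1.6.
    // Unlike a DataFrame expression string, the lambda below is
    // checked at compile time against the Person type.
    val people = Seq(Person("ann", 34), Person("bob", 17)).toDS()
    val adults = people.filter(_.age >= 18)
    adults.show()

    sc.stop()
  }
}
```

Users writing code like this against 1.6 are exactly the ones who would benefit from a 1.7 that stabilizes the API before the 2.0 breaking changes land.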
Re: A proposal for Spark 2.0
Hey Matei, > Regarding Scala 2.12, we should definitely support it eventually, but I > don't think we need to block 2.0 on that because it can be added later too. > Has anyone investigated what it would take to run on there? I imagine we > don't need many code changes, just maybe some REPL stuff. Our REPL-specific changes were merged in scala/scala and are available as part of 2.11.7, and hopefully will be part of 2.12 too. If I am not wrong, the REPL stuff is taken care of; we don't need to keep upgrading REPL code for every Scala release now. http://www.scala-lang.org/news/2.11.7 I am +1 on the proposal for Spark 2.0. Thanks, Prashant Sharma On Thu, Nov 12, 2015 at 3:02 AM, Matei Zaharia wrote: > I like the idea of popping out Tachyon to an optional component too to > reduce the number of dependencies. In the future, it might even be useful > to do this for Hadoop, but it requires too many API changes to be worth > doing now. > > Regarding Scala 2.12, we should definitely support it eventually, but I > don't think we need to block 2.0 on that because it can be added later too. > Has anyone investigated what it would take to run on there? I imagine we > don't need many code changes, just maybe some REPL stuff. > > Needless to say, but I'm all for the idea of making "major" releases as > undisruptive as possible in the model Reynold proposed. Keeping everyone > working with the same set of releases is super important. > > Matei > > > On Nov 11, 2015, at 4:58 AM, Sean Owen wrote: > > > > On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin > wrote: > >> to the Spark community. A major release should not be very different > from a > >> minor release and should not be gated based on new features. The main > >> purpose of a major release is an opportunity to fix things that are > broken > >> in the current API and remove certain deprecated APIs (examples follow). > > > > Agree with this stance.
Generally, a major release might also be a > > time to replace some big old API or implementation with a new one, but > > I don't see obvious candidates. > > > > I wouldn't mind turning attention to 2.x sooner than later, unless > > there's a fairly good reason to continue adding features in 1.x to a > > 1.7 release. The scope as of 1.6 is already pretty darned big. > > > > > >> 1. Scala 2.11 as the default build. We should still support Scala 2.10, > but > >> it has been end-of-life. > > > > By the time 2.x rolls around, 2.12 will be the main version, 2.11 will > > be quite stable, and 2.10 will have been EOL for a while. I'd propose > > dropping 2.10. Otherwise it's supported for 2 more years. > > > > > >> 2. Remove Hadoop 1 support. > > > > I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were > > sort of 'alpha' and 'beta' releases) and even <2.6. > > > > I'm sure we'll think of a number of other small things -- shading a > > bunch of stuff? reviewing and updating dependencies in light of > > simpler, more recent dependencies to support from Hadoop etc? > > > > Farming out Tachyon to a module? (I felt like someone proposed this?) > > Pop out any Docker stuff to another repo? > > Continue that same effort for EC2? > > Farming out some of the "external" integrations to another repo (? > > controversial) > > > > See also anything marked version "2+" in JIRA. > > > > - > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > > For additional commands, e-mail: dev-h...@spark.apache.org > > > > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >
Re: A proposal for Spark 2.0
Producing new x.0 releases of open source projects is a recurrent problem: too radical a change means the old version keeps getting updated anyway (Python 3), and an incompatible version stops take-up (for example, Log4j 2 dropping support for log4j.properties files). Similarly, any radical new feature tends to push out release times longer than you think (Hadoop 2). I think the lessons I'd draw from those and others are: keep an x.0 version as compatible as possible so that everyone can move, and ship fast. You want to be able to retire the 1.x line. And how do you ship fast? Keep the features down. For anyone planning anything radical, a branch with a clear plan/schedule to be merged in is probably the best strategy. I actually think the Firefox process is the best here, and that it should have been adopted more in Hadoop; ongoing work is going in in branches for some things (erasure coding, IPv6), but there's still pressure to define the release schedule on feature completeness. https://wiki.mozilla.org/Release_Management/Release_Process See also JDD's article on evolution vs. revolution in OSS; 15 years old but still valid. At the time, the Jakarta project was the equivalent of the ASF Hadoop/big-data stack, and indeed its traces run through the code and the build & test process if you know what to look for: http://incubator.apache.org/learn/rules-for-revolutionaries.html -Steve
Re: A proposal for Spark 2.0
Why does stabilization of those two features require a 1.7 release instead of 1.6.1? On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis wrote: > We have veered off the topic of Spark 2.0 a little bit here - yes we can > talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like to > propose we have one more 1.x release after Spark 1.6. This will allow us to > stabilize a few of the new features that were added in 1.6: > > 1) the experimental Datasets API > 2) the new unified memory manager. > > I understand our goal for Spark 2.0 is to offer an easy transition but > there will be users that won't be able to seamlessly upgrade given what we > have discussed as in scope for 2.0. For these users, having a 1.x release > with these new features/APIs stabilized will be very beneficial. This might > make Spark 1.7 a lighter release but that is not necessarily a bad thing. > > Any thoughts on this timeline? > > Kostas Sakellis > > > > On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao wrote: > >> Agree, more features/apis/optimization need to be added in DF/DS. >> >> >> >> I mean, we need to think about what kind of RDD APIs we have to provide >> to developer, maybe the fundamental API is enough, like, the ShuffledRDD >> etc.. But PairRDDFunctions probably not in this category, as we can do the >> same thing easily with DF/DS, even better performance. >> >> >> >> *From:* Mark Hamstra [mailto:m...@clearstorydata.com] >> *Sent:* Friday, November 13, 2015 11:23 AM >> *To:* Stephen Boesch >> >> *Cc:* dev@spark.apache.org >> *Subject:* Re: A proposal for Spark 2.0 >> >> >> >> Hmmm... to me, that seems like precisely the kind of thing that argues >> for retaining the RDD API but not as the first thing presented to new Spark >> developers: "Here's how to use groupBy with DataFrames Until the >> optimizer is more fully developed, that won't always get you the best >> performance that can be obtained. 
In these particular circumstances, ..., >> you may want to use the low-level RDD API while setting >> preservesPartitioning to true. Like this" >> >> >> >> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch >> wrote: >> >> My understanding is that the RDD's presently have more support for >> complete control of partitioning which is a key consideration at scale. >> While partitioning control is still piecemeal in DF/DS it would seem >> premature to make RDD's a second-tier approach to spark dev. >> >> >> >> An example is the use of groupBy when we know that the source relation >> (/RDD) is already partitioned on the grouping expressions. AFAIK the spark >> sql still does not allow that knowledge to be applied to the optimizer - so >> a full shuffle will be performed. However in the native RDD we can use >> preservesPartitioning=true. >> >> >> >> 2015-11-12 17:42 GMT-08:00 Mark Hamstra : >> >> The place of the RDD API in 2.0 is also something I've been wondering >> about. I think it may be going too far to deprecate it, but changing >> emphasis is something that we might consider. The RDD API came well before >> DataFrames and DataSets, so programming guides, introductory how-to >> articles and the like have, to this point, also tended to emphasize RDDs -- >> or at least to deal with them early. What I'm thinking is that with 2.0 >> maybe we should overhaul all the documentation to de-emphasize and >> reposition RDDs. In this scheme, DataFrames and DataSets would be >> introduced and fully addressed before RDDs. They would be presented as the >> normal/default/standard way to do things in Spark. RDDs, in contrast, >> would be presented later as a kind of lower-level, closer-to-the-metal API >> that can be used in atypical, more specialized contexts where DataFrames or >> DataSets don't fully fit. 
>> >> >> >> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao wrote: >> >> I am not sure what the best practice for this specific problem, but it’s >> really worth to think about it in 2.0, as it is a painful issue for lots of >> users. >> >> >> >> By the way, is it also an opportunity to deprecate the RDD API (or >> internal API only?)? As lots of its functionality overlapping with >> DataFrame or DataSet. >> >> >> >> Hao >> >> >> >> *From:* Kostas Sakellis [mailto:kos...@cloudera.com] >> *Sent:* Friday, November 13, 2015 5:27 AM >> *To:* Nicholas Chammas >> *Cc:* Ulanov,
Re: A proposal for Spark 2.0
We have veered off the topic of Spark 2.0 a little bit here - yes we can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like to propose we have one more 1.x release after Spark 1.6. This will allow us to stabilize a few of the new features that were added in 1.6: 1) the experimental Datasets API 2) the new unified memory manager. I understand our goal for Spark 2.0 is to offer an easy transition but there will be users that won't be able to seamlessly upgrade given what we have discussed as in scope for 2.0. For these users, having a 1.x release with these new features/APIs stabilized will be very beneficial. This might make Spark 1.7 a lighter release but that is not necessarily a bad thing. Any thoughts on this timeline? Kostas Sakellis On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao wrote: > Agree, more features/apis/optimization need to be added in DF/DS. > > > > I mean, we need to think about what kind of RDD APIs we have to provide to > developer, maybe the fundamental API is enough, like, the ShuffledRDD > etc.. But PairRDDFunctions probably not in this category, as we can do the > same thing easily with DF/DS, even better performance. > > > > *From:* Mark Hamstra [mailto:m...@clearstorydata.com] > *Sent:* Friday, November 13, 2015 11:23 AM > *To:* Stephen Boesch > > *Cc:* dev@spark.apache.org > *Subject:* Re: A proposal for Spark 2.0 > > > > Hmmm... to me, that seems like precisely the kind of thing that argues for > retaining the RDD API but not as the first thing presented to new Spark > developers: "Here's how to use groupBy with DataFrames Until the > optimizer is more fully developed, that won't always get you the best > performance that can be obtained. In these particular circumstances, ..., > you may want to use the low-level RDD API while setting > preservesPartitioning to true. 
Like this" > > > > On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch wrote: > > My understanding is that the RDD's presently have more support for > complete control of partitioning which is a key consideration at scale. > While partitioning control is still piecemeal in DF/DS it would seem > premature to make RDD's a second-tier approach to spark dev. > > > > An example is the use of groupBy when we know that the source relation > (/RDD) is already partitioned on the grouping expressions. AFAIK the spark > sql still does not allow that knowledge to be applied to the optimizer - so > a full shuffle will be performed. However in the native RDD we can use > preservesPartitioning=true. > > > > 2015-11-12 17:42 GMT-08:00 Mark Hamstra : > > The place of the RDD API in 2.0 is also something I've been wondering > about. I think it may be going too far to deprecate it, but changing > emphasis is something that we might consider. The RDD API came well before > DataFrames and DataSets, so programming guides, introductory how-to > articles and the like have, to this point, also tended to emphasize RDDs -- > or at least to deal with them early. What I'm thinking is that with 2.0 > maybe we should overhaul all the documentation to de-emphasize and > reposition RDDs. In this scheme, DataFrames and DataSets would be > introduced and fully addressed before RDDs. They would be presented as the > normal/default/standard way to do things in Spark. RDDs, in contrast, > would be presented later as a kind of lower-level, closer-to-the-metal API > that can be used in atypical, more specialized contexts where DataFrames or > DataSets don't fully fit. > > > > On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao wrote: > > I am not sure what the best practice for this specific problem, but it’s > really worth to think about it in 2.0, as it is a painful issue for lots of > users. > > > > By the way, is it also an opportunity to deprecate the RDD API (or > internal API only?)? 
As lots of its functionality overlapping with > DataFrame or DataSet. > > > > Hao > > > > *From:* Kostas Sakellis [mailto:kos...@cloudera.com] > *Sent:* Friday, November 13, 2015 5:27 AM > *To:* Nicholas Chammas > *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; > Reynold Xin > > > *Subject:* Re: A proposal for Spark 2.0 > > > > I know we want to keep breaking changes to a minimum but I'm hoping that > with Spark 2.0 we can also look at better classpath isolation with user > programs. I propose we build on spark.{driver|executor}.userClassPathFirst, > setting it true by default, and not allow any spark transitive dependencies > to leak into user code. For backwards compatibility we can have a whitelist > if we want but it'd be good if we start requiring user apps to explicitly > pull in all their dependencies. From what I can tell, Hadoop 3 is also > moving in this direction. > > Kostas
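For reference, the flags Kostas proposes building on already exist as experimental settings in Spark 1.x. A minimal sketch of turning them on today (assuming a standard `spark-defaults.conf`, and with the caveat that this can currently break jobs whose transitive dependencies conflict with Spark's own):

```
# spark-defaults.conf (sketch): resolve classes from user jars before Spark's
spark.driver.userClassPathFirst    true
spark.executor.userClassPathFirst  true
```

The same can be set per job, e.g. `spark-submit --conf spark.executor.userClassPathFirst=true ...`; the 2.0 proposal above is essentially to make this behavior the default and shade or hide Spark's transitive dependencies.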
RE: A proposal for Spark 2.0
Agree, more features/apis/optimization need to be added in DF/DS. I mean, we need to think about what kind of RDD APIs we have to provide to developer, maybe the fundamental API is enough, like, the ShuffledRDD etc.. But PairRDDFunctions probably not in this category, as we can do the same thing easily with DF/DS, even better performance. From: Mark Hamstra [mailto:m...@clearstorydata.com] Sent: Friday, November 13, 2015 11:23 AM To: Stephen Boesch Cc: dev@spark.apache.org Subject: Re: A proposal for Spark 2.0 Hmmm... to me, that seems like precisely the kind of thing that argues for retaining the RDD API but not as the first thing presented to new Spark developers: "Here's how to use groupBy with DataFrames Until the optimizer is more fully developed, that won't always get you the best performance that can be obtained. In these particular circumstances, ..., you may want to use the low-level RDD API while setting preservesPartitioning to true. Like this" On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch mailto:java...@gmail.com>> wrote: My understanding is that the RDD's presently have more support for complete control of partitioning which is a key consideration at scale. While partitioning control is still piecemeal in DF/DS it would seem premature to make RDD's a second-tier approach to spark dev. An example is the use of groupBy when we know that the source relation (/RDD) is already partitioned on the grouping expressions. AFAIK the spark sql still does not allow that knowledge to be applied to the optimizer - so a full shuffle will be performed. However in the native RDD we can use preservesPartitioning=true. 2015-11-12 17:42 GMT-08:00 Mark Hamstra mailto:m...@clearstorydata.com>>: The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. 
The RDD API came well before DataFrames and DataSets, so programming guides, introductory how-to articles and the like have, to this point, also tended to emphasize RDDs -- or at least to deal with them early. What I'm thinking is that with 2.0 maybe we should overhaul all the documentation to de-emphasize and reposition RDDs. In this scheme, DataFrames and DataSets would be introduced and fully addressed before RDDs. They would be presented as the normal/default/standard way to do things in Spark. RDDs, in contrast, would be presented later as a kind of lower-level, closer-to-the-metal API that can be used in atypical, more specialized contexts where DataFrames or DataSets don't fully fit. On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao mailto:hao.ch...@intel.com>> wrote: I am not sure what the best practice for this specific problem, but it’s really worth to think about it in 2.0, as it is a painful issue for lots of users. By the way, is it also an opportunity to deprecate the RDD API (or internal API only?)? As lots of its functionality overlapping with DataFrame or DataSet. Hao From: Kostas Sakellis [mailto:kos...@cloudera.com<mailto:kos...@cloudera.com>] Sent: Friday, November 13, 2015 5:27 AM To: Nicholas Chammas Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com<mailto:wi...@qq.com>; dev@spark.apache.org<mailto:dev@spark.apache.org>; Reynold Xin Subject: Re: A proposal for Spark 2.0 I know we want to keep breaking changes to a minimum but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allow any spark transitive dependencies to leak into user code. For backwards compatibility we can have a whitelist if we want but I'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction. 
Kostas On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas mailto:nicholas.cham...@gmail.com>> wrote: With regards to Machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. Current structure of two separate machine learning packages seems to be somewhat confusing. With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten. On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn’t have to wait for everything that we want to deprecate to be replaced all at once. Nick On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander mailto:alexander.ula...@hpe.com>> wrote: Parameter Server is a new feature and thus does not match the goal of 2.0 is “to fix things that are broken in the current API and remove certain deprecated APIs”.
Re: RE: A proposal for Spark 2.0
Yes, I agree with Nan Zhu. I recommend these projects: https://github.com/dmlc/ps-lite (Apache License 2) https://github.com/Microsoft/multiverso (MIT License) Alexander, you may also be interested in the demo (graph on parameter server): https://github.com/witgo/zen/tree/ps_graphx/graphx/src/main/scala/com/github/cloudml/zen/graphx -- Original -- From: "Ulanov, Alexander"; Date: Fri, Nov 13, 2015 01:44 AM To: "Nan Zhu"; "Guoqiang Li"; Cc: "dev@spark.apache.org"; "Reynold Xin"; Subject: RE: A proposal for Spark 2.0 Parameter Server is a new feature and thus does not match the goal of 2.0, which is “to fix things that are broken in the current API and remove certain deprecated APIs”. At the same time I would be happy to have that feature. With regards to Machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. Current structure of two separate machine learning packages seems to be somewhat confusing. With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to Dataframe. This will allow GraphX to evolve with Tungsten. Best regards, Alexander From: Nan Zhu [mailto:zhunanmcg...@gmail.com] Sent: Thursday, November 12, 2015 7:28 AM To: wi...@qq.com Cc: dev@spark.apache.org Subject: Re: A proposal for Spark 2.0 Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it? Best, -- Nan Zhu http://codingcat.me On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote: Who has ideas about machine learning? Spark is missing some features for machine learning, for example the parameter server. On Nov 12, 2015, at 05:32, Matei Zaharia wrote: I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.
Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on there? I imagine we don't need many code changes, just maybe some REPL stuff. Needless to say, but I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important. Matei On Nov 11, 2015, at 4:58 AM, Sean Owen wrote: On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin wrote: to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow). Agree with this stance. Generally, a major release might also be a time to replace some big old API or implementation with a new one, but I don't see obvious candidates. I wouldn't mind turning attention to 2.x sooner than later, unless there's a fairly good reason to continue adding features in 1.x to a 1.7 release. The scope as of 1.6 is already pretty darned big. 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has been end-of-life. By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be quite stable, and 2.10 will have been EOL for a while. I'd propose dropping 2.10. Otherwise it's supported for 2 more years. 2. Remove Hadoop 1 support. I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were sort of 'alpha' and 'beta' releases) and even <2.6. I'm sure we'll think of a number of other small things -- shading a bunch of stuff? reviewing and updating dependencies in light of simpler, more recent dependencies to support from Hadoop etc? Farming out Tachyon to a module? (I felt like someone proposed this?) 
Pop out any Docker stuff to another repo? Continue that same effort for EC2? Farming out some of the "external" integrations to another repo (? controversial) See also anything marked version "2+" in JIRA. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: A proposal for Spark 2.0
Hmmm... to me, that seems like precisely the kind of thing that argues for retaining the RDD API but not as the first thing presented to new Spark developers: "Here's how to use groupBy with DataFrames. Until the optimizer is more fully developed, that won't always get you the best performance that can be obtained. In these particular circumstances, ..., you may want to use the low-level RDD API while setting preservesPartitioning to true. Like this" On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch wrote: > My understanding is that the RDD's presently have more support for > complete control of partitioning which is a key consideration at scale. > While partitioning control is still piecemeal in DF/DS it would seem > premature to make RDD's a second-tier approach to spark dev. > > An example is the use of groupBy when we know that the source relation > (/RDD) is already partitioned on the grouping expressions. AFAIK the spark > sql still does not allow that knowledge to be applied to the optimizer - so > a full shuffle will be performed. However in the native RDD we can use > preservesPartitioning=true. > > 2015-11-12 17:42 GMT-08:00 Mark Hamstra : > >> The place of the RDD API in 2.0 is also something I've been wondering >> about. I think it may be going too far to deprecate it, but changing >> emphasis is something that we might consider. The RDD API came well before >> DataFrames and DataSets, so programming guides, introductory how-to >> articles and the like have, to this point, also tended to emphasize RDDs -- >> or at least to deal with them early. What I'm thinking is that with 2.0 >> maybe we should overhaul all the documentation to de-emphasize and >> reposition RDDs. In this scheme, DataFrames and DataSets would be >> introduced and fully addressed before RDDs. They would be presented as the >> normal/default/standard way to do things in Spark.
RDDs, in contrast, >> would be presented later as a kind of lower-level, closer-to-the-metal API >> that can be used in atypical, more specialized contexts where DataFrames or >> DataSets don't fully fit. >> >> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao wrote: >> >>> I am not sure what the best practice for this specific problem, but it’s >>> really worth to think about it in 2.0, as it is a painful issue for lots of >>> users. >>> >>> >>> >>> By the way, is it also an opportunity to deprecate the RDD API (or >>> internal API only?)? As lots of its functionality overlapping with >>> DataFrame or DataSet. >>> >>> >>> >>> Hao >>> >>> >>> >>> *From:* Kostas Sakellis [mailto:kos...@cloudera.com] >>> *Sent:* Friday, November 13, 2015 5:27 AM >>> *To:* Nicholas Chammas >>> *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; >>> Reynold Xin >>> >>> *Subject:* Re: A proposal for Spark 2.0 >>> >>> >>> >>> I know we want to keep breaking changes to a minimum but I'm hoping that >>> with Spark 2.0 we can also look at better classpath isolation with user >>> programs. I propose we build on spark.{driver|executor}.userClassPathFirst, >>> setting it true by default, and not allow any spark transitive dependencies >>> to leak into user code. For backwards compatibility we can have a whitelist >>> if we want but I'd be good if we start requiring user apps to explicitly >>> pull in all their dependencies. From what I can tell, Hadoop 3 is also >>> moving in this direction. >>> >>> >>> >>> Kostas >>> >>> >>> >>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas < >>> nicholas.cham...@gmail.com> wrote: >>> >>> With regards to Machine learning, it would be great to move useful >>> features from MLlib to ML and deprecate the former. Current structure of >>> two separate machine learning packages seems to be somewhat confusing. >>> >>> With regards to GraphX, it would be great to deprecate the use of RDD in >>> GraphX and switch to Dataframe. 
This will allow GraphX evolve with Tungsten. >>> >>> On that note of deprecating stuff, it might be good to deprecate some >>> things in 2.0 without removing or replacing them immediately. That way 2.0 >>> doesn’t have to wait for everything that we want to deprecate to be >>> replaced all at once. >>> >>> Nick >>> >>> >>> >>> >>> >>>
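The groupBy/partitioning trade-off discussed in this exchange can be sketched in Scala. A hedged example (the data and names are hypothetical) of the RDD-side technique Stephen describes: pre-partition a pair RDD on its key, then use `mapPartitions` with `preservesPartitioning = true` so that a subsequent key-wise operation can reuse the existing partitioner instead of triggering the full shuffle Spark SQL would currently perform:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PreservePartitioningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partitioning-sketch").setMaster("local[2]"))

    // Hash-partition the pair RDD on its key once, up front.
    val byKey = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      .partitionBy(new HashPartitioner(4))

    // The transformation below does not change keys, so we can assert
    // that the partitioner still holds. Spark then skips the shuffle
    // in the reduceByKey that follows.
    val scaled = byKey.mapPartitions(
      iter => iter.map { case (k, v) => (k, v * 10) },
      preservesPartitioning = true)

    val sums = scaled.reduceByKey(_ + _) // no shuffle: partitioner preserved
    sums.collect().foreach(println)
    sc.stop()
  }
}
```

Note the caveat in the thread: if the lambda did change keys, `preservesPartitioning = true` would be an incorrect assertion and could silently produce wrong results, which is part of why this is positioned as a lower-level escape hatch rather than the default API.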
Re: A proposal for Spark 2.0
My understanding is that the RDD's presently have more support for complete control of partitioning which is a key consideration at scale. While partitioning control is still piecemeal in DF/DS it would seem premature to make RDD's a second-tier approach to spark dev. An example is the use of groupBy when we know that the source relation (/RDD) is already partitioned on the grouping expressions. AFAIK the spark sql still does not allow that knowledge to be applied to the optimizer - so a full shuffle will be performed. However in the native RDD we can use preservesPartitioning=true. 2015-11-12 17:42 GMT-08:00 Mark Hamstra : > The place of the RDD API in 2.0 is also something I've been wondering > about. I think it may be going too far to deprecate it, but changing > emphasis is something that we might consider. The RDD API came well before > DataFrames and DataSets, so programming guides, introductory how-to > articles and the like have, to this point, also tended to emphasize RDDs -- > or at least to deal with them early. What I'm thinking is that with 2.0 > maybe we should overhaul all the documentation to de-emphasize and > reposition RDDs. In this scheme, DataFrames and DataSets would be > introduced and fully addressed before RDDs. They would be presented as the > normal/default/standard way to do things in Spark. RDDs, in contrast, > would be presented later as a kind of lower-level, closer-to-the-metal API > that can be used in atypical, more specialized contexts where DataFrames or > DataSets don't fully fit. > > On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao wrote: > >> I am not sure what the best practice for this specific problem, but it’s >> really worth to think about it in 2.0, as it is a painful issue for lots of >> users. >> >> >> >> By the way, is it also an opportunity to deprecate the RDD API (or >> internal API only?)? As lots of its functionality overlapping with >> DataFrame or DataSet. 
>> >> >> >> Hao >> >> >> >> *From:* Kostas Sakellis [mailto:kos...@cloudera.com] >> *Sent:* Friday, November 13, 2015 5:27 AM >> *To:* Nicholas Chammas >> *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; >> Reynold Xin >> >> *Subject:* Re: A proposal for Spark 2.0 >> >> >> >> I know we want to keep breaking changes to a minimum but I'm hoping that >> with Spark 2.0 we can also look at better classpath isolation with user >> programs. I propose we build on spark.{driver|executor}.userClassPathFirst, >> setting it true by default, and not allow any spark transitive dependencies >> to leak into user code. For backwards compatibility we can have a whitelist >> if we want but I'd be good if we start requiring user apps to explicitly >> pull in all their dependencies. From what I can tell, Hadoop 3 is also >> moving in this direction. >> >> >> >> Kostas >> >> >> >> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas < >> nicholas.cham...@gmail.com> wrote: >> >> With regards to Machine learning, it would be great to move useful >> features from MLlib to ML and deprecate the former. Current structure of >> two separate machine learning packages seems to be somewhat confusing. >> >> With regards to GraphX, it would be great to deprecate the use of RDD in >> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten. >> >> On that note of deprecating stuff, it might be good to deprecate some >> things in 2.0 without removing or replacing them immediately. That way 2.0 >> doesn’t have to wait for everything that we want to deprecate to be >> replaced all at once. >> >> Nick >> >> >> >> >> >> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander < >> alexander.ula...@hpe.com> wrote: >> >> Parameter Server is a new feature and thus does not match the goal of 2.0 >> is “to fix things that are broken in the current API and remove certain >> deprecated APIs”. At the same time I would be happy to have that feature. 
>> >> >> >> With regards to Machine learning, it would be great to move useful >> features from MLlib to ML and deprecate the former. Current structure of >> two separate machine learning packages seems to be somewhat confusing. >> >> With regards to GraphX, it would be great to deprecate the use of RDD in >> GraphX and switch to Dataframe. This will allow GraphX to evolve with Tungsten. >> >> >> >> Best regards, Alexander >> >> >> >> *From:* Nan Zhu [mailto:zhunanmcg...@gmail.com]
Re: A proposal for Spark 2.0
The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. The RDD API came well before DataFrames and Datasets, so programming guides, introductory how-to articles and the like have, to this point, also tended to emphasize RDDs -- or at least to deal with them early. What I'm thinking is that with 2.0 maybe we should overhaul all the documentation to de-emphasize and reposition RDDs. In this scheme, DataFrames and Datasets would be introduced and fully addressed before RDDs. They would be presented as the normal/default/standard way to do things in Spark. RDDs, in contrast, would be presented later as a kind of lower-level, closer-to-the-metal API that can be used in atypical, more specialized contexts where DataFrames or Datasets don't fully fit.

On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao wrote:

> I am not sure what the best practice for this specific problem is, but it's really worth thinking about in 2.0, as it is a painful issue for lots of users.
>
> By the way, is it also an opportunity to deprecate the RDD API (or internal API only?), since lots of its functionality overlaps with DataFrame or Dataset?
>
> Hao
>
> From: Kostas Sakellis [mailto:kos...@cloudera.com]
> Sent: Friday, November 13, 2015 5:27 AM
> To: Nicholas Chammas
> Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; Reynold Xin
> Subject: Re: A proposal for Spark 2.0
>
> I know we want to keep breaking changes to a minimum but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allowing any Spark transitive dependencies to leak into user code.
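[Editor's note: a minimal sketch of the repositioning described above -- the DataFrame form shown first as the default, the RDD form second as the lower-level alternative. `logs.txt` is a hypothetical input file; this is an illustration, not code from the thread.]

```scala
// DataFrame-first style (what the overhauled docs would introduce first).
// read.text yields a DataFrame with a single "value" column (Spark 1.6+):
val lines = sqlContext.read.text("logs.txt")
lines.groupBy("value").count().show()

// Lower-level RDD style (presented later, for specialized contexts):
val counts = sc.textFile("logs.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)
```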
RE: A proposal for Spark 2.0
I am not sure what the best practice for this specific problem is, but it's really worth thinking about in 2.0, as it is a painful issue for lots of users.

By the way, is it also an opportunity to deprecate the RDD API (or internal API only?), since lots of its functionality overlaps with DataFrame or Dataset?

Hao

From: Kostas Sakellis [mailto:kos...@cloudera.com]
Sent: Friday, November 13, 2015 5:27 AM
To: Nicholas Chammas
Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; Reynold Xin
Subject: Re: A proposal for Spark 2.0

I know we want to keep breaking changes to a minimum but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allowing any Spark transitive dependencies to leak into user code. For backwards compatibility we can have a whitelist if we want, but it'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction.

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.
>
> On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn't have to wait for everything that we want to deprecate to be replaced all at once.
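[Editor's note: in configuration terms, the isolation Kostas proposes could be sketched roughly as below. The two userClassPathFirst properties are real Spark settings that exist today (defaulting to false); the whitelist property is purely hypothetical, an assumption about what a compatibility mechanism might look like.]

```
# Existing properties, flipped to true by default under this proposal:
spark.driver.userClassPathFirst     true
spark.executor.userClassPathFirst   true

# Hypothetical: a whitelist of Spark-provided packages that user code
# could still resolve from Spark's classpath for backwards compatibility.
# spark.classpath.sharedPackages    org.apache.hadoop,scala
```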
Re: A proposal for Spark 2.0
I know we want to keep breaking changes to a minimum but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allowing any Spark transitive dependencies to leak into user code. For backwards compatibility we can have a whitelist if we want, but it'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction.

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.
>
> On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn't have to wait for everything that we want to deprecate to be replaced all at once.
>
> Nick
>
> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <alexander.ula...@hpe.com> wrote:
>
>> Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time I would be happy to have that feature.
>>
>> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.
>>
>> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.
Re: A proposal for Spark 2.0
With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn't have to wait for everything that we want to deprecate to be replaced all at once.

Nick

On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander wrote:

> Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time I would be happy to have that feature.
>
> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.
>
> Best regards, Alexander
>
> From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
> Sent: Thursday, November 12, 2015 7:28 AM
> To: wi...@qq.com
> Cc: dev@spark.apache.org
> Subject: Re: A proposal for Spark 2.0
>
> Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?
>
> Best,
>
> --
> Nan Zhu
> http://codingcat.me
>
> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>
> Does anyone have ideas about machine learning? Spark is missing some features for machine learning, for example a parameter server.
RE: A proposal for Spark 2.0
Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time I would be happy to have that feature.

With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

Best regards, Alexander

From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
Sent: Thursday, November 12, 2015 7:28 AM
To: wi...@qq.com
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?

Best,

--
Nan Zhu
http://codingcat.me

On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:

Does anyone have ideas about machine learning? Spark is missing some features for machine learning, for example a parameter server.

On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:

I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on there? I imagine we don't need many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.
Matei

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
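[Editor's note: the MLlib-to-ML move discussed above is essentially a shift from RDD-based to DataFrame-based pipelines. A minimal sketch of the DataFrame-based spark.ml style follows; `training` and `test` are hypothetical DataFrames with "label" and "features" columns, and the parameter values are illustrative.]

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression

// spark.ml estimators operate on DataFrames rather than RDD[LabeledPoint]:
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

val pipeline = new Pipeline().setStages(Array(lr))
val model = pipeline.fit(training)           // training: DataFrame("label", "features")
model.transform(test).select("prediction")   // test: DataFrame with a "features" column
```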
Re: A proposal for Spark 2.0
Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?

Best,

--
Nan Zhu
http://codingcat.me

On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:

> Does anyone have ideas about machine learning? Spark is missing some features for machine learning, for example a parameter server.
Re: A proposal for Spark 2.0
Does anyone have ideas about machine learning? Spark is missing some features for machine learning, for example a parameter server.
Re: A proposal for Spark 2.0
I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on there? I imagine we don't need many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.

Matei

> On Nov 11, 2015, at 4:58 AM, Sean Owen wrote:
>
> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin wrote:
>> to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
>
> Agree with this stance. Generally, a major release might also be a time to replace some big old API or implementation with a new one, but I don't see obvious candidates.
>
> I wouldn't mind turning attention to 2.x sooner than later, unless there's a fairly good reason to continue adding features in 1.x to a 1.7 release. The scope as of 1.6 is already pretty darned big.
>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has been end-of-life.
>
> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be quite stable, and 2.10 will have been EOL for a while. I'd propose dropping 2.10. Otherwise it's supported for 2 more years.
>
>> 2. Remove Hadoop 1 support.
>
> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were sort of 'alpha' and 'beta' releases) and even <2.6.
> I'm sure we'll think of a number of other small things -- shading a bunch of stuff? reviewing and updating dependencies in light of simpler, more recent dependencies to support from Hadoop etc.?
>
> Farming out Tachyon to a module? (I felt like someone proposed this?)
> Pop out any Docker stuff to another repo?
> Continue that same effort for EC2?
> Farming out some of the "external" integrations to another repo (controversial?)
>
> See also anything marked version "2+" in JIRA.
Re: A proposal for Spark 2.0
Resending my earlier message because it wasn't accepted. I would like to add a proposal to upgrade dependency jars when the upgrade does not break APIs and fixes a bug. To be more specific, I would like to see Kryo upgraded from 2.21 to 3.x. Kryo 2.x has a bug (e.g. SPARK-7708) that is blocking its usage in production environments. Other projects like Chill also want to upgrade Kryo to 3.x but are blocked because Spark won't upgrade. I think the OSS community at large will benefit if we can coordinate an upgrade to Kryo 3.x.
Re: A proposal for Spark 2.0
It looks like Chill is willing to upgrade its Kryo to 3.x if Spark and Hive will. As it stands, Spark, Chill, and Hive each ship a Kryo jar, but it really can't be used because Kryo 2 can't serialize/deserialize some classes. Since Spark 2.0 is a major release, it really would be nice if we could resolve the Kryo issue. https://github.com/twitter/chill/pull/230#issuecomment-155845959
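For context, the application-side knobs around Kryo in the 1.x line are configuration only; a minimal sketch is below (`MyRecord` is a hypothetical application class, and note that registration does not fix the underlying Kryo 2.x serialization bug, which is exactly why the jar upgrade matters):

```scala
import org.apache.spark.SparkConf

// `MyRecord` is a hypothetical stand-in for whatever class your job
// needs Kryo to round-trip.
case class MyRecord(id: Long, payload: String)

val conf = new SparkConf()
  .setAppName("kryo-sketch")
  // Kryo is opt-in in the 1.x line; Java serialization is the default:
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registration avoids writing full class names into every record:
  .registerKryoClasses(Array(classOf[MyRecord]))
```

If a class still trips the Kryo 2.x bug, the only real escape hatch is falling back to `JavaSerializer`, which gives up Kryo's speed; hence the interest in coordinating a 3.x upgrade across Spark, Chill, and Hive.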
Re: A proposal for Spark 2.0
If Scala 2.12 will require Java 8 and we want to enable cross-compiling Spark against Scala 2.11 and 2.12, couldn't we just make Java 8 a requirement if you want to use Scala 2.12? On Wed, Nov 11, 2015 at 9:29 AM, Koert Kuipers wrote: > i would drop scala 2.10, but definitely keep java 7 > > cross build for scala 2.12 is great, but i dont know how that works with > java 8 requirement. dont want to make java 8 mandatory. > > and probably stating the obvious, but a lot of apis got polluted due to > binary compatibility requirement. cleaning that up assuming only source > compatibility would be a good idea, right? > > On Tue, Nov 10, 2015 at 6:10 PM, Reynold Xin wrote: > >> I’m starting a new thread since the other one got intermixed with feature >> requests. Please refrain from making feature request in this thread. Not >> that we shouldn’t be adding features, but we can always add features in >> 1.7, 2.1, 2.2, ... >> >> First - I want to propose a premise for how to think about Spark 2.0 and >> major releases in Spark, based on discussion with several members of the >> community: a major release should be low overhead and minimally disruptive >> to the Spark community. A major release should not be very different from a >> minor release and should not be gated based on new features. The main >> purpose of a major release is an opportunity to fix things that are broken >> in the current API and remove certain deprecated APIs (examples follow). >> >> For this reason, I would *not* propose doing major releases to break >> substantial API's or perform large re-architecting that prevent users from >> upgrading. Spark has always had a culture of evolving architecture >> incrementally and making changes - and I don't think we want to change this >> model. In fact, we’ve released many architectural changes on the 1.X line. 
>> >> If the community likes the above model, then to me it seems reasonable to >> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately >> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of >> major releases every 2 years seems doable within the above model. >> >> Under this model, here is a list of example things I would propose doing >> in Spark 2.0, separated into APIs and Operation/Deployment: >> >> >> APIs >> >> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in >> Spark 1.x. >> >> 2. Remove Akka from Spark’s API dependency (in streaming), so user >> applications can use Akka (SPARK-5293). We have gotten a lot of complaints >> about user applications being unable to use Akka due to Spark’s dependency >> on Akka. >> >> 3. Remove Guava from Spark’s public API (JavaRDD Optional). >> >> 4. Better class package structure for low level developer API’s. In >> particular, we have some DeveloperApi (mostly various listener-related >> classes) added over the years. Some packages include only one or two public >> classes but a lot of private classes. A better structure is to have public >> classes isolated to a few public packages, and these public packages should >> have minimal private classes for low level developer APIs. >> >> 5. Consolidate task metric and accumulator API. Although having some >> subtle differences, these two are very similar but have completely >> different code path. >> >> 6. Possibly making Catalyst, Dataset, and DataFrame more general by >> moving them to other package(s). They are already used beyond SQL, e.g. in >> ML pipelines, and will be used by streaming also. >> >> >> Operation/Deployment >> >> 1. Scala 2.11 as the default build. We should still support Scala 2.10, >> but it has been end-of-life. >> >> 2. Remove Hadoop 1 support. >> >> 3. Assembly-free distribution of Spark: don’t require building an >> enormous assembly jar in order to run Spark. >> >> >
Re: A proposal for Spark 2.0
good point about dropping <2.2 for hadoop. you dont want to deal with protobuf 2.4 for example On Wed, Nov 11, 2015 at 4:58 AM, Sean Owen wrote: > On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin wrote: > > to the Spark community. A major release should not be very different > from a > > minor release and should not be gated based on new features. The main > > purpose of a major release is an opportunity to fix things that are > broken > > in the current API and remove certain deprecated APIs (examples follow). > > Agree with this stance. Generally, a major release might also be a > time to replace some big old API or implementation with a new one, but > I don't see obvious candidates. > > I wouldn't mind turning attention to 2.x sooner than later, unless > there's a fairly good reason to continue adding features in 1.x to a > 1.7 release. The scope as of 1.6 is already pretty darned big. > > > > 1. Scala 2.11 as the default build. We should still support Scala 2.10, > but > > it has been end-of-life. > > By the time 2.x rolls around, 2.12 will be the main version, 2.11 will > be quite stable, and 2.10 will have been EOL for a while. I'd propose > dropping 2.10. Otherwise it's supported for 2 more years. > > > > 2. Remove Hadoop 1 support. > > I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were > sort of 'alpha' and 'beta' releases) and even <2.6. > > I'm sure we'll think of a number of other small things -- shading a > bunch of stuff? reviewing and updating dependencies in light of > simpler, more recent dependencies to support from Hadoop etc? > > Farming out Tachyon to a module? (I felt like someone proposed this?) > Pop out any Docker stuff to another repo? > Continue that same effort for EC2? > Farming out some of the "external" integrations to another repo (? > controversial) > > See also anything marked version "2+" in JIRA. 
Re: A proposal for Spark 2.0
i would drop scala 2.10, but definitely keep java 7 cross build for scala 2.12 is great, but i dont know how that works with java 8 requirement. dont want to make java 8 mandatory. and probably stating the obvious, but a lot of apis got polluted due to binary compatibility requirement. cleaning that up assuming only source compatibility would be a good idea, right? On Tue, Nov 10, 2015 at 6:10 PM, Reynold Xin wrote: > I’m starting a new thread since the other one got intermixed with feature > requests. Please refrain from making feature request in this thread. Not > that we shouldn’t be adding features, but we can always add features in > 1.7, 2.1, 2.2, ... > > First - I want to propose a premise for how to think about Spark 2.0 and > major releases in Spark, based on discussion with several members of the > community: a major release should be low overhead and minimally disruptive > to the Spark community. A major release should not be very different from a > minor release and should not be gated based on new features. The main > purpose of a major release is an opportunity to fix things that are broken > in the current API and remove certain deprecated APIs (examples follow). > > For this reason, I would *not* propose doing major releases to break > substantial API's or perform large re-architecting that prevent users from > upgrading. Spark has always had a culture of evolving architecture > incrementally and making changes - and I don't think we want to change this > model. In fact, we’ve released many architectural changes on the 1.X line. > > If the community likes the above model, then to me it seems reasonable to > do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately > after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of > major releases every 2 years seems doable within the above model. 
> > Under this model, here is a list of example things I would propose doing > in Spark 2.0, separated into APIs and Operation/Deployment: > > > APIs > > 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in > Spark 1.x. > > 2. Remove Akka from Spark’s API dependency (in streaming), so user > applications can use Akka (SPARK-5293). We have gotten a lot of complaints > about user applications being unable to use Akka due to Spark’s dependency > on Akka. > > 3. Remove Guava from Spark’s public API (JavaRDD Optional). > > 4. Better class package structure for low level developer API’s. In > particular, we have some DeveloperApi (mostly various listener-related > classes) added over the years. Some packages include only one or two public > classes but a lot of private classes. A better structure is to have public > classes isolated to a few public packages, and these public packages should > have minimal private classes for low level developer APIs. > > 5. Consolidate task metric and accumulator API. Although having some > subtle differences, these two are very similar but have completely > different code path. > > 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving > them to other package(s). They are already used beyond SQL, e.g. in ML > pipelines, and will be used by streaming also. > > > Operation/Deployment > > 1. Scala 2.11 as the default build. We should still support Scala 2.10, > but it has been end-of-life. > > 2. Remove Hadoop 1 support. > > 3. Assembly-free distribution of Spark: don’t require building an enormous > assembly jar in order to run Spark. > >
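The API pollution Koert mentions typically comes from overload ladders kept for binary compatibility. A toy sketch (method names are hypothetical, not actual Spark APIs) of how promising only source compatibility would let such ladders collapse:

```scala
// Binary compatibility means every previously published signature must
// keep existing, so an evolving API accumulates an overload ladder:
class LadderApi {
  def sample(fraction: Double): String = sample(fraction, 0L)
  def sample(fraction: Double, seed: Long): String = s"sample($fraction, $seed)"
}

// With only *source* compatibility promised, the ladder collapses into a
// single method with a default argument; call sites recompile unchanged:
class CollapsedApi {
  def sample(fraction: Double, seed: Long = 0L): String = s"sample($fraction, $seed)"
}
```

Existing sources like `api.sample(0.1)` compile against either shape, but a jar built against `LadderApi` would fail to link against `CollapsedApi` (default arguments are resolved at compile time), which is exactly the trade-off being discussed.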
Re: A proposal for Spark 2.0
Hi, Reconsidering the execution model behind Streaming would be a good candidate here, as Spark will not be able to provide the low latency and sophisticated windowing semantics that more and more use cases will require. Maybe relaxing the strict batch model would help a lot. (Mainly this would hit the shuffling, but the shuffle package suffers from overlapping functionality and a lack of good modularity anyway. Look at how coalesce is implemented, for example; inefficiency also kicks in there.) On Wed, Nov 11, 2015 at 12:48 PM Tim Preece wrote: > Considering Spark 2.x will run for 2 years, would moving up to Scala 2.12 ( > pencilled in for Jan 2016 ) make any sense ? - although that would then > pre-req Java 8.
Re: A proposal for Spark 2.0
Considering Spark 2.x will run for 2 years, would moving up to Scala 2.12 ( pencilled in for Jan 2016 ) make any sense ? - although that would then pre-req Java 8.
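Cross-building is mostly a build-level switch. A build.sbt sketch of what supporting 2.11 and 2.12 side by side could look like (version numbers are illustrative, since Scala 2.12 was not yet final at the time):

```scala
// build.sbt fragment (sketch). `+compile` / `+publish` iterate over
// every version listed in crossScalaVersions.
scalaVersion := "2.11.7"
crossScalaVersions := Seq("2.10.6", "2.11.7", "2.12.0")

// Scala 2.12 targets Java 8 bytecode, so the JDK 8 requirement can be
// scoped to the 2.12 build alone rather than imposed on every user:
scalacOptions ++= {
  if (scalaVersion.value.startsWith("2.12")) Seq("-target:jvm-1.8")
  else Seq("-target:jvm-1.7")
}
```

This is the sense in which Java 8 could be a requirement only "if you want to use Scala 2.12": the 2.10/2.11 artifacts would keep working on Java 7.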
Re: A proposal for Spark 2.0
On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin wrote: > to the Spark community. A major release should not be very different from a > minor release and should not be gated based on new features. The main > purpose of a major release is an opportunity to fix things that are broken > in the current API and remove certain deprecated APIs (examples follow). Agree with this stance. Generally, a major release might also be a time to replace some big old API or implementation with a new one, but I don't see obvious candidates. I wouldn't mind turning attention to 2.x sooner than later, unless there's a fairly good reason to continue adding features in 1.x to a 1.7 release. The scope as of 1.6 is already pretty darned big. > 1. Scala 2.11 as the default build. We should still support Scala 2.10, but > it has been end-of-life. By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be quite stable, and 2.10 will have been EOL for a while. I'd propose dropping 2.10. Otherwise it's supported for 2 more years. > 2. Remove Hadoop 1 support. I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were sort of 'alpha' and 'beta' releases) and even <2.6. I'm sure we'll think of a number of other small things -- shading a bunch of stuff? reviewing and updating dependencies in light of simpler, more recent dependencies to support from Hadoop etc? Farming out Tachyon to a module? (I felt like someone proposed this?) Pop out any Docker stuff to another repo? Continue that same effort for EC2? Farming out some of the "external" integrations to another repo (? controversial) See also anything marked version "2+" in JIRA.
Re: A proposal for Spark 2.0
Hi, I fully agree with that. Actually, I'm working on a PR to add "client" and "exploded" profiles to the Maven build. The client profile creates a spark-client-assembly jar, much more lightweight than the spark-assembly. In our case, we construct jobs that don't require all of the Spark server side. Right now the minimal size of the generated jar is about 120MB, which is painful at spark-submit time. That's why I started to remove unnecessary dependencies from spark-assembly. On the other hand, I'm also working on the "exploded" mode: instead of using a fat monolithic spark-assembly jar file, an exploded layout allows users to view/change the dependencies. For the client profile, I already have something ready, and I will propose the PR very soon (by the end of this week hopefully). For the exploded profile, I need more time. My $0.02 Regards JB On 11/11/2015 12:53 AM, Reynold Xin wrote: On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas wrote: > 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark. Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means. Right now we ship Spark using a single assembly jar, which causes a few different problems: - the total number of classes is limited on some configurations - dependency swapping is harder The proposal is to just avoid a single fat jar. -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
Re: A proposal for Spark 2.0
Agree, it makes sense. Regards JB On 11/11/2015 01:28 AM, Reynold Xin wrote: Echoing Shivaram here. I don't think it makes a lot of sense to add more features to the 1.x line. We should still do critical bug fixes though. On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman mailto:shiva...@eecs.berkeley.edu>> wrote: +1 On a related note I think making it lightweight will ensure that we stay on the current release schedule and don't unnecessarily delay 2.0 to wait for new features / big architectural changes. In terms of fixes to 1.x, I think our current policy of back-porting fixes to older releases would still apply. I don't think developing new features on both 1.x and 2.x makes a lot of sense as we would like users to switch to 2.x. Shivaram On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis mailto:kos...@cloudera.com>> wrote: > +1 on a lightweight 2.0 > > What is the thinking around the 1.x line after Spark 2.0 is released? If not > terminated, how will we determine what goes into each major version line? > Will 1.x only be for stability fixes? > > Thanks, > Kostas > > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell mailto:pwend...@gmail.com>> wrote: >> >> I also feel the same as Reynold. I agree we should minimize API breaks and >> focus on fixing things around the edge that were mistakes (e.g. exposing >> Guava and Akka) rather than any overhaul that could fragment the community. >> Ideally a major release is a lightweight process we can do every couple of >> years, with minimal impact for users. >> >> - Patrick >> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas >> mailto:nicholas.cham...@gmail.com>> wrote: >>> >>> > For this reason, I would *not* propose doing major releases to break >>> > substantial API's or perform large re-architecting that prevent users from >>> > upgrading. Spark has always had a culture of evolving architecture >>> > incrementally and making changes - and I don't think we want to change this >>> > model. >>> >>> +1 for this. 
The Python community went through a lot of turmoil over the >>> Python 2 -> Python 3 transition because the upgrade process was too painful >>> for too long. The Spark community will benefit greatly from our explicitly >>> looking to avoid a similar situation. >>> >>> > 3. Assembly-free distribution of Spark: don’t require building an >>> > enormous assembly jar in order to run Spark. >>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free >>> distribution means. >>> >>> Nick >>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin mailto:r...@databricks.com>> wrote: I’m starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature request in this thread. Not that we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 2.2, ... First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow). For this reason, I would *not* propose doing major releases to break substantial API's or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we’ve released many architectural changes on the 1.X line. If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model. 
Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment: APIs 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x. 2. Remove Akka from Spark’s API dep
Re: A proposal for Spark 2.0
To take a stab at an example of something concrete and anticipatory I can go back to something I mentioned previously. It's not really a good example because I don't mean to imply that I believe that its premises are true, but try to go with it. If we were to decide that real-time, event-based streaming is something that we really think we'll want to do in the 2.x cycle and that the current API (after having deprecations removed and clear mistakes/inadequacies remedied) isn't adequate to support that, would we want to "take our best shot" at defining a new API at the outset of 2.0? Another way of looking at it is whether API changes in 2.0 should be entirely backward-looking, trying to fix problems that we've already identified, or whether there is room for some forward-looking changes that are intended to open new directions for Spark development. On Tue, Nov 10, 2015 at 7:04 PM, Mark Hamstra wrote: > Heh... ok, I was intentionally pushing those bullet points to be extreme > to find where people would start pushing back, and I'll agree that we do > probably want some new features in 2.0 -- but I think we've got good > agreement that new features aren't really the main point of doing a 2.0 > release. > > I don't really have a concrete example of an anticipatory change, and > that's actually kind of the problem with trying to anticipate what we'll > need in the way of new public API and the like: Until what we already have > is clearly inadequate, it's hard to concretely imagine how things really > should be. At this point I don't have anything specific where I can say "I > really want to do __ with Spark in the future, and I think it should be > changed in this way in 2.0 to allow me to do that." 
I'm just wondering > whether we want to even entertain those kinds of change requests if people > have them, or whether we can just delay making those kinds of decisions > until it is really obvious that what we have does't work and that there is > clearly something better that should be done. > > On Tue, Nov 10, 2015 at 6:51 PM, Reynold Xin wrote: > >> Mark, >> >> I think we are in agreement, although I wouldn't go to the extreme and >> say "a release with no new features might even be best." >> >> Can you elaborate "anticipatory changes"? A concrete example or so would >> be helpful. >> >> On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra >> wrote: >> >>> I'm liking the way this is shaping up, and I'd summarize it this way >>> (let me know if I'm misunderstanding or misrepresenting anything): >>> >>>- New features are not at all the focus of Spark 2.0 -- in fact, a >>>release with no new features might even be best. >>>- Remove deprecated API that we agree really should be deprecated. >>>- Fix/change publicly-visible things that anyone who has spent any >>>time looking at already knows are mistakes or should be done better, but >>>that can't be changed within 1.x. >>> >>> Do we want to attempt anticipatory changes at all? In other words, are >>> there things we want to do in 2.x for which we already know that we'll want >>> to make publicly-visible changes or that, if we don't add or change it now, >>> will fall into the "everybody knows it shouldn't be that way" category when >>> it comes time to discuss the Spark 3.0 release? I'd be fine if we don't >>> try at all to anticipate what is needed -- working from the premise that >>> being forced into a 3.x release earlier than we expect would be less >>> painful than trying to back out a mistake made at the outset of 2.0 while >>> trying to guess what we'll need. >>> >>> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin >>> wrote: >>> I’m starting a new thread since the other one got intermixed with feature requests. 
Please refrain from making feature request in this thread. Not that we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 2.2, ... First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow). For this reason, I would *not* propose doing major releases to break substantial API's or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we’ve released many architectural changes on the 1.X line. If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spa
Re: A proposal for Spark 2.0
Heh... ok, I was intentionally pushing those bullet points to be extreme to find where people would start pushing back, and I'll agree that we do probably want some new features in 2.0 -- but I think we've got good agreement that new features aren't really the main point of doing a 2.0 release. I don't really have a concrete example of an anticipatory change, and that's actually kind of the problem with trying to anticipate what we'll need in the way of new public API and the like: Until what we already have is clearly inadequate, it's hard to concretely imagine how things really should be. At this point I don't have anything specific where I can say "I really want to do __ with Spark in the future, and I think it should be changed in this way in 2.0 to allow me to do that." I'm just wondering whether we want to even entertain those kinds of change requests if people have them, or whether we can just delay making those kinds of decisions until it is really obvious that what we have doesn't work and that there is clearly something better that should be done. On Tue, Nov 10, 2015 at 6:51 PM, Reynold Xin wrote: > Mark, > > I think we are in agreement, although I wouldn't go to the extreme and say > "a release with no new features might even be best." > > Can you elaborate "anticipatory changes"? A concrete example or so would > be helpful. > > On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra > wrote: > >> I'm liking the way this is shaping up, and I'd summarize it this way (let >> me know if I'm misunderstanding or misrepresenting anything): >> >>- New features are not at all the focus of Spark 2.0 -- in fact, a >>release with no new features might even be best. >>- Remove deprecated API that we agree really should be deprecated. >>- Fix/change publicly-visible things that anyone who has spent any >>time looking at already knows are mistakes or should be done better, but >>that can't be changed within 1.x. >> >> Do we want to attempt anticipatory changes at all? 
In other words, are >> there things we want to do in 2.x for which we already know that we'll want >> to make publicly-visible changes or that, if we don't add or change it now, >> will fall into the "everybody knows it shouldn't be that way" category when >> it comes time to discuss the Spark 3.0 release? I'd be fine if we don't >> try at all to anticipate what is needed -- working from the premise that >> being forced into a 3.x release earlier than we expect would be less >> painful than trying to back out a mistake made at the outset of 2.0 while >> trying to guess what we'll need. >> >> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin wrote: >> >>> I’m starting a new thread since the other one got intermixed with >>> feature requests. Please refrain from making feature request in this >>> thread. Not that we shouldn’t be adding features, but we can always add >>> features in 1.7, 2.1, 2.2, ... >>> >>> First - I want to propose a premise for how to think about Spark 2.0 and >>> major releases in Spark, based on discussion with several members of the >>> community: a major release should be low overhead and minimally disruptive >>> to the Spark community. A major release should not be very different from a >>> minor release and should not be gated based on new features. The main >>> purpose of a major release is an opportunity to fix things that are broken >>> in the current API and remove certain deprecated APIs (examples follow). >>> >>> For this reason, I would *not* propose doing major releases to break >>> substantial API's or perform large re-architecting that prevent users from >>> upgrading. Spark has always had a culture of evolving architecture >>> incrementally and making changes - and I don't think we want to change this >>> model. In fact, we’ve released many architectural changes on the 1.X line. 
>>> >>> If the community likes the above model, then to me it seems reasonable >>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or >>> immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A >>> cadence of major releases every 2 years seems doable within the above model. >>> >>> Under this model, here is a list of example things I would propose doing >>> in Spark 2.0, separated into APIs and Operation/Deployment: >>> >>> >>> APIs >>> >>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in >>> Spark 1.x. >>> >>> 2. Remove Akka from Spark’s API dependency (in streaming), so user >>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints >>> about user applications being unable to use Akka due to Spark’s dependency >>> on Akka. >>> >>> 3. Remove Guava from Spark’s public API (JavaRDD Optional). >>> >>> 4. Better class package structure for low level developer API’s. In >>> particular, we have some DeveloperApi (mostly various listener-related >>> classes) added over the years. Some packages include only one or two public >>> classes but a lot of pri
Re: A proposal for Spark 2.0
On Tue, Nov 10, 2015 at 6:51 PM, Reynold Xin wrote: > I think we are in agreement, although I wouldn't go to the extreme and say > "a release with no new features might even be best." > > Can you elaborate "anticipatory changes"? A concrete example or so would be > helpful. I don't know if that's what Mark had in mind, but I'd count the "remove Guava Optional from Java API" in that category. It would be nice to have an alternative before that API is removed, although I have no idea how you'd do it nicely, given that they're all in return types (so overloading doesn't really work).
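The point about return types can be made concrete: overloads must differ in their parameter lists, so a Guava `Optional` sitting in a return position cannot be migrated by simply adding an overload. A self-contained sketch (all names here are hypothetical stand-ins, not the real JavaRDD API):

```scala
// Stand-in for com.google.common.base.Optional, so this sketch compiles
// without Guava on the classpath:
final case class GuavaStyleOptional[A](underlying: Option[A])

class PairApi {
  // 1.x-era signature leaking the Guava-style type in the return position:
  def firstValue(xs: Seq[String]): GuavaStyleOptional[String] =
    GuavaStyleOptional(xs.headOption)

  // An overload differing only in return type is rejected by the compiler
  // (double definition), so this cannot be staged as an overload:
  //   def firstValue(xs: Seq[String]): java.util.Optional[String]
  // Any transition path therefore needs a *new* method name; java.util.Optional
  // is used here purely as "some other Optional" and assumes Java 8:
  def firstValueJava(xs: Seq[String]): java.util.Optional[String] =
    xs.headOption.fold(java.util.Optional.empty[String]())(v => java.util.Optional.of(v))
}
```

So the realistic choices are a parallel method name like the hypothetical `firstValueJava` above, or a clean break in 2.0, which is why this lands in the "anticipatory changes" bucket.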
Re: A proposal for Spark 2.0
Mark, I think we are in agreement, although I wouldn't go to the extreme and say "a release with no new features might even be best." Can you elaborate "anticipatory changes"? A concrete example or so would be helpful. On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra wrote: > I'm liking the way this is shaping up, and I'd summarize it this way (let > me know if I'm misunderstanding or misrepresenting anything): > >- New features are not at all the focus of Spark 2.0 -- in fact, a >release with no new features might even be best. >- Remove deprecated API that we agree really should be deprecated. >- Fix/change publicly-visible things that anyone who has spent any >time looking at already knows are mistakes or should be done better, but >that can't be changed within 1.x. > > Do we want to attempt anticipatory changes at all? In other words, are > there things we want to do in 2.x for which we already know that we'll want > to make publicly-visible changes or that, if we don't add or change it now, > will fall into the "everybody knows it shouldn't be that way" category when > it comes time to discuss the Spark 3.0 release? I'd be fine if we don't > try at all to anticipate what is needed -- working from the premise that > being forced into a 3.x release earlier than we expect would be less > painful than trying to back out a mistake made at the outset of 2.0 while > trying to guess what we'll need. > > On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin wrote: > >> I’m starting a new thread since the other one got intermixed with feature >> requests. Please refrain from making feature request in this thread. Not >> that we shouldn’t be adding features, but we can always add features in >> 1.7, 2.1, 2.2, ... >> >> First - I want to propose a premise for how to think about Spark 2.0 and >> major releases in Spark, based on discussion with several members of the >> community: a major release should be low overhead and minimally disruptive >> to the Spark community. 
A major release should not be very different from a >> minor release and should not be gated based on new features. The main >> purpose of a major release is an opportunity to fix things that are broken >> in the current API and remove certain deprecated APIs (examples follow). >> >> For this reason, I would *not* propose doing major releases to break >> substantial APIs or perform large re-architecting that prevents users from >> upgrading. Spark has always had a culture of evolving architecture >> incrementally and making changes - and I don't think we want to change this >> model. In fact, we’ve released many architectural changes on the 1.X line. >> >> If the community likes the above model, then to me it seems reasonable to >> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately >> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of >> major releases every 2 years seems doable within the above model. >> >> Under this model, here is a list of example things I would propose doing >> in Spark 2.0, separated into APIs and Operation/Deployment: >> >> >> APIs >> >> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in >> Spark 1.x. >> >> 2. Remove Akka from Spark’s API dependency (in streaming), so user >> applications can use Akka (SPARK-5293). We have gotten a lot of complaints >> about user applications being unable to use Akka due to Spark’s dependency >> on Akka. >> >> 3. Remove Guava from Spark’s public API (JavaRDD Optional). >> >> 4. Better class package structure for low level developer APIs. In >> particular, we have some DeveloperApi (mostly various listener-related >> classes) added over the years. Some packages include only one or two public >> classes but a lot of private classes. A better structure is to have public >> classes isolated to a few public packages, and these public packages should >> have minimal private classes for low level developer APIs. >> >> 5. 
Consolidate task metric and accumulator API. Although they have some >> subtle differences, these two are very similar but have completely >> different code paths. >> >> 6. Possibly making Catalyst, Dataset, and DataFrame more general by >> moving them to other package(s). They are already used beyond SQL, e.g. in >> ML pipelines, and will be used by streaming also. >> >> >> Operation/Deployment >> >> 1. Scala 2.11 as the default build. We should still support Scala 2.10, >> but it has reached end-of-life. >> >> 2. Remove Hadoop 1 support. >> >> 3. Assembly-free distribution of Spark: don’t require building an >> enormous assembly jar in order to run Spark. >> >> >
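For context on item 3 in the API list above (removing Guava from the public API): the problem is that `JavaRDD`/`JavaPairRDD` methods returned Guava's `com.google.common.base.Optional`, so every user application had to compile against whatever Guava version sat on Spark's classpath, and Spark could neither shade nor upgrade Guava without breaking callers. The sketch below is illustrative only, not Spark code; the class and method names are invented. (Spark 2.0 ultimately replaced Guava's `Optional` with its own `org.apache.spark.api.java.Optional`, since `java.util.Optional` requires Java 8.)

```java
// Illustrative sketch only -- not Spark code. Shows why returning a
// third-party type from a public API couples every user to that library's
// version, and the decoupled alternative: return a type the project
// (or the JDK) owns.
import java.util.Optional;

public class OptionalApiSketch {

    // Before (hypothetical): the signature leaks Guava into user code.
    //   public com.google.common.base.Optional<String> lookup(String key)
    // Any caller must compile against the exact Guava version on the
    // framework's classpath, so the framework can't shade or upgrade Guava.

    // After: the signature uses a JDK-owned (or project-owned) type, and
    // Guava becomes a private implementation detail that can change freely.
    static Optional<String> lookup(String key) {
        return "driver.host".equals(key) ? Optional.of("localhost") : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(lookup("driver.host").orElse("unset")); // prints "localhost"
        System.out.println(lookup("driver.port").orElse("unset")); // prints "unset"
    }
}
```

The same shading argument applies to item 2 (Akka): a dependency that appears in public signatures can never be hidden from user applications.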
Re: A proposal for Spark 2.0
Agree. If it is deprecated, get rid of it in 2.0. If the deprecation was a mistake, let's fix that. Suds Sent from my iPhone On Nov 10, 2015, at 5:04 PM, Reynold Xin wrote: Maybe a better idea is to un-deprecate an API if it is too important to not be removed. I don't think we can drop Java 7 support. It's way too soon. On Tue, Nov 10, 2015 at 4:59 PM, Mark Hamstra wrote: > Really, Sandy? "Extra consideration" even for already-deprecated API? If > we're not going to remove these with a major version change, then just when > will we remove them? > > On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza > wrote: > >> Another +1 to Reynold's proposal. >> >> Maybe this is obvious, but I'd like to advocate against a blanket removal >> of deprecated / developer APIs. Many APIs can likely be removed without >> material impact (e.g. the SparkContext constructor that takes preferred >> node location data), while others likely see heavier usage (e.g. I wouldn't >> be surprised if mapPartitionsWithContext was baked into a number of apps) >> and merit a little extra consideration. >> >> Maybe also obvious, but I think a migration guide with API equivalents and >> the like would be incredibly useful in easing the transition. >> >> -Sandy >> >> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin wrote: >> >>> Echoing Shivaram here. I don't think it makes a lot of sense to add more >>> features to the 1.x line. We should still do critical bug fixes though. >>> >>> >>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman < >>> shiva...@eecs.berkeley.edu> wrote: >>> +1 On a related note I think making it lightweight will ensure that we stay on the current release schedule and don't unnecessarily delay 2.0 to wait for new features / big architectural changes. In terms of fixes to 1.x, I think our current policy of back-porting fixes to older releases would still apply. I don't think developing new features on both 1.x and 2.x makes a lot of sense as we would like users to switch to 2.x. 
Shivaram On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis wrote: > +1 on a lightweight 2.0 > > What is the thinking around the 1.x line after Spark 2.0 is released? If not > terminated, how will we determine what goes into each major version line? > Will 1.x only be for stability fixes? > > Thanks, > Kostas > > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell wrote: >> >> I also feel the same as Reynold. I agree we should minimize API breaks and >> focus on fixing things around the edge that were mistakes (e.g. exposing >> Guava and Akka) rather than any overhaul that could fragment the community. >> Ideally a major release is a lightweight process we can do every couple of >> years, with minimal impact for users. >> >> - Patrick >> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas >> wrote: >>> >>> > For this reason, I would *not* propose doing major releases to break >>> > substantial APIs or perform large re-architecting that prevents users from >>> > upgrading. Spark has always had a culture of evolving architecture >>> > incrementally and making changes - and I don't think we want to change this >>> > model. >>> >>> +1 for this. The Python community went through a lot of turmoil over the >>> Python 2 -> Python 3 transition because the upgrade process was too painful >>> for too long. The Spark community will benefit greatly from our explicitly >>> looking to avoid a similar situation. >>> >>> > 3. Assembly-free distribution of Spark: don’t require building an >>> > enormous assembly jar in order to run Spark. >>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free >>> distribution means. >>> >>> Nick >>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin wrote: ...
Re: A proposal for Spark 2.0
I'm liking the way this is shaping up, and I'd summarize it this way (let me know if I'm misunderstanding or misrepresenting anything):
- New features are not at all the focus of Spark 2.0 -- in fact, a release with no new features might even be best.
- Remove deprecated API that we agree really should be deprecated.
- Fix/change publicly-visible things that anyone who has spent any time looking at already knows are mistakes or should be done better, but that can't be changed within 1.x.
Do we want to attempt anticipatory changes at all? In other words, are there things we want to do in 2.x for which we already know that we'll want to make publicly-visible changes or that, if we don't add or change it now, will fall into the "everybody knows it shouldn't be that way" category when it comes time to discuss the Spark 3.0 release? I'd be fine if we don't try at all to anticipate what is needed -- working from the premise that being forced into a 3.x release earlier than we expect would be less painful than trying to back out a mistake made at the outset of 2.0 while trying to guess what we'll need. On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin wrote: > ...
Re: A proposal for Spark 2.0
Maybe a better idea is to un-deprecate an API if it is too important to not be removed. I don't think we can drop Java 7 support. It's way too soon. On Tue, Nov 10, 2015 at 4:59 PM, Mark Hamstra wrote: > ...
Re: A proposal for Spark 2.0
Really, Sandy? "Extra consideration" even for already-deprecated API? If we're not going to remove these with a major version change, then just when will we remove them? On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza wrote: > ...
Re: A proposal for Spark 2.0
Oh and another question - should Spark 2.0 support Java 7? On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza wrote: > ...
Re: A proposal for Spark 2.0
Another +1 to Reynold's proposal. Maybe this is obvious, but I'd like to advocate against a blanket removal of deprecated / developer APIs. Many APIs can likely be removed without material impact (e.g. the SparkContext constructor that takes preferred node location data), while others likely see heavier usage (e.g. I wouldn't be surprised if mapPartitionsWithContext was baked into a number of apps) and merit a little extra consideration. Maybe also obvious, but I think a migration guide with API equivalents and the like would be incredibly useful in easing the transition. -Sandy On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin wrote: > ...
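The `mapPartitionsWithContext` case mentioned above is exactly where a migration-guide entry earns its keep: when that method was deprecated (in Spark 1.2), the suggested replacement was plain `mapPartitions` combined with `TaskContext.get()`, which exposes the running task's context through a static, thread-local accessor instead of an extra function parameter. Below is a toy, Spark-free sketch of that pattern; all names are illustrative, not Spark's actual classes.

```java
// Toy sketch (not Spark code) of the pattern behind replacing
// mapPartitionsWithContext: instead of the framework passing a context
// object as an argument to the user function, the function asks a static
// thread-local accessor for it -- the shape of TaskContext.get() in Spark.
import java.util.function.Function;

public class TaskContextSketch {
    private static final ThreadLocal<Integer> PARTITION = new ThreadLocal<>();

    // What user code calls from inside its function (cf. TaskContext.get()).
    static Integer get() { return PARTITION.get(); }

    // "Framework" side: publish the context before running the user
    // function, and clear it afterwards so the worker thread can be reused.
    static <T> T runPartition(int partitionId, Function<Integer, T> userFn) {
        PARTITION.set(partitionId);
        try {
            return userFn.apply(partitionId);
        } finally {
            PARTITION.remove();
        }
    }

    public static void main(String[] args) {
        // Old style: context arrives as the function's parameter.
        // New style: the function ignores the parameter and calls get().
        String result = runPartition(3, ignored -> "partition-" + get());
        System.out.println(result); // prints "partition-3"
    }
}
```

The upside of the accessor style is that deprecating the context-taking overload removes a whole family of near-duplicate API methods; the cost, which a migration guide should flag, is that the accessor only works on the thread the framework set it on.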
Re: A proposal for Spark 2.0
Echoing Shivaram here. I don't think it makes a lot of sense to add more features to the 1.x line. We should still do critical bug fixes though. On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > +1 > > On a related note I think making it lightweight will ensure that we > stay on the current release schedule and don't unnecessarily delay 2.0 > to wait for new features / big architectural changes. > > In terms of fixes to 1.x, I think our current policy of back-porting > fixes to older releases would still apply. I don't think developing > new features on both 1.x and 2.x makes a lot of sense as we would like > users to switch to 2.x. > > Shivaram > > On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis > wrote: > > +1 on a lightweight 2.0 > > > > What is the thinking around the 1.x line after Spark 2.0 is released? If > not > > terminated, how will we determine what goes into each major version line? > > Will 1.x only be for stability fixes? > > > > Thanks, > > Kostas > > > > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell > wrote: > >> > >> I also feel the same as Reynold. I agree we should minimize API breaks > and > >> focus on fixing things around the edge that were mistakes (e.g. exposing > >> Guava and Akka) rather than any overhaul that could fragment the > community. > >> Ideally a major release is a lightweight process we can do every couple > of > >> years, with minimal impact for users. > >> > >> - Patrick > >> > >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas > >> wrote: > >>> > >>> > For this reason, I would *not* propose doing major releases to break > >>> > substantial API's or perform large re-architecting that prevent > users from > >>> > upgrading. Spark has always had a culture of evolving architecture > >>> > incrementally and making changes - and I don't think we want to > change this > >>> > model. > >>> > >>> +1 for this. 
>>> The Python community went through a lot of turmoil over the Python 2 -> Python 3 transition because the upgrade process was too painful for too long. The Spark community will benefit greatly from our explicitly looking to avoid a similar situation.
>>>
>>> > 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
>>>
>>> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.
>>>
>>> Nick
Re: A proposal for Spark 2.0
+1 On a related note I think making it lightweight will ensure that we stay on the current release schedule and don't unnecessarily delay 2.0 to wait for new features / big architectural changes. In terms of fixes to 1.x, I think our current policy of back-porting fixes to older releases would still apply. I don't think developing new features on both 1.x and 2.x makes a lot of sense as we would like users to switch to 2.x. Shivaram On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis wrote: > +1 on a lightweight 2.0 > > What is the thinking around the 1.x line after Spark 2.0 is released? If not > terminated, how will we determine what goes into each major version line? > Will 1.x only be for stability fixes? > > Thanks, > Kostas > > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell wrote: >> >> I also feel the same as Reynold. I agree we should minimize API breaks and >> focus on fixing things around the edge that were mistakes (e.g. exposing >> Guava and Akka) rather than any overhaul that could fragment the community. >> Ideally a major release is a lightweight process we can do every couple of >> years, with minimal impact for users. >> >> - Patrick >> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas >> wrote: >>> >>> > For this reason, I would *not* propose doing major releases to break >>> > substantial API's or perform large re-architecting that prevent users from >>> > upgrading. Spark has always had a culture of evolving architecture >>> > incrementally and making changes - and I don't think we want to change >>> > this >>> > model. >>> >>> +1 for this. The Python community went through a lot of turmoil over the >>> Python 2 -> Python 3 transition because the upgrade process was too painful >>> for too long. The Spark community will benefit greatly from our explicitly >>> looking to avoid a similar situation. >>> >>> > 3. Assembly-free distribution of Spark: don’t require building an >>> > enormous assembly jar in order to run Spark. 
>>> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.
>>>
>>> Nick
Re: A proposal for Spark 2.0
It would also be good to fix API breakages introduced as part of 1.0 (where functionality is now missing), overhaul and remove all deprecated configs/features/combinations, and make the changes to the public API that have been deferred during minor releases.

Regards,
Mridul

On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin wrote:
> I’m starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature requests in this thread. Not that we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break substantial APIs or perform large re-architecting that prevents users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we’ve released many architectural changes on the 1.X line.
>
> If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment:
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user applications can use Akka (SPARK-5293). We have gotten a lot of complaints about user applications being unable to use Akka due to Spark’s dependency on Akka.
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>
> 4. Better class package structure for low-level developer APIs. In particular, we have some DeveloperApi (mostly various listener-related classes) added over the years. Some packages include only one or two public classes but a lot of private classes. A better structure is to have public classes isolated to a few public packages, and these public packages should have minimal private classes for low-level developer APIs.
>
> 5. Consolidate the task metric and accumulator APIs. Despite some subtle differences, these two are very similar but have completely different code paths.
>
> 6. Possibly make Catalyst, Dataset, and DataFrame more general by moving them to other package(s). They are already used beyond SQL, e.g. in ML pipelines, and will be used by streaming also.
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has reached end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
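To make the Guava item (APIs item 3) concrete: any public method that returns a third-party type, such as Guava's Optional, couples every caller to that library and its version. A minimal, hypothetical sketch of the alternative using the JDK's own java.util.Optional - the class and method names below are illustrative, not Spark's actual API:

```java
import java.util.Optional;

// Hypothetical stand-in for an API like JavaRDD's lookup-style methods.
// Before (conceptually): the signature exposed com.google.common.base.Optional,
// tying every caller to Spark's Guava version. Returning java.util.Optional
// instead leaks no third-party dependency to user code.
class LookupApi {
    static Optional<String> lookup(String key) {
        return "host".equals(key) ? Optional.of("localhost") : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(lookup("host").orElse("none")); // localhost
        System.out.println(lookup("port").orElse("none")); // none
    }
}
```

The signature change is the whole point: once the Guava type is off the public surface, Spark can upgrade or shade Guava internally without breaking user binaries.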
Re: A proposal for Spark 2.0
+1 on a lightweight 2.0

What is the thinking around the 1.x line after Spark 2.0 is released? If not terminated, how will we determine what goes into each major version line? Will 1.x only be for stability fixes?

Thanks,
Kostas

On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell wrote:
> I also feel the same as Reynold. I agree we should minimize API breaks and focus on fixing things around the edge that were mistakes (e.g. exposing Guava and Akka) rather than any overhaul that could fragment the community. Ideally a major release is a lightweight process we can do every couple of years, with minimal impact for users.
>
> - Patrick
>
> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas wrote:
>> > For this reason, I would *not* propose doing major releases to break substantial API's or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model.
>>
>> +1 for this. The Python community went through a lot of turmoil over the Python 2 -> Python 3 transition because the upgrade process was too painful for too long. The Spark community will benefit greatly from our explicitly looking to avoid a similar situation.
>>
>> > 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
>>
>> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.
>>
>> Nick
Re: A proposal for Spark 2.0
There's a proposal / discussion of the assembly-less distributions at https://github.com/vanzin/spark/pull/2/files / https://issues.apache.org/jira/browse/SPARK-11157.

On Tue, Nov 10, 2015 at 3:53 PM, Reynold Xin wrote:
> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas wrote:
>> > 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
>>
>> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.
>
> Right now we ship Spark using a single assembly jar, which causes a few different problems:
> - total number of classes are limited on some configurations
> - dependency swapping is harder
>
> The proposal is to just avoid a single fat jar.
Re: A proposal for Spark 2.0
On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas wrote:
> > 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
>
> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.

Right now we ship Spark using a single assembly jar, which causes a few different problems:

- the total number of classes is limited on some configurations
- dependency swapping is harder

The proposal is to just avoid a single fat jar.
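The fat-jar contrast above can be sketched in a few lines (all jar names below are made up for illustration): with an assembly, the classpath is one enormous artifact; without it, the launcher composes the classpath from individual module jars, so swapping a dependency becomes a file replacement rather than a rebuild.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

class ClasspathSketch {
    // Assembly-free style: compose the classpath from individual jars, so
    // swapping a dependency means replacing one file, not rebuilding a fat jar.
    static String buildClasspath(List<String> jars) {
        return String.join(File.pathSeparator, jars);
    }

    public static void main(String[] args) {
        // Assembly style: one enormous jar is the entire classpath.
        String assemblyClasspath = "spark-assembly-1.6.0-hadoop2.6.0.jar"; // made-up name
        // Assembly-free style: per-module jars plus dependencies, listed out.
        String classpath = buildClasspath(
            Arrays.asList("spark-core.jar", "spark-sql.jar", "guava-14.0.1.jar"));
        System.out.println(assemblyClasspath);
        System.out.println(classpath);
    }
}
```

This also illustrates the class-count problem: every class in every dependency has to fit inside the single assembly, whereas the jar list has no such bottleneck.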
Re: A proposal for Spark 2.0
I also feel the same as Reynold. I agree we should minimize API breaks and focus on fixing things around the edge that were mistakes (e.g. exposing Guava and Akka) rather than any overhaul that could fragment the community. Ideally a major release is a lightweight process we can do every couple of years, with minimal impact for users.

- Patrick

On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas wrote:
> > For this reason, I would *not* propose doing major releases to break substantial API's or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model.
>
> +1 for this. The Python community went through a lot of turmoil over the Python 2 -> Python 3 transition because the upgrade process was too painful for too long. The Spark community will benefit greatly from our explicitly looking to avoid a similar situation.
>
> > 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
>
> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.
>
> Nick
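Item 5 of the proposal (consolidating the task metric and accumulator APIs) is essentially about having a single code path for "values merged across tasks". A purely illustrative sketch of that idea - no real Spark classes are involved, and all names below are hypothetical:

```java
import java.util.function.LongBinaryOperator;

class MetricsSketch {
    // Toy accumulator: a zero value plus an associative merge function. In a
    // consolidated design, a built-in task metric such as bytesRead could be
    // modeled as just another accumulator instead of living on a separate
    // TaskMetrics code path.
    static class LongAcc {
        private long current;
        private final LongBinaryOperator merge;

        LongAcc(long zero, LongBinaryOperator merge) {
            this.current = zero;
            this.merge = merge;
        }

        void add(long v) { current = merge.applyAsLong(current, v); }

        long value() { return current; }
    }

    public static void main(String[] args) {
        // A user-defined counter and a "task metric" share one implementation.
        LongAcc recordsProcessed = new LongAcc(0L, Long::sum);
        LongAcc bytesRead = new LongAcc(0L, Long::sum);
        recordsProcessed.add(3L);
        bytesRead.add(1024L);
        bytesRead.add(2048L);
        System.out.println("records=" + recordsProcessed.value()
            + " bytes=" + bytesRead.value());
    }
}
```

With one merge abstraction, the subtle differences between the two APIs become configuration (zero value, merge function) rather than parallel class hierarchies.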
Re: A proposal for Spark 2.0
> For this reason, I would *not* propose doing major releases to break substantial API's or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model.

+1 for this. The Python community went through a lot of turmoil over the Python 2 -> Python 3 transition because the upgrade process was too painful for too long. The Spark community will benefit greatly from our explicitly looking to avoid a similar situation.

> 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.

Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.

Nick