Re: A proposal for Spark 2.0
How about the Hive dependency? We use the ThriftServer, serdes, and even the parser/execution logic from Hive. What is the plan for this part?
Re: A proposal for Spark 2.0
Yeah, I'd also favor maintaining docs with strictly temporary relevance on JIRA when possible. The wiki is like this weird backwater I only rarely visit. Don't we typically do this kind of stuff with an umbrella issue on JIRA? Tom, wouldn't that work well for you?

Nick

On Wed, Dec 23, 2015 at 5:06 AM Sean Owen wrote:

> I think this will be hard to maintain; we already have JIRA as the de facto central place to store discussions and prioritize work, and the 2.x stuff is already a JIRA. The wiki doesn't really hurt, it just probably will never be looked at again. Let's point people in all cases to JIRA.

On Tue, Dec 22, 2015 at 11:52 PM, Reynold Xin wrote:

> I started a wiki page: https://cwiki.apache.org/confluence/display/SPARK/Development+Discussions

On Tue, Dec 22, 2015 at 6:27 AM, Tom Graves wrote:

> Do we have a summary of all the discussions and what is planned for 2.0, then? Perhaps we should put it on the wiki for reference.
>
> Tom

On Tuesday, December 22, 2015 12:12 AM, Reynold Xin wrote:

> FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.

On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin wrote:

> I'm starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature requests in this thread. Not that we shouldn't be adding features, but we can always add features in 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break substantial APIs or perform large re-architecting that prevents users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we've released many architectural changes on the 1.x line.
>
> If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment:
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.
>
> 2. Remove Akka from Spark's API dependency (in streaming), so user applications can use Akka (SPARK-5293). We have gotten a lot of complaints about user applications being unable to use Akka due to Spark's dependency on Akka.
>
> 3. Remove Guava from Spark's public API (JavaRDD Optional).
>
> 4. Better class package structure for low-level developer APIs. In particular, we have some DeveloperApi (mostly various listener-related classes) added over the years. Some packages include only one or two public classes but a lot of private classes. A better structure is to have public classes isolated to a few public packages, and these public packages should have minimal private classes for low-level developer APIs.
>
> 5. Consolidate the task metric and accumulator APIs. Although they have some subtle differences, these two are very similar but have completely different code paths.
>
> 6. Possibly make Catalyst, Dataset, and DataFrame more general by moving them to other package(s). They are already used beyond SQL, e.g. in ML pipelines, and will be used by streaming also.
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has reached end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don't require building an enormous assembly jar in order to run Spark.
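Point 3 above refers to Java API methods (e.g. the value side of a left outer join on a JavaPairRDD) whose signatures expose Guava's com.google.common.base.Optional, which forces a particular Guava version onto every user application. A minimal sketch of what a Spark-owned replacement could look like - all class and method names here are illustrative, not the actual API that shipped:

```scala
// Sketch: a Spark-owned Optional that could replace Guava's
// com.google.common.base.Optional in public Java-facing signatures.
// "JOptional" and its methods are hypothetical names for illustration.
final class JOptional[T](private val value: Option[T]) {
  def isPresent: Boolean = value.isDefined
  def get(): T =
    value.getOrElse(throw new NoSuchElementException("value is absent"))
  def or(default: T): T = value.getOrElse(default)
  override def toString: String =
    value.fold("Optional.absent()")(v => s"Optional.of($v)")
}

object JOptional {
  def of[T](v: T): JOptional[T] = {
    require(v != null, "of() does not accept null")
    new JOptional(Some(v))
  }
  def absent[T](): JOptional[T] = new JOptional[T](None)
  // Bridge from Scala's Option, e.g. when wrapping a left-outer-join result
  // for the Java API.
  def fromOption[T](o: Option[T]): JOptional[T] = new JOptional(o)
}

// What a Java-facing outer-join value could look like:
val matched   = JOptional.fromOption(Some("payload"))
val unmatched = JOptional.fromOption(None: Option[String])
println(matched.isPresent)       // true
println(unmatched.or("default")) // default
```

The key design point is that the class lives in a Spark-controlled package, so its binary signature never changes when the user's Guava version does.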
Re: A proposal for Spark 2.0
I think this will be hard to maintain; we already have JIRA as the de facto central place to store discussions and prioritize work, and the 2.x stuff is already a JIRA. The wiki doesn't really hurt, just probably will never be looked at again. Let's point people in all cases to JIRA.

On Tue, Dec 22, 2015 at 11:52 PM, Reynold Xin wrote:

> I started a wiki page: https://cwiki.apache.org/confluence/display/SPARK/Development+Discussions

[...]
Re: A proposal for Spark 2.0
I started a wiki page: https://cwiki.apache.org/confluence/display/SPARK/Development+Discussions

On Tue, Dec 22, 2015 at 6:27 AM, Tom Graves wrote:

> Do we have a summary of all the discussions and what is planned for 2.0 then? Perhaps we should put on the wiki for reference.
>
> Tom

[...]
Re: A proposal for Spark 2.0
Do we have a summary of all the discussions and what is planned for 2.0 then? Perhaps we should put on the wiki for reference.

Tom

On Tuesday, December 22, 2015 12:12 AM, Reynold Xin wrote:

> FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.

[...]
Re: A proposal for Spark 2.0
Thanks for your quick response. OK, I will start a new thread with my thoughts.

Thanks,
Allen

At 2015-12-22 15:19:49, "Reynold Xin" wrote:

> I'm not sure if we need special API support for GPUs. You can already use GPUs on individual executor nodes to build your own applications. [...] If you want to discuss more about GPUs, please start a new thread.

[...]
Re: A proposal for Spark 2.0
I'm not sure if we need special API support for GPUs. You can already use GPUs on individual executor nodes to build your own applications. If we want to leverage GPUs out of the box, I don't think the solution is to provide GPU-specific APIs. Rather, we should just switch the underlying execution to GPUs when it is more optimal. Anyway, I don't want to distract from this topic. If you want to discuss more about GPUs, please start a new thread.

On Mon, Dec 21, 2015 at 11:18 PM, Allen Zhang wrote:

> plus dev
>
> On 2015-12-22 15:15:59, "Allen Zhang" wrote:
>
> Hi Reynold,
>
> Any new API support for GPU computing in our new 2.0 version?
>
> -Allen

[...]
Re: A proposal for Spark 2.0
plus dev

On 2015-12-22 15:15:59, "Allen Zhang" wrote:

> Hi Reynold,
>
> Any new API support for GPU computing in our new 2.0 version?
>
> -Allen

On 2015-12-22 14:12:50, "Reynold Xin" wrote:

> FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.

[...]
Re: A proposal for Spark 2.0
FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.

On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin wrote:

> I'm starting a new thread since the other one got intermixed with feature requests. [...]

[...]
Re: A proposal for Spark 2.0
Hi Kostas With regards to your *second* point. I believe that requiring from the user apps to explicitly declare their dependencies is the most clear API approach when it comes to classpath and classloading. However what about the following API: *SparkContext.addJar(String pathToJar)* . *Is this going to change or affected in someway?* Currently i use spark 1.5.2 in a Java application and i have built a utility class that finds the correct path of a Dependency (myPathOfTheJarDependency=Something like SparkUtils.getJarFullPathFromClass (EsSparkSQL.class, "^elasticsearch-hadoop-2.2.0-beta1.*\\.jar$");), Which is not something beatiful but i can live with. Then i use *javaSparkContext.addJar(myPathOfTheJarDependency)* ; after i have initiated the javaSparkContext. In that way i do not require my SparkCluster to have configuration on the classpath of my application and i explicitly define the dependencies during runtime of my app after each time i initiate a sparkContext. I would be happy and i believe many other users also if i could could continue having the same or similar approach with regards to dependencies Regards 2015-12-08 23:40 GMT+02:00 Kostas Sakellis : > I'd also like to make it a requirement that Spark 2.0 have a stable > dataframe and dataset API - we should not leave these APIs experimental in > the 2.0 release. We already know of at least one breaking change we need to > make to dataframes, now's the time to make any other changes we need to > stabilize these APIs. Anything we can do to make us feel more comfortable > about the dataset and dataframe APIs before the 2.0 release? > > I've also been thinking that in Spark 2.0, we might want to consider > strict classpath isolation for user applications. Hadoop 3 is moving in > this direction. We could, for instance, run all user applications in their > own classloader that only inherits very specific classes from Spark (ie. > public APIs). 
This will require user apps to explicitly declare their > dependencies as there won't be any accidental class leaking anymore. We do > something like this for *userClasspathFirst option but it is not as strict > as what I described. This is a breaking change but I think it will help > with eliminating weird classpath incompatibility issues between user > applications and Spark system dependencies. > > Thoughts? > > Kostas > > > On Fri, Dec 4, 2015 at 3:28 AM, Sean Owen wrote: > >> To be clear-er, I don't think it's clear yet whether a 1.7 release >> should exist or not. I could see both making sense. It's also not >> really necessary to decide now, well before a 1.6 is even out in the >> field. Deleting the version lost information, and I would not have >> done that given my reply. Reynold maybe I can take this up with you >> offline. >> >> On Thu, Dec 3, 2015 at 6:03 PM, Mark Hamstra >> wrote: >> > Reynold's post fromNov. 25: >> > >> >> I don't think we should drop support for Scala 2.10, or make it harder >> in >> >> terms of operations for people to upgrade. >> >> >> >> If there are further objections, I'm going to bump remove the 1.7 >> version >> >> and retarget things to 2.0 on JIRA. >> > >> > >> > On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen wrote: >> >> >> >> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I >> >> think that's premature. If there's a 1.7.0 then we've lost info about >> >> what it would contain. It's trivial at any later point to merge the >> >> versions. And, since things change and there's not a pressing need to >> >> decide one way or the other, it seems fine to at least collect this >> >> info like we have things like "1.4.3" that may never be released. I'd >> >> like to add it back? >> >> >> >> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen wrote: >> >> > Maintaining both a 1.7 and 2.0 is too much work for the project, >> which >> >> > is over-stretched now. 
This means that after 1.6 it's just small >> >> > maintenance releases in 1.x and no substantial features or evolution. >> >> > This means that the "in progress" APIs in 1.x that will stay that >> way, >> >> > unless one updates to 2.x. It's not unreasonable, but means the >> update >> >> > to the 2.x line isn't going to be that optional for users. >> >> > >> >> > Scala 2.10 is already EOL right? Supporting it in 2.x means >> supporting >> >> > it for a couple years, note. 2.10 is still used today, but that's the >> >> > point of the current stable 1.x release in general: if you want to >> >> > stick to current dependencies, stick to the current release. Although >> >> > I think that's the right way to think about support across major >> >> > versions in general, I can see that 2.x is more of a required update >> >> > for those following the project's fixes and releases. Hence may >> indeed >> >> > be important to just keep supporting 2.10. >> >> > >> >> > I can't see supporting 2.12 at the same time (right?). Is that a >> >> > concern? it will be long since GA by the time 2.x is first released. >> >> > >>
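The jar-locating helper described above (the poster's SparkUtils.getJarFullPathFromClass) can be sketched in a few lines. This is an assumption about how such a helper might work — using the JVM's ProtectionDomain rather than scanning the classpath with a regex:

```scala
// Hedged sketch: one way a helper like the getJarFullPathFromClass utility
// mentioned above could work. For classes loaded from a jar, the JVM records
// the jar's location in the class's ProtectionDomain.
def jarPathOfClass(cls: Class[_]): Option[String] =
  Option(cls.getProtectionDomain.getCodeSource)
    .map(_.getLocation.toURI.getPath)

// The resulting path can then be handed to sparkContext.addJar(...).
// Note: JDK bootstrap classes have no code source, so this returns None
// for e.g. classOf[String].
```

If strict classpath isolation lands in 2.0, a pattern like this keeps working as long as addJar (or its successor) still accepts a local jar path at runtime.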
Re: A proposal for Spark 2.0
I'd also like to make it a requirement that Spark 2.0 have a stable dataframe and dataset API - we should not leave these APIs experimental in the 2.0 release. We already know of at least one breaking change we need to make to dataframes, now's the time to make any other changes we need to stabilize these APIs. Anything we can do to make us feel more comfortable about the dataset and dataframe APIs before the 2.0 release? I've also been thinking that in Spark 2.0, we might want to consider strict classpath isolation for user applications. Hadoop 3 is moving in this direction. We could, for instance, run all user applications in their own classloader that only inherits very specific classes from Spark (ie. public APIs). This will require user apps to explicitly declare their dependencies as there won't be any accidental class leaking anymore. We do something like this for *userClasspathFirst option but it is not as strict as what I described. This is a breaking change but I think it will help with eliminating weird classpath incompatibility issues between user applications and Spark system dependencies. Thoughts? Kostas On Fri, Dec 4, 2015 at 3:28 AM, Sean Owen wrote: > To be clear-er, I don't think it's clear yet whether a 1.7 release > should exist or not. I could see both making sense. It's also not > really necessary to decide now, well before a 1.6 is even out in the > field. Deleting the version lost information, and I would not have > done that given my reply. Reynold maybe I can take this up with you > offline. > > On Thu, Dec 3, 2015 at 6:03 PM, Mark Hamstra > wrote: > > Reynold's post fromNov. 25: > > > >> I don't think we should drop support for Scala 2.10, or make it harder > in > >> terms of operations for people to upgrade. > >> > >> If there are further objections, I'm going to bump remove the 1.7 > version > >> and retarget things to 2.0 on JIRA. 
> > > > > > On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen wrote: > >> > >> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I > >> think that's premature. If there's a 1.7.0 then we've lost info about > >> what it would contain. It's trivial at any later point to merge the > >> versions. And, since things change and there's not a pressing need to > >> decide one way or the other, it seems fine to at least collect this > >> info like we have things like "1.4.3" that may never be released. I'd > >> like to add it back? > >> > >> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen wrote: > >> > Maintaining both a 1.7 and 2.0 is too much work for the project, which > >> > is over-stretched now. This means that after 1.6 it's just small > >> > maintenance releases in 1.x and no substantial features or evolution. > >> > This means that the "in progress" APIs in 1.x that will stay that way, > >> > unless one updates to 2.x. It's not unreasonable, but means the update > >> > to the 2.x line isn't going to be that optional for users. > >> > > >> > Scala 2.10 is already EOL right? Supporting it in 2.x means supporting > >> > it for a couple years, note. 2.10 is still used today, but that's the > >> > point of the current stable 1.x release in general: if you want to > >> > stick to current dependencies, stick to the current release. Although > >> > I think that's the right way to think about support across major > >> > versions in general, I can see that 2.x is more of a required update > >> > for those following the project's fixes and releases. Hence may indeed > >> > be important to just keep supporting 2.10. > >> > > >> > I can't see supporting 2.12 at the same time (right?). Is that a > >> > concern? it will be long since GA by the time 2.x is first released. > >> > > >> > There's another fairly coherent worldview where development continues > >> > in 1.7 and focuses on finishing the loose ends and lots of bug fixing. 
> >> > 2.0 is delayed somewhat into next year, and by that time supporting > >> > 2.11+2.12 and Java 8 looks more feasible and more in tune with > >> > currently deployed versions. > >> > > >> > I can't say I have a strong view but I personally hadn't imagined 2.x > >> > would start now. > >> > > >> > > >> > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin > >> > wrote: > >> >> I don't think we should drop support for Scala 2.10, or make it > harder > >> >> in > >> >> terms of operations for people to upgrade. > >> >> > >> >> If there are further objections, I'm going to bump remove the 1.7 > >> >> version > >> >> and retarget things to 2.0 on JIRA. > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >
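Kostas's strict-isolation idea can be sketched concretely. The class below is a minimal illustration of the mechanism, not a proposal for Spark's actual implementation: user code runs in a child classloader whose parent delegation is limited to the JDK plus an allow-list of "public API" package prefixes, so nothing else can leak across accidentally.

```scala
import java.net.{URL, URLClassLoader}

// Minimal sketch of strict classpath isolation (illustrative, not Spark code).
// Only JDK classes and allow-listed packages are delegated to the host
// classloader; everything else must be found on the child's own URLs.
class IsolatingClassLoader(
    jars: Array[URL],
    host: ClassLoader,
    allowedPrefixes: Seq[String])
  extends URLClassLoader(jars, null: ClassLoader) { // null parent = bootstrap only

  override protected def loadClass(name: String, resolve: Boolean): Class[_] =
    if (name.startsWith("java.") || allowedPrefixes.exists(name.startsWith)) {
      val c = host.loadClass(name) // delegated: part of the "public API"
      if (resolve) resolveClass(c)
      c
    } else {
      super.loadClass(name, resolve) // isolated: found on the child's URLs or not at all
    }
}
```

With an empty allow-list, even a class like `scala.Predef` is invisible to user code unless the user ships it — exactly the "no accidental class leaking" behavior described above; `spark.*.userClassPathFirst` only reorders lookup, it does not hide the host classpath like this.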
Re: A proposal for Spark 2.0
To be clear-er, I don't think it's clear yet whether a 1.7 release should exist or not. I could see both making sense. It's also not really necessary to decide now, well before a 1.6 is even out in the field. Deleting the version lost information, and I would not have done that given my reply. Reynold maybe I can take this up with you offline. On Thu, Dec 3, 2015 at 6:03 PM, Mark Hamstra wrote: > Reynold's post from Nov. 25: > >> I don't think we should drop support for Scala 2.10, or make it harder in >> terms of operations for people to upgrade. >> >> If there are further objections, I'm going to bump remove the 1.7 version >> and retarget things to 2.0 on JIRA. > > > On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen wrote: >> >> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I >> think that's premature. If there's a 1.7.0 then we've lost info about >> what it would contain. It's trivial at any later point to merge the >> versions. And, since things change and there's not a pressing need to >> decide one way or the other, it seems fine to at least collect this >> info like we have things like "1.4.3" that may never be released. I'd >> like to add it back? >> >> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen wrote: >> > Maintaining both a 1.7 and 2.0 is too much work for the project, which >> > is over-stretched now. This means that after 1.6 it's just small >> > maintenance releases in 1.x and no substantial features or evolution. >> > This means that the "in progress" APIs in 1.x that will stay that way, >> > unless one updates to 2.x. It's not unreasonable, but means the update >> > to the 2.x line isn't going to be that optional for users. >> > >> > Scala 2.10 is already EOL right? Supporting it in 2.x means supporting >> > it for a couple years, note. 2.10 is still used today, but that's the >> > point of the current stable 1.x release in general: if you want to >> > stick to current dependencies, stick to the current release. 
Although >> > I think that's the right way to think about support across major >> > versions in general, I can see that 2.x is more of a required update >> > for those following the project's fixes and releases. Hence may indeed >> > be important to just keep supporting 2.10. >> > >> > I can't see supporting 2.12 at the same time (right?). Is that a >> > concern? it will be long since GA by the time 2.x is first released. >> > >> > There's another fairly coherent worldview where development continues >> > in 1.7 and focuses on finishing the loose ends and lots of bug fixing. >> > 2.0 is delayed somewhat into next year, and by that time supporting >> > 2.11+2.12 and Java 8 looks more feasible and more in tune with >> > currently deployed versions. >> > >> > I can't say I have a strong view but I personally hadn't imagined 2.x >> > would start now. >> > >> > >> > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin >> > wrote: >> >> I don't think we should drop support for Scala 2.10, or make it harder >> >> in >> >> terms of operations for people to upgrade. >> >> >> >> If there are further objections, I'm going to bump remove the 1.7 >> >> version >> >> and retarget things to 2.0 on JIRA.
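The Scala-version question Sean raises is largely a build-matrix question. A hypothetical sbt fragment (illustrative only — Spark's actual build is Maven-based) shows what supporting several Scala lines at once looks like, and why each extra version has a carrying cost:

```scala
// Hypothetical sbt fragment -- illustrative only, not Spark's actual build.
scalaVersion := "2.11.7"                      // proposed default build for 2.0
crossScalaVersions := Seq("2.10.6", "2.11.7") // a "2.12.x" entry could be added once GA
// `sbt +package` then builds one artifact per listed Scala version; every
// extra entry in this list is another axis of the compile/test/release
// matrix, which is the maintenance cost of supporting 2.10 alongside 2.11.
```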
Re: A proposal for Spark 2.0
;> > >>>>>>> A 1.6.x release will only fix bugs - we typically don't change > APIs in > >>>>>>> z releases. The Dataset API is experimental and so we might be > changing the > >>>>>>> APIs before we declare it stable. This is why I think it is > important to > >>>>>>> first stabilize the Dataset API with a Spark 1.7 release before > moving to > >>>>>>> Spark 2.0. This will benefit users that would like to use the new > Dataset > >>>>>>> APIs but can't move to Spark 2.0 because of the backwards > incompatible > >>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc. > >>>>>>> > >>>>>>> Kostas > >>>>>>> > >>>>>>> > >>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra > >>>>>>> wrote: > >>>>>>>> > >>>>>>>> Why does stabilization of those two features require a 1.7 release > >>>>>>>> instead of 1.6.1? > >>>>>>>> > >>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis > >>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - > yes we > >>>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. > I'd like to > >>>>>>>>> propose we have one more 1.x release after Spark 1.6. This will > allow us to > >>>>>>>>> stabilize a few of the new features that were added in 1.6: > >>>>>>>>> > >>>>>>>>> 1) the experimental Datasets API > >>>>>>>>> 2) the new unified memory manager. > >>>>>>>>> > >>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy > transition > >>>>>>>>> but there will be users that won't be able to seamlessly upgrade > given what > >>>>>>>>> we have discussed as in scope for 2.0. For these users, having a > 1.x release > >>>>>>>>> with these new features/APIs stabilized will be very beneficial. > This might > >>>>>>>>> make Spark 1.7 a lighter release but that is not necessarily a > bad thing. > >>>>>>>>> > >>>>>>>>> Any thoughts on this timeline? 
> >>>>>>>>> > >>>>>>>>> Kostas Sakellis > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao > > >>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>> Agree, more features/apis/optimization need to be added in > DF/DS. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to > >>>>>>>>>> provide to developer, maybe the fundamental API is enough, > like, the > >>>>>>>>>> ShuffledRDD etc.. But PairRDDFunctions probably not in this > category, as we > >>>>>>>>>> can do the same thing easily with DF/DS, even better > performance. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> From: Mark Hamstra [mailto:m...@clearstorydata.com] > >>>>>>>>>> Sent: Friday, November 13, 2015 11:23 AM > >>>>>>>>>> To: Stephen Boesch > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Cc: dev@spark.apache.org > >>>>>>>>>> Subject: Re: A proposal for Spark 2.0 > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that > >>>>>>>>>> argues for retaining the RDD API but not
Re: A proposal for Spark 2.0
tabricks.com)> > > > >>>> wrote: > > > >>>>> > > > >>>>> I actually think the next one (after 1.6) should be Spark 2.0. The > > > >>>>> reason is that I already know we have to break some part of the > > > >>>>> DataFrame/Dataset API as part of the Dataset design. (e.g. > > > >>>>> DataFrame.map > > > >>>>> should return Dataset rather than RDD). In that case, I'd rather > > > >>>>> break this > > > >>>>> sooner (in one release) than later (in two releases). so the damage > > > >>>>> is > > > >>>>> smaller. > > > >>>>> > > > >>>>> I don't think whether we call Dataset/DataFrame experimental or not > > > >>>>> matters too much for 2.0. We can still call Dataset experimental in > > > >>>>> 2.0 and > > > >>>>> then mark them as stable in 2.1. Despite being "experimental", > > > >>>>> there has > > > >>>>> been no breaking changes to DataFrame from 1.3 to 1.6. > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra > > > >>>>> mailto:m...@clearstorydata.com)> > > > >>>>> wrote: > > > >>>>>> > > > >>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug > > > >>>>>> fixing. We're on the same page now. > > > >>>>>> > > > >>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis > > > >>>>>> mailto:kos...@cloudera.com)> > > > >>>>>> wrote: > > > >>>>>>> > > > >>>>>>> A 1.6.x release will only fix bugs - we typically don't change > > > >>>>>>> APIs in > > > >>>>>>> z releases. The Dataset API is experimental and so we might be > > > >>>>>>> changing the > > > >>>>>>> APIs before we declare it stable. This is why I think it is > > > >>>>>>> important to > > > >>>>>>> first stabilize the Dataset API with a Spark 1.7 release before > > > >>>>>>> moving to > > > >>>>>>> Spark 2.0. This will benefit users that would like to use the new > > > >>>>>>> Dataset > > > >>>>>>> APIs but can't move to Spark 2.0 because of the backwards > > > >>>>>>> incompatible > > > >>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc. 
> > > >>>>>>> > > > >>>>>>> Kostas > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra > > > >>>>>>> mailto:m...@clearstorydata.com)> wrote: > > > >>>>>>>> > > > >>>>>>>> Why does stabilization of those two features require a 1.7 > > > >>>>>>>> release > > > >>>>>>>> instead of 1.6.1? > > > >>>>>>>> > > > >>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis > > > >>>>>>>> mailto:kos...@cloudera.com)> wrote: > > > >>>>>>>>> > > > >>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - > > > >>>>>>>>> yes we > > > >>>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark > > > >>>>>>>>> 2.0. I'd like to > > > >>>>>>>>> propose we have one more 1.x release after Spark 1.6. This will > > > >>>>>>>>> allow us to > > > >>>>>>>>> stabilize a few of the new features that were added in 1.6: > > > >>>>>>>>> > > > >>>>>>>>> 1) the experimental Datasets API > > > >>>>>>>>&g
Re: A proposal for Spark 2.0
king changes in 2.0 though? Note that we're >> not >> >>>> removing Scala 2.10, we're just making the default build be against >> Scala >> >>>> 2.11 instead of 2.10. There seem to be very few changes that people >> would >> >>>> worry about. If people are going to update their apps, I think it's >> better >> >>>> to make the other small changes in 2.0 at the same time than to >> update once >> >>>> for Dataset and another time for 2.0. >> >>>> >> >>>> BTW just refer to Reynold's original post for the other proposed API >> >>>> changes. >> >>>> >> >>>> Matei >> >>>> >> >>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza >> wrote: >> >>>> >> >>>> I think that Kostas' logic still holds. The majority of Spark >> users, and >> >>>> likely an even vaster majority of people running vaster jobs, are >> still on >> >>>> RDDs and on the cusp of upgrading to DataFrames. Users will >> probably want >> >>>> to upgrade to the stable version of the Dataset / DataFrame API so >> they >> >>>> don't need to do so twice. Requiring that they absorb all the other >> ways >> >>>> that Spark breaks compatibility in the move to 2.0 makes it much more >> >>>> difficult for them to make this transition. >> >>>> >> >>>> Using the same set of APIs also means that it will be easier to >> backport >> >>>> critical fixes to the 1.x line. >> >>>> >> >>>> It's not clear to me that avoiding breakage of an experimental API >> in the >> >>>> 1.x line outweighs these issues. >> >>>> >> >>>> -Sandy >> >>>> >> >>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin >> >>>> wrote: >> >>>>> >> >>>>> I actually think the next one (after 1.6) should be Spark 2.0. The >> >>>>> reason is that I already know we have to break some part of the >> >>>>> DataFrame/Dataset API as part of the Dataset design. (e.g. >> DataFrame.map >> >>>>> should return Dataset rather than RDD). In that case, I'd rather >> break this >> >>>>> sooner (in one release) than later (in two releases). so the damage >> is >> >>>>> smaller. 
>> >>>>> >> >>>>> I don't think whether we call Dataset/DataFrame experimental or not >> >>>>> matters too much for 2.0. We can still call Dataset experimental in >> 2.0 and >> >>>>> then mark them as stable in 2.1. Despite being "experimental", >> there has >> >>>>> been no breaking changes to DataFrame from 1.3 to 1.6. >> >>>>> >> >>>>> >> >>>>> >> >>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra < >> m...@clearstorydata.com> >> >>>>> wrote: >> >>>>>> >> >>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug >> >>>>>> fixing. We're on the same page now. >> >>>>>> >> >>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis < >> kos...@cloudera.com> >> >>>>>> wrote: >> >>>>>>> >> >>>>>>> A 1.6.x release will only fix bugs - we typically don't change >> APIs in >> >>>>>>> z releases. The Dataset API is experimental and so we might be >> changing the >> >>>>>>> APIs before we declare it stable. This is why I think it is >> important to >> >>>>>>> first stabilize the Dataset API with a Spark 1.7 release before >> moving to >> >>>>>>> Spark 2.0. This will benefit users that would like to use the new >> Dataset >> >>>>>>> APIs but can't move to Spark 2.0 because of the backwards >> incompatible >> >>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc. >> >>>>>>> >> >>>>>>> Kostas >> >>>>>>> >> >>>>>>> >> >>>>&g
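The breaking change Reynold refers to is essentially a signature change. A self-contained toy (plain Scala, no Spark dependency; `DSet` and `RddLike` are stand-in names, not Spark classes) shows why returning a Dataset keeps a pipeline typed, where the 1.x signature drops you out of the DataFrame/Dataset world:

```scala
// Toy stand-ins, not Spark classes. In 1.x, DataFrame.map returns RDD[R],
// leaving the DataFrame/Dataset world; the 2.0 design keeps map inside it.
final case class RddLike[T](rows: Seq[T])   // stand-in for RDD[T]
final case class DSet[T](rows: Seq[T]) {    // stand-in for Dataset[T]
  // 1.x-style: map escapes to the RDD world
  def mapOld[R](f: T => R): RddLike[R] = RddLike(rows.map(f))
  // 2.0-style: map stays a Dataset, so later relational ops still apply
  def map[U](f: T => U): DSet[U] = DSet(rows.map(f))
  def filter(p: T => Boolean): DSet[T] = DSet(rows.filter(p))
}
```

With the 2.0-style signature, `ds.map(f).filter(p)` stays inside the Dataset abstraction; with the 1.x-style one, the `.filter` lands on an RDD and anything downstream is invisible to the query optimizer — which is why the change is worth making once, in a major release.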
Re: A proposal for Spark 2.0
gt; On Nov 24, 2015, at 12:27 PM, Sandy Ryza > wrote: > >>>> > >>>> I think that Kostas' logic still holds. The majority of Spark users, > and > >>>> likely an even vaster majority of people running vaster jobs, are > still on > >>>> RDDs and on the cusp of upgrading to DataFrames. Users will probably > want > >>>> to upgrade to the stable version of the Dataset / DataFrame API so > they > >>>> don't need to do so twice. Requiring that they absorb all the other > ways > >>>> that Spark breaks compatibility in the move to 2.0 makes it much more > >>>> difficult for them to make this transition. > >>>> > >>>> Using the same set of APIs also means that it will be easier to > backport > >>>> critical fixes to the 1.x line. > >>>> > >>>> It's not clear to me that avoiding breakage of an experimental API in > the > >>>> 1.x line outweighs these issues. > >>>> > >>>> -Sandy > >>>> > >>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin > >>>> wrote: > >>>>> > >>>>> I actually think the next one (after 1.6) should be Spark 2.0. The > >>>>> reason is that I already know we have to break some part of the > >>>>> DataFrame/Dataset API as part of the Dataset design. (e.g. > DataFrame.map > >>>>> should return Dataset rather than RDD). In that case, I'd rather > break this > >>>>> sooner (in one release) than later (in two releases). so the damage > is > >>>>> smaller. > >>>>> > >>>>> I don't think whether we call Dataset/DataFrame experimental or not > >>>>> matters too much for 2.0. We can still call Dataset experimental in > 2.0 and > >>>>> then mark them as stable in 2.1. Despite being "experimental", there > has > >>>>> been no breaking changes to DataFrame from 1.3 to 1.6. > >>>>> > >>>>> > >>>>> > >>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra < > m...@clearstorydata.com> > >>>>> wrote: > >>>>>> > >>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug > >>>>>> fixing. We're on the same page now. 
> >>>>>> > >>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis < > kos...@cloudera.com> > >>>>>> wrote: > >>>>>>> > >>>>>>> A 1.6.x release will only fix bugs - we typically don't change > APIs in > >>>>>>> z releases. The Dataset API is experimental and so we might be > changing the > >>>>>>> APIs before we declare it stable. This is why I think it is > important to > >>>>>>> first stabilize the Dataset API with a Spark 1.7 release before > moving to > >>>>>>> Spark 2.0. This will benefit users that would like to use the new > Dataset > >>>>>>> APIs but can't move to Spark 2.0 because of the backwards > incompatible > >>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc. > >>>>>>> > >>>>>>> Kostas > >>>>>>> > >>>>>>> > >>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra > >>>>>>> wrote: > >>>>>>>> > >>>>>>>> Why does stabilization of those two features require a 1.7 release > >>>>>>>> instead of 1.6.1? > >>>>>>>> > >>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis > >>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - > yes we > >>>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. > I'd like to > >>>>>>>>> propose we have one more 1.x release after Spark 1.6. This will > allow us to > >>>>>>>>> stabilize a few of the new features that were added in 1.6: > >>>>>>>>> > >>>>>>>>> 1) the experimental Datasets API > >>>>>>>>> 2)
Re: A proposal for Spark 2.0
Pardon for tacking on one more message to this thread, but I'm reminded of one more issue when building the RC today: Scala 2.10 does not in general try to work with Java 8, and indeed I can never fully compile it with Java 8 on Ubuntu or OS X, due to scalac assertion errors. 2.11 is the first that's supposed to work with Java 8. This may be a good reason to drop 2.10 by the time this comes up. On Thu, Nov 26, 2015 at 8:59 PM, Koert Kuipers wrote: > I also thought the idea was to drop 2.10. Do we want to cross build for 3 > scala versions? >
Re: A proposal for Spark 2.0
. >>>> >>>> It's not clear to me that avoiding breakage of an experimental API in the >>>> 1.x line outweighs these issues. >>>> >>>> -Sandy >>>> >>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin >>>> wrote: >>>>> >>>>> I actually think the next one (after 1.6) should be Spark 2.0. The >>>>> reason is that I already know we have to break some part of the >>>>> DataFrame/Dataset API as part of the Dataset design. (e.g. DataFrame.map >>>>> should return Dataset rather than RDD). In that case, I'd rather break >>>>> this >>>>> sooner (in one release) than later (in two releases). so the damage is >>>>> smaller. >>>>> >>>>> I don't think whether we call Dataset/DataFrame experimental or not >>>>> matters too much for 2.0. We can still call Dataset experimental in 2.0 >>>>> and >>>>> then mark them as stable in 2.1. Despite being "experimental", there has >>>>> been no breaking changes to DataFrame from 1.3 to 1.6. >>>>> >>>>> >>>>> >>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra >>>>> wrote: >>>>>> >>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug >>>>>> fixing. We're on the same page now. >>>>>> >>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis >>>>>> wrote: >>>>>>> >>>>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in >>>>>>> z releases. The Dataset API is experimental and so we might be changing >>>>>>> the >>>>>>> APIs before we declare it stable. This is why I think it is important to >>>>>>> first stabilize the Dataset API with a Spark 1.7 release before moving >>>>>>> to >>>>>>> Spark 2.0. This will benefit users that would like to use the new >>>>>>> Dataset >>>>>>> APIs but can't move to Spark 2.0 because of the backwards incompatible >>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc. 
>>>>>>> >>>>>>> Kostas >>>>>>> >>>>>>> >>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra >>>>>>> wrote: >>>>>>>> >>>>>>>> Why does stabilization of those two features require a 1.7 release >>>>>>>> instead of 1.6.1? >>>>>>>> >>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes we >>>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd >>>>>>>>> like to >>>>>>>>> propose we have one more 1.x release after Spark 1.6. This will allow >>>>>>>>> us to >>>>>>>>> stabilize a few of the new features that were added in 1.6: >>>>>>>>> >>>>>>>>> 1) the experimental Datasets API >>>>>>>>> 2) the new unified memory manager. >>>>>>>>> >>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition >>>>>>>>> but there will be users that won't be able to seamlessly upgrade >>>>>>>>> given what >>>>>>>>> we have discussed as in scope for 2.0. For these users, having a 1.x >>>>>>>>> release >>>>>>>>> with these new features/APIs stabilized will be very beneficial. This >>>>>>>>> might >>>>>>>>> make Spark 1.7 a lighter release but that is not necessarily a bad >>>>>>>>> thing. >>>>>>>>> >>>>>>>>> Any thoughts on this timeline? >>>>>>>>> >>>>>>>>> Kostas Sakellis >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Nov 12, 2015 at
Re: A proposal for Spark 2.0
, Nov 13, 2015 at 12:26 PM, Mark Hamstra < >>>>>> m...@clearstorydata.com> wrote: >>>>>> >>>>>>> Why does stabilization of those two features require a 1.7 release >>>>>>> instead of 1.6.1? >>>>>>> >>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis < >>>>>>> kos...@cloudera.com> wrote: >>>>>>> >>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes >>>>>>>> we can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd >>>>>>>> like to propose we have one more 1.x release after Spark 1.6. This will >>>>>>>> allow us to stabilize a few of the new features that were added in 1.6: >>>>>>>> >>>>>>>> 1) the experimental Datasets API >>>>>>>> 2) the new unified memory manager. >>>>>>>> >>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition >>>>>>>> but there will be users that won't be able to seamlessly upgrade given >>>>>>>> what >>>>>>>> we have discussed as in scope for 2.0. For these users, having a 1.x >>>>>>>> release with these new features/APIs stabilized will be very >>>>>>>> beneficial. >>>>>>>> This might make Spark 1.7 a lighter release but that is not >>>>>>>> necessarily a >>>>>>>> bad thing. >>>>>>>> >>>>>>>> Any thoughts on this timeline? >>>>>>>> >>>>>>>> Kostas Sakellis >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Agree, more features/apis/optimization need to be added in DF/DS. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to >>>>>>>>> provide to developer, maybe the fundamental API is enough, like, the >>>>>>>>> ShuffledRDD etc.. But PairRDDFunctions probably not in this >>>>>>>>> category, as >>>>>>>>> we can do the same thing easily with DF/DS, even better performance. 
>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> *From:* Mark Hamstra [mailto:m...@clearstorydata.com] >>>>>>>>> *Sent:* Friday, November 13, 2015 11:23 AM >>>>>>>>> *To:* Stephen Boesch >>>>>>>>> >>>>>>>>> *Cc:* dev@spark.apache.org >>>>>>>>> *Subject:* Re: A proposal for Spark 2.0 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that >>>>>>>>> argues for retaining the RDD API but not as the first thing presented >>>>>>>>> to >>>>>>>>> new Spark developers: "Here's how to use groupBy with DataFrames >>>>>>>>> Until >>>>>>>>> the optimizer is more fully developed, that won't always get you the >>>>>>>>> best >>>>>>>>> performance that can be obtained. In these particular circumstances, >>>>>>>>> ..., >>>>>>>>> you may want to use the low-level RDD API while setting >>>>>>>>> preservesPartitioning to true. Like this" >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> My understanding is that the RDD's presently have more support >>>>>>>>> for complete control of partitioning which is a key consideration at >>>>>>>>> scale. While partitioning control is still piecemeal in DF/DS it >
Re: A proposal for Spark 2.0
> On 25 Nov 2015, at 08:54, Sandy Ryza wrote:
>
> I see. My concern is / was that cluster operators will be reluctant to upgrade to 2.0, meaning that developers using those clusters need to stay on 1.x, and, if they want to move to DataFrames, essentially need to port their app twice.
>
> I misunderstood and thought part of the proposal was to drop support for 2.10 though. If your broad point is that there aren't changes in 2.0 that will make it less palatable to cluster administrators than releases in the 1.x line, then yes, 2.0 as the next release sounds fine to me.
>
> -Sandy

Mixing Spark versions in a YARN cluster with compatible Hadoop native libs isn't so hard: users just deploy them separately. But:
- mixing Scala versions is going to be tricky unless the jobs people submit are configured with the different paths
- the history server will need to be the latest Spark version being executed in the cluster
Re: A proposal for Spark 2.0
What are the other breaking changes in 2.0 though? Note that we're not removing Scala 2.10, we're just making the default build be against Scala 2.11 instead of 2.10. There seem to be very few changes that people would worry about. If people are going to update their apps, I think it's better to make the other small changes in 2.0 at the same time than to update once for Dataset and another time for 2.0.

BTW just refer to Reynold's original post for the other proposed API changes.

Matei

> On Nov 24, 2015, at 12:27 PM, Sandy Ryza wrote:
>
> I think that Kostas' logic still holds. [...]
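For reference, the build switch Matei alludes to looked roughly like this in that era's "Building Spark" documentation (commands and the `-Dscala-2.10` property are quoted from memory of the 1.6/2.0-preview docs — verify against your actual checkout before relying on them):

```shell
# Assumed 1.6/2.0-era commands: rewrite the POMs for Scala 2.10,
# then build with the matching Maven property.
./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Dscala-2.10 -DskipTests clean package
```

With 2.0 the default simply flips: the stock build targets Scala 2.11, and 2.10 becomes the opt-in variant.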
Re: A proposal for Spark 2.0
I think that Kostas' logic still holds. The majority of Spark users, and likely an even vaster majority of people running vaster jobs, are still on RDDs and on the cusp of upgrading to DataFrames. Users will probably want to upgrade to the stable version of the Dataset / DataFrame API so they don't need to do so twice. Requiring that they absorb all the other ways that Spark breaks compatibility in the move to 2.0 makes it much more difficult for them to make this transition.

Using the same set of APIs also means that it will be easier to backport critical fixes to the 1.x line.

It's not clear to me that avoiding breakage of an experimental API in the 1.x line outweighs these issues.

-Sandy

On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin wrote:
> I actually think the next one (after 1.6) should be Spark 2.0. [...]
Re: A proposal for Spark 2.0
I actually think the next one (after 1.6) should be Spark 2.0. The reason is that I already know we have to break some part of the DataFrame/Dataset API as part of the Dataset design (e.g. DataFrame.map should return Dataset rather than RDD). In that case, I'd rather break this sooner (in one release) than later (in two releases), so the damage is smaller.

I don't think whether we call Dataset/DataFrame experimental or not matters too much for 2.0. We can still call Dataset experimental in 2.0 and then mark them as stable in 2.1. Despite being "experimental", there have been no breaking changes to DataFrame from 1.3 to 1.6.

On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra wrote:
> Ah, got it; by "stabilize" you meant changing the API, not just bug fixing. [...]
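The API change Reynold describes — map on a DataFrame returning a Dataset rather than dropping down to an RDD — can be sketched with a toy model in plain Python (stand-in classes, not the real Spark API):

```python
# Toy stand-ins for Spark's high-level and low-level collection types.
class RDD:
    """What 1.x's DataFrame.map returned: the caller falls out of the
    optimized, typed API and into the low-level one."""
    def __init__(self, rows):
        self.rows = list(rows)

class Dataset:
    """2.0-style: map returns another Dataset, so transformation chains
    stay inside one API that the optimizer can see end to end."""
    def __init__(self, rows):
        self.rows = list(rows)

    def map(self, f):
        return Dataset(f(r) for r in self.rows)   # Dataset, not RDD

    def filter(self, p):
        return Dataset(r for r in self.rows if p(r))

# Chaining works because every step yields the same high-level type.
ds = Dataset([1, 2, 3]).map(lambda x: x * 10).filter(lambda x: x > 10)
assert isinstance(ds, Dataset)
assert ds.rows == [20, 30]
```

The point of breaking this in one release is visible in the sketch: changing map's return type changes what every downstream call in a user's chain compiles against, so doing it twice (once in 1.x, again in 2.0) would break the same code twice.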
Re: A proposal for Spark 2.0
Ah, got it; by "stabilize" you meant changing the API, not just bug fixing. We're on the same page now.

On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis wrote:
> A 1.6.x release will only fix bugs - we typically don't change APIs in z releases. [...]
Re: A proposal for Spark 2.0
A 1.6.x release will only fix bugs - we typically don't change APIs in z releases. The Dataset API is experimental and so we might be changing the APIs before we declare it stable. This is why I think it is important to first stabilize the Dataset API with a Spark 1.7 release before moving to Spark 2.0. This will benefit users that would like to use the new Dataset APIs but can't move to Spark 2.0 because of the backwards incompatible changes, like removal of deprecated APIs, Scala 2.11 etc. Kostas On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra wrote: > Why does stabilization of those two features require a 1.7 release instead > of 1.6.1? > > On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis > wrote: > >> We have veered off the topic of Spark 2.0 a little bit here - yes we can >> talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like to >> propose we have one more 1.x release after Spark 1.6. This will allow us to >> stabilize a few of the new features that were added in 1.6: >> >> 1) the experimental Datasets API >> 2) the new unified memory manager. >> >> I understand our goal for Spark 2.0 is to offer an easy transition but >> there will be users that won't be able to seamlessly upgrade given what we >> have discussed as in scope for 2.0. For these users, having a 1.x release >> with these new features/APIs stabilized will be very beneficial. This might >> make Spark 1.7 a lighter release but that is not necessarily a bad thing. >> >> Any thoughts on this timeline? >> >> Kostas Sakellis >> >> >> >> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao wrote: >> >>> Agree, more features/apis/optimization need to be added in DF/DS. >>> >>> >>> >>> I mean, we need to think about what kind of RDD APIs we have to provide >>> to developer, maybe the fundamental API is enough, like, the ShuffledRDD >>> etc.. But PairRDDFunctions probably not in this category, as we can do the >>> same thing easily with DF/DS, even better performance. 
>>> >>> >>> >>> *From:* Mark Hamstra [mailto:m...@clearstorydata.com] >>> *Sent:* Friday, November 13, 2015 11:23 AM >>> *To:* Stephen Boesch >>> >>> *Cc:* dev@spark.apache.org >>> *Subject:* Re: A proposal for Spark 2.0 >>> >>> >>> >>> Hmmm... to me, that seems like precisely the kind of thing that argues >>> for retaining the RDD API but not as the first thing presented to new Spark >>> developers: "Here's how to use groupBy with DataFrames Until the >>> optimizer is more fully developed, that won't always get you the best >>> performance that can be obtained. In these particular circumstances, ..., >>> you may want to use the low-level RDD API while setting >>> preservesPartitioning to true. Like this" >>> >>> >>> >>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch >>> wrote: >>> >>> My understanding is that the RDD's presently have more support for >>> complete control of partitioning which is a key consideration at scale. >>> While partitioning control is still piecemeal in DF/DS it would seem >>> premature to make RDD's a second-tier approach to spark dev. >>> >>> >>> >>> An example is the use of groupBy when we know that the source relation >>> (/RDD) is already partitioned on the grouping expressions. AFAIK the spark >>> sql still does not allow that knowledge to be applied to the optimizer - so >>> a full shuffle will be performed. However in the native RDD we can use >>> preservesPartitioning=true. >>> >>> >>> >>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra : >>> >>> The place of the RDD API in 2.0 is also something I've been wondering >>> about. I think it may be going too far to deprecate it, but changing >>> emphasis is something that we might consider. The RDD API came well before >>> DataFrames and DataSets, so programming guides, introductory how-to >>> articles and the like have, to this point, also tended to emphasize RDDs -- >>> or at least to deal with them early. 
What I'm thinking is that with 2.0 >>> maybe we should overhaul all the documentation to de-emphasize and >>> reposition RDDs. In this scheme, DataFrames and DataSets would be >>> introduced and fully addressed before RDDs. They would be presented as the >>> normal/default/standard way to do things in Spark. RDDs, in contrast, >>> would be presented later as a kind of lower-level, closer-to-the-metal API >>> that can be used in atypical, more specialized contexts where DataFrames or >>> DataSets don't fully fit.
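Kostas's stabilization argument can be made concrete with a short sketch of the 1.6-era experimental Dataset API. This is a hypothetical minimal program (the class and value names are illustrative; `toDS()` and the typed `filter` are as they shipped in 1.6, but exact experimental signatures were still subject to change, which is precisely why a stabilizing 1.7 is proposed):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical example record type.
case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dataset-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // toDS() arrived with the experimental Dataset API in 1.6.
    // Unlike a DataFrame expression string, the lambda below is
    // checked at compile time against the Person type.
    val people = Seq(Person("ann", 34), Person("bob", 17)).toDS()
    val adults = people.filter(_.age >= 18)
    adults.show()

    sc.stop()
  }
}
```

Users writing code like this against 1.6 are exactly the ones who would benefit from a 1.7 that stabilizes the API before the 2.0 breaking changes land.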
Re: A proposal for Spark 2.0
Hey Matei, > Regarding Scala 2.12, we should definitely support it eventually, but I > don't think we need to block 2.0 on that because it can be added later too. > Has anyone investigated what it would take to run on there? I imagine we > don't need many code changes, just maybe some REPL stuff. Our REPL-specific changes were merged in scala/scala and are available as part of 2.11.7, and hopefully will be part of 2.12 too. If I am not wrong, the REPL stuff is taken care of; we don't need to keep upgrading REPL code for every Scala release now. http://www.scala-lang.org/news/2.11.7 I am +1 on the proposal for Spark 2.0. Thanks, Prashant Sharma On Thu, Nov 12, 2015 at 3:02 AM, Matei Zaharia wrote: > I like the idea of popping out Tachyon to an optional component too to > reduce the number of dependencies. In the future, it might even be useful > to do this for Hadoop, but it requires too many API changes to be worth > doing now. > > Regarding Scala 2.12, we should definitely support it eventually, but I > don't think we need to block 2.0 on that because it can be added later too. > Has anyone investigated what it would take to run on there? I imagine we > don't need many code changes, just maybe some REPL stuff. > > Needless to say, but I'm all for the idea of making "major" releases as > undisruptive as possible in the model Reynold proposed. Keeping everyone > working with the same set of releases is super important. > > Matei > > > On Nov 11, 2015, at 4:58 AM, Sean Owen wrote: > > > > On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin > wrote: > >> to the Spark community. A major release should not be very different > from a > >> minor release and should not be gated based on new features. The main > >> purpose of a major release is an opportunity to fix things that are > broken > >> in the current API and remove certain deprecated APIs (examples follow). > > > > Agree with this stance.
Generally, a major release might also be a > > time to replace some big old API or implementation with a new one, but > > I don't see obvious candidates. > > > > I wouldn't mind turning attention to 2.x sooner than later, unless > > there's a fairly good reason to continue adding features in 1.x to a > > 1.7 release. The scope as of 1.6 is already pretty darned big. > > > > > >> 1. Scala 2.11 as the default build. We should still support Scala 2.10, > but > >> it has been end-of-life. > > > > By the time 2.x rolls around, 2.12 will be the main version, 2.11 will > > be quite stable, and 2.10 will have been EOL for a while. I'd propose > > dropping 2.10. Otherwise it's supported for 2 more years. > > > > > >> 2. Remove Hadoop 1 support. > > > > I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were > > sort of 'alpha' and 'beta' releases) and even <2.6. > > > > I'm sure we'll think of a number of other small things -- shading a > > bunch of stuff? reviewing and updating dependencies in light of > > simpler, more recent dependencies to support from Hadoop etc? > > > > Farming out Tachyon to a module? (I felt like someone proposed this?) > > Pop out any Docker stuff to another repo? > > Continue that same effort for EC2? > > Farming out some of the "external" integrations to another repo (? > > controversial) > > > > See also anything marked version "2+" in JIRA. > > > > - > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > > For additional commands, e-mail: dev-h...@spark.apache.org > > > > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >
Re: A proposal for Spark 2.0
Producing new x.0 releases of open source projects is a recurrent problem: too radical a change means the old version keeps getting updated anyway (Python 3), and an incompatible version stops take-up (for example, Log4j 2 dropping support for log4j.properties files). Similarly, any radical new feature tends to push out release times longer than you think (Hadoop 2). I think the lessons I'd draw from those and others are: keep an x.0 version as compatible as possible so that everyone can move, and ship fast. You want to be able to retire the 1.x line. And how do you ship fast? Keep the features down. For anyone planning anything radical, a branch with a clear plan/schedule to be merged in is probably the best strategy. I actually think the Firefox process is the best here, and that it should have been adopted more in Hadoop; ongoing work is going in in branches for some things (erasure coding, IPv6), but there's still pressure to define the release schedule on feature completeness. https://wiki.mozilla.org/Release_Management/Release_Process See also JDD's article on evolution vs. revolution in OSS; 15 years old but still valid. At the time, the Jakarta project was the equivalent of the ASF Hadoop/big-data stack, and indeed its traces run through the code and the build & test process if you know what to look for: http://incubator.apache.org/learn/rules-for-revolutionaries.html -Steve
Re: A proposal for Spark 2.0
Why does stabilization of those two features require a 1.7 release instead of 1.6.1? On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis wrote: > We have veered off the topic of Spark 2.0 a little bit here - yes we can > talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like to > propose we have one more 1.x release after Spark 1.6. This will allow us to > stabilize a few of the new features that were added in 1.6: > > 1) the experimental Datasets API > 2) the new unified memory manager. > > I understand our goal for Spark 2.0 is to offer an easy transition but > there will be users that won't be able to seamlessly upgrade given what we > have discussed as in scope for 2.0. For these users, having a 1.x release > with these new features/APIs stabilized will be very beneficial. This might > make Spark 1.7 a lighter release but that is not necessarily a bad thing. > > Any thoughts on this timeline? > > Kostas Sakellis > > > > On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao wrote: > >> Agree, more features/apis/optimization need to be added in DF/DS. >> >> >> >> I mean, we need to think about what kind of RDD APIs we have to provide >> to developer, maybe the fundamental API is enough, like, the ShuffledRDD >> etc.. But PairRDDFunctions probably not in this category, as we can do the >> same thing easily with DF/DS, even better performance. >> >> >> >> *From:* Mark Hamstra [mailto:m...@clearstorydata.com] >> *Sent:* Friday, November 13, 2015 11:23 AM >> *To:* Stephen Boesch >> >> *Cc:* dev@spark.apache.org >> *Subject:* Re: A proposal for Spark 2.0 >> >> >> >> Hmmm... to me, that seems like precisely the kind of thing that argues >> for retaining the RDD API but not as the first thing presented to new Spark >> developers: "Here's how to use groupBy with DataFrames Until the >> optimizer is more fully developed, that won't always get you the best >> performance that can be obtained. 
In these particular circumstances, ..., >> you may want to use the low-level RDD API while setting >> preservesPartitioning to true. Like this" >> >> >> >> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch >> wrote: >> >> My understanding is that the RDD's presently have more support for >> complete control of partitioning which is a key consideration at scale. >> While partitioning control is still piecemeal in DF/DS it would seem >> premature to make RDD's a second-tier approach to spark dev. >> >> >> >> An example is the use of groupBy when we know that the source relation >> (/RDD) is already partitioned on the grouping expressions. AFAIK the spark >> sql still does not allow that knowledge to be applied to the optimizer - so >> a full shuffle will be performed. However in the native RDD we can use >> preservesPartitioning=true. >> >> >> >> 2015-11-12 17:42 GMT-08:00 Mark Hamstra : >> >> The place of the RDD API in 2.0 is also something I've been wondering >> about. I think it may be going too far to deprecate it, but changing >> emphasis is something that we might consider. The RDD API came well before >> DataFrames and DataSets, so programming guides, introductory how-to >> articles and the like have, to this point, also tended to emphasize RDDs -- >> or at least to deal with them early. What I'm thinking is that with 2.0 >> maybe we should overhaul all the documentation to de-emphasize and >> reposition RDDs. In this scheme, DataFrames and DataSets would be >> introduced and fully addressed before RDDs. They would be presented as the >> normal/default/standard way to do things in Spark. RDDs, in contrast, >> would be presented later as a kind of lower-level, closer-to-the-metal API >> that can be used in atypical, more specialized contexts where DataFrames or >> DataSets don't fully fit. 
>> >> >> >> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao wrote: >> >> I am not sure what the best practice for this specific problem, but it’s >> really worth to think about it in 2.0, as it is a painful issue for lots of >> users. >> >> >> >> By the way, is it also an opportunity to deprecate the RDD API (or >> internal API only?)? As lots of its functionality overlapping with >> DataFrame or DataSet. >> >> >> >> Hao >> >> >> >> *From:* Kostas Sakellis [mailto:kos...@cloudera.com] >> *Sent:* Friday, November 13, 2015 5:27 AM >> *To:* Nicholas Chammas >> *Cc:* Ulanov,
Re: A proposal for Spark 2.0
We have veered off the topic of Spark 2.0 a little bit here - yes we can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like to propose we have one more 1.x release after Spark 1.6. This will allow us to stabilize a few of the new features that were added in 1.6: 1) the experimental Datasets API 2) the new unified memory manager. I understand our goal for Spark 2.0 is to offer an easy transition but there will be users that won't be able to seamlessly upgrade given what we have discussed as in scope for 2.0. For these users, having a 1.x release with these new features/APIs stabilized will be very beneficial. This might make Spark 1.7 a lighter release but that is not necessarily a bad thing. Any thoughts on this timeline? Kostas Sakellis On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao wrote: > Agree, more features/apis/optimization need to be added in DF/DS. > > > > I mean, we need to think about what kind of RDD APIs we have to provide to > developer, maybe the fundamental API is enough, like, the ShuffledRDD > etc.. But PairRDDFunctions probably not in this category, as we can do the > same thing easily with DF/DS, even better performance. > > > > *From:* Mark Hamstra [mailto:m...@clearstorydata.com] > *Sent:* Friday, November 13, 2015 11:23 AM > *To:* Stephen Boesch > > *Cc:* dev@spark.apache.org > *Subject:* Re: A proposal for Spark 2.0 > > > > Hmmm... to me, that seems like precisely the kind of thing that argues for > retaining the RDD API but not as the first thing presented to new Spark > developers: "Here's how to use groupBy with DataFrames Until the > optimizer is more fully developed, that won't always get you the best > performance that can be obtained. In these particular circumstances, ..., > you may want to use the low-level RDD API while setting > preservesPartitioning to true. 
Like this" > > > > On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch wrote: > > My understanding is that the RDD's presently have more support for > complete control of partitioning which is a key consideration at scale. > While partitioning control is still piecemeal in DF/DS it would seem > premature to make RDD's a second-tier approach to spark dev. > > > > An example is the use of groupBy when we know that the source relation > (/RDD) is already partitioned on the grouping expressions. AFAIK the spark > sql still does not allow that knowledge to be applied to the optimizer - so > a full shuffle will be performed. However in the native RDD we can use > preservesPartitioning=true. > > > > 2015-11-12 17:42 GMT-08:00 Mark Hamstra : > > The place of the RDD API in 2.0 is also something I've been wondering > about. I think it may be going too far to deprecate it, but changing > emphasis is something that we might consider. The RDD API came well before > DataFrames and DataSets, so programming guides, introductory how-to > articles and the like have, to this point, also tended to emphasize RDDs -- > or at least to deal with them early. What I'm thinking is that with 2.0 > maybe we should overhaul all the documentation to de-emphasize and > reposition RDDs. In this scheme, DataFrames and DataSets would be > introduced and fully addressed before RDDs. They would be presented as the > normal/default/standard way to do things in Spark. RDDs, in contrast, > would be presented later as a kind of lower-level, closer-to-the-metal API > that can be used in atypical, more specialized contexts where DataFrames or > DataSets don't fully fit. > > > > On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao wrote: > > I am not sure what the best practice for this specific problem, but it’s > really worth to think about it in 2.0, as it is a painful issue for lots of > users. > > > > By the way, is it also an opportunity to deprecate the RDD API (or > internal API only?)? 
As lots of its functionality overlapping with > DataFrame or DataSet. > > > > Hao > > > > *From:* Kostas Sakellis [mailto:kos...@cloudera.com] > *Sent:* Friday, November 13, 2015 5:27 AM > *To:* Nicholas Chammas > *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; > Reynold Xin > > > *Subject:* Re: A proposal for Spark 2.0 > > > > I know we want to keep breaking changes to a minimum but I'm hoping that > with Spark 2.0 we can also look at better classpath isolation with user > programs. I propose we build on spark.{driver|executor}.userClassPathFirst, > setting it true by default, and not allow any spark transitive dependencies > to leak into user code. For backwards compatibility we can have a whitelist > if we want but it'd be good if we start requiring user apps to explicitly > pull in all their dependencies. From what I can tell, Hadoop 3 is also > moving in this direction. > > Kostas
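For reference, the flags Kostas proposes building on already exist as experimental settings in Spark 1.x. A minimal sketch of turning them on today (assuming a standard `spark-defaults.conf`, and with the caveat that this can currently break jobs whose transitive dependencies conflict with Spark's own):

```
# spark-defaults.conf (sketch): resolve classes from user jars before Spark's
spark.driver.userClassPathFirst    true
spark.executor.userClassPathFirst  true
```

The same can be set per job, e.g. `spark-submit --conf spark.executor.userClassPathFirst=true ...`; the 2.0 proposal above is essentially to make this behavior the default and shade or hide Spark's transitive dependencies.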
RE: A proposal for Spark 2.0
Agree, more features/apis/optimization need to be added in DF/DS. I mean, we need to think about what kind of RDD APIs we have to provide to developer, maybe the fundamental API is enough, like, the ShuffledRDD etc.. But PairRDDFunctions probably not in this category, as we can do the same thing easily with DF/DS, even better performance. From: Mark Hamstra [mailto:m...@clearstorydata.com] Sent: Friday, November 13, 2015 11:23 AM To: Stephen Boesch Cc: dev@spark.apache.org Subject: Re: A proposal for Spark 2.0 Hmmm... to me, that seems like precisely the kind of thing that argues for retaining the RDD API but not as the first thing presented to new Spark developers: "Here's how to use groupBy with DataFrames Until the optimizer is more fully developed, that won't always get you the best performance that can be obtained. In these particular circumstances, ..., you may want to use the low-level RDD API while setting preservesPartitioning to true. Like this" On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch mailto:java...@gmail.com>> wrote: My understanding is that the RDD's presently have more support for complete control of partitioning which is a key consideration at scale. While partitioning control is still piecemeal in DF/DS it would seem premature to make RDD's a second-tier approach to spark dev. An example is the use of groupBy when we know that the source relation (/RDD) is already partitioned on the grouping expressions. AFAIK the spark sql still does not allow that knowledge to be applied to the optimizer - so a full shuffle will be performed. However in the native RDD we can use preservesPartitioning=true. 2015-11-12 17:42 GMT-08:00 Mark Hamstra mailto:m...@clearstorydata.com>>: The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. 
The RDD API came well before DataFrames and DataSets, so programming guides, introductory how-to articles and the like have, to this point, also tended to emphasize RDDs -- or at least to deal with them early. What I'm thinking is that with 2.0 maybe we should overhaul all the documentation to de-emphasize and reposition RDDs. In this scheme, DataFrames and DataSets would be introduced and fully addressed before RDDs. They would be presented as the normal/default/standard way to do things in Spark. RDDs, in contrast, would be presented later as a kind of lower-level, closer-to-the-metal API that can be used in atypical, more specialized contexts where DataFrames or DataSets don't fully fit. On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao mailto:hao.ch...@intel.com>> wrote: I am not sure what the best practice for this specific problem, but it’s really worth to think about it in 2.0, as it is a painful issue for lots of users. By the way, is it also an opportunity to deprecate the RDD API (or internal API only?)? As lots of its functionality overlapping with DataFrame or DataSet. Hao From: Kostas Sakellis [mailto:kos...@cloudera.com<mailto:kos...@cloudera.com>] Sent: Friday, November 13, 2015 5:27 AM To: Nicholas Chammas Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com<mailto:wi...@qq.com>; dev@spark.apache.org<mailto:dev@spark.apache.org>; Reynold Xin Subject: Re: A proposal for Spark 2.0 I know we want to keep breaking changes to a minimum but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allow any spark transitive dependencies to leak into user code. For backwards compatibility we can have a whitelist if we want but I'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction. 
Kostas On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas mailto:nicholas.cham...@gmail.com>> wrote: With regards to Machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. Current structure of two separate machine learning packages seems to be somewhat confusing. With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten. On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn’t have to wait for everything that we want to deprecate to be replaced all at once. Nick On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander mailto:alexander.ula...@hpe.com>> wrote: Parameter Server is a new feature and thus does not match the goal of 2.0 is “to fix things that are broken in the current API and remove certain deprecated APIs”.
Re: RE: A proposal for Spark 2.0
Yes, I agree with Nan Zhu. I recommend these projects: https://github.com/dmlc/ps-lite (Apache License 2) https://github.com/Microsoft/multiverso (MIT License) Alexander, you may also be interested in the demo (graph on parameter server): https://github.com/witgo/zen/tree/ps_graphx/graphx/src/main/scala/com/github/cloudml/zen/graphx -- Original -- From: "Ulanov, Alexander"; Date: Fri, Nov 13, 2015 01:44 AM To: "Nan Zhu"; "Guoqiang Li"; Cc: "dev@spark.apache.org"; "Reynold Xin"; Subject: RE: A proposal for Spark 2.0 Parameter Server is a new feature and thus does not match the goal of 2.0, which is “to fix things that are broken in the current API and remove certain deprecated APIs”. At the same time I would be happy to have that feature. With regards to Machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. Current structure of two separate machine learning packages seems to be somewhat confusing. With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to Dataframe. This will allow GraphX to evolve with Tungsten. Best regards, Alexander From: Nan Zhu [mailto:zhunanmcg...@gmail.com] Sent: Thursday, November 12, 2015 7:28 AM To: wi...@qq.com Cc: dev@spark.apache.org Subject: Re: A proposal for Spark 2.0 Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it? Best, -- Nan Zhu http://codingcat.me On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote: Who has ideas about machine learning? Spark is missing some features for machine learning, for example the parameter server. On Nov 12, 2015, at 05:32, Matei Zaharia wrote: I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.
Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on there? I imagine we don't need many code changes, just maybe some REPL stuff. Needless to say, but I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important. Matei On Nov 11, 2015, at 4:58 AM, Sean Owen wrote: On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin wrote: to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow). Agree with this stance. Generally, a major release might also be a time to replace some big old API or implementation with a new one, but I don't see obvious candidates. I wouldn't mind turning attention to 2.x sooner than later, unless there's a fairly good reason to continue adding features in 1.x to a 1.7 release. The scope as of 1.6 is already pretty darned big. 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has been end-of-life. By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be quite stable, and 2.10 will have been EOL for a while. I'd propose dropping 2.10. Otherwise it's supported for 2 more years. 2. Remove Hadoop 1 support. I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were sort of 'alpha' and 'beta' releases) and even <2.6. I'm sure we'll think of a number of other small things -- shading a bunch of stuff? reviewing and updating dependencies in light of simpler, more recent dependencies to support from Hadoop etc? Farming out Tachyon to a module? (I felt like someone proposed this?) 
Pop out any Docker stuff to another repo? Continue that same effort for EC2? Farming out some of the "external" integrations to another repo (? controversial) See also anything marked version "2+" in JIRA. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: A proposal for Spark 2.0
Hmmm... to me, that seems like precisely the kind of thing that argues for retaining the RDD API but not as the first thing presented to new Spark developers: "Here's how to use groupBy with DataFrames. Until the optimizer is more fully developed, that won't always get you the best performance that can be obtained. In these particular circumstances, ..., you may want to use the low-level RDD API while setting preservesPartitioning to true. Like this" On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch wrote: > My understanding is that the RDD's presently have more support for > complete control of partitioning which is a key consideration at scale. > While partitioning control is still piecemeal in DF/DS it would seem > premature to make RDD's a second-tier approach to spark dev. > > An example is the use of groupBy when we know that the source relation > (/RDD) is already partitioned on the grouping expressions. AFAIK the spark > sql still does not allow that knowledge to be applied to the optimizer - so > a full shuffle will be performed. However in the native RDD we can use > preservesPartitioning=true. > > 2015-11-12 17:42 GMT-08:00 Mark Hamstra : > >> The place of the RDD API in 2.0 is also something I've been wondering >> about. I think it may be going too far to deprecate it, but changing >> emphasis is something that we might consider. The RDD API came well before >> DataFrames and DataSets, so programming guides, introductory how-to >> articles and the like have, to this point, also tended to emphasize RDDs -- >> or at least to deal with them early. What I'm thinking is that with 2.0 >> maybe we should overhaul all the documentation to de-emphasize and >> reposition RDDs. In this scheme, DataFrames and DataSets would be >> introduced and fully addressed before RDDs. They would be presented as the >> normal/default/standard way to do things in Spark.
RDDs, in contrast, >> would be presented later as a kind of lower-level, closer-to-the-metal API >> that can be used in atypical, more specialized contexts where DataFrames or >> DataSets don't fully fit. >> >> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao wrote: >> >>> I am not sure what the best practice for this specific problem, but it’s >>> really worth to think about it in 2.0, as it is a painful issue for lots of >>> users. >>> >>> >>> >>> By the way, is it also an opportunity to deprecate the RDD API (or >>> internal API only?)? As lots of its functionality overlapping with >>> DataFrame or DataSet. >>> >>> >>> >>> Hao >>> >>> >>> >>> *From:* Kostas Sakellis [mailto:kos...@cloudera.com] >>> *Sent:* Friday, November 13, 2015 5:27 AM >>> *To:* Nicholas Chammas >>> *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; >>> Reynold Xin >>> >>> *Subject:* Re: A proposal for Spark 2.0 >>> >>> >>> >>> I know we want to keep breaking changes to a minimum but I'm hoping that >>> with Spark 2.0 we can also look at better classpath isolation with user >>> programs. I propose we build on spark.{driver|executor}.userClassPathFirst, >>> setting it true by default, and not allow any spark transitive dependencies >>> to leak into user code. For backwards compatibility we can have a whitelist >>> if we want but I'd be good if we start requiring user apps to explicitly >>> pull in all their dependencies. From what I can tell, Hadoop 3 is also >>> moving in this direction. >>> >>> >>> >>> Kostas >>> >>> >>> >>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas < >>> nicholas.cham...@gmail.com> wrote: >>> >>> With regards to Machine learning, it would be great to move useful >>> features from MLlib to ML and deprecate the former. Current structure of >>> two separate machine learning packages seems to be somewhat confusing. >>> >>> With regards to GraphX, it would be great to deprecate the use of RDD in >>> GraphX and switch to Dataframe. 
This will allow GraphX evolve with Tungsten. >>> >>> On that note of deprecating stuff, it might be good to deprecate some >>> things in 2.0 without removing or replacing them immediately. That way 2.0 >>> doesn’t have to wait for everything that we want to deprecate to be >>> replaced all at once. >>> >>> Nick >>> >>> >>> >>> >>> >>>
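The groupBy/partitioning trade-off discussed in this exchange can be sketched in Scala. A hedged example (the data and names are hypothetical) of the RDD-side technique Stephen describes: pre-partition a pair RDD on its key, then use `mapPartitions` with `preservesPartitioning = true` so that a subsequent key-wise operation can reuse the existing partitioner instead of triggering the full shuffle Spark SQL would currently perform:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PreservePartitioningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partitioning-sketch").setMaster("local[2]"))

    // Hash-partition the pair RDD on its key once, up front.
    val byKey = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      .partitionBy(new HashPartitioner(4))

    // The transformation below does not change keys, so we can assert
    // that the partitioner still holds. Spark then skips the shuffle
    // in the reduceByKey that follows.
    val scaled = byKey.mapPartitions(
      iter => iter.map { case (k, v) => (k, v * 10) },
      preservesPartitioning = true)

    val sums = scaled.reduceByKey(_ + _) // no shuffle: partitioner preserved
    sums.collect().foreach(println)
    sc.stop()
  }
}
```

Note the caveat in the thread: if the lambda did change keys, `preservesPartitioning = true` would be an incorrect assertion and could silently produce wrong results, which is part of why this is positioned as a lower-level escape hatch rather than the default API.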
Re: A proposal for Spark 2.0
My understanding is that the RDD's presently have more support for complete control of partitioning which is a key consideration at scale. While partitioning control is still piecemeal in DF/DS it would seem premature to make RDD's a second-tier approach to spark dev. An example is the use of groupBy when we know that the source relation (/RDD) is already partitioned on the grouping expressions. AFAIK the spark sql still does not allow that knowledge to be applied to the optimizer - so a full shuffle will be performed. However in the native RDD we can use preservesPartitioning=true. 2015-11-12 17:42 GMT-08:00 Mark Hamstra : > The place of the RDD API in 2.0 is also something I've been wondering > about. I think it may be going too far to deprecate it, but changing > emphasis is something that we might consider. The RDD API came well before > DataFrames and DataSets, so programming guides, introductory how-to > articles and the like have, to this point, also tended to emphasize RDDs -- > or at least to deal with them early. What I'm thinking is that with 2.0 > maybe we should overhaul all the documentation to de-emphasize and > reposition RDDs. In this scheme, DataFrames and DataSets would be > introduced and fully addressed before RDDs. They would be presented as the > normal/default/standard way to do things in Spark. RDDs, in contrast, > would be presented later as a kind of lower-level, closer-to-the-metal API > that can be used in atypical, more specialized contexts where DataFrames or > DataSets don't fully fit. > > On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao wrote: > >> I am not sure what the best practice for this specific problem, but it’s >> really worth to think about it in 2.0, as it is a painful issue for lots of >> users. >> >> >> >> By the way, is it also an opportunity to deprecate the RDD API (or >> internal API only?)? As lots of its functionality overlapping with >> DataFrame or DataSet. 
>> >> >> >> Hao >> >> >> >> *From:* Kostas Sakellis [mailto:kos...@cloudera.com] >> *Sent:* Friday, November 13, 2015 5:27 AM >> *To:* Nicholas Chammas >> *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; >> Reynold Xin >> >> *Subject:* Re: A proposal for Spark 2.0 >> >> >> >> I know we want to keep breaking changes to a minimum but I'm hoping that >> with Spark 2.0 we can also look at better classpath isolation with user >> programs. I propose we build on spark.{driver|executor}.userClassPathFirst, >> setting it true by default, and not allow any spark transitive dependencies >> to leak into user code. For backwards compatibility we can have a whitelist >> if we want but I'd be good if we start requiring user apps to explicitly >> pull in all their dependencies. From what I can tell, Hadoop 3 is also >> moving in this direction. >> >> >> >> Kostas >> >> >> >> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas < >> nicholas.cham...@gmail.com> wrote: >> >> With regards to Machine learning, it would be great to move useful >> features from MLlib to ML and deprecate the former. Current structure of >> two separate machine learning packages seems to be somewhat confusing. >> >> With regards to GraphX, it would be great to deprecate the use of RDD in >> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten. >> >> On that note of deprecating stuff, it might be good to deprecate some >> things in 2.0 without removing or replacing them immediately. That way 2.0 >> doesn’t have to wait for everything that we want to deprecate to be >> replaced all at once. >> >> Nick >> >> >> >> >> >> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander < >> alexander.ula...@hpe.com> wrote: >> >> Parameter Server is a new feature and thus does not match the goal of 2.0 >> is “to fix things that are broken in the current API and remove certain >> deprecated APIs”. At the same time I would be happy to have that feature. 
>> >> >> >> With regards to Machine learning, it would be great to move useful >> features from MLlib to ML and deprecate the former. Current structure of >> two separate machine learning packages seems to be somewhat confusing. >> >> With regards to GraphX, it would be great to deprecate the use of RDD in >> GraphX and switch to Dataframe. This will allow GraphX to evolve with Tungsten. >> >> >> >> Best regards, Alexander >> >> >> >> *From:* Nan Zhu [mailto:zhunanmcg...@gmail.com]
Re: A proposal for Spark 2.0
The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. The RDD API came well before DataFrames and Datasets, so programming guides, introductory how-to articles and the like have, to this point, also tended to emphasize RDDs -- or at least to deal with them early. What I'm thinking is that with 2.0 maybe we should overhaul all the documentation to de-emphasize and reposition RDDs. In this scheme, DataFrames and Datasets would be introduced and fully addressed before RDDs. They would be presented as the normal/default/standard way to do things in Spark. RDDs, in contrast, would be presented later as a kind of lower-level, closer-to-the-metal API that can be used in atypical, more specialized contexts where DataFrames or Datasets don't fully fit.

On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao wrote:

> I am not sure what the best practice for this specific problem is, but it's really worth thinking about in 2.0, as it is a painful issue for lots of users.
>
> By the way, is it also an opportunity to deprecate the RDD API (or internal API only?), since lots of its functionality overlaps with DataFrame or Dataset?
>
> Hao
>
> From: Kostas Sakellis [mailto:kos...@cloudera.com]
> Sent: Friday, November 13, 2015 5:27 AM
> To: Nicholas Chammas
> Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; Reynold Xin
> Subject: Re: A proposal for Spark 2.0
>
> I know we want to keep breaking changes to a minimum but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allowing any Spark transitive dependencies to leak into user code.
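[Editor's note: a minimal sketch of the repositioning described above -- the DataFrame form shown first as the default, the RDD form second as the lower-level alternative. `logs.txt` is a hypothetical input file; this is an illustration, not code from the thread.]

```scala
// DataFrame-first style (what the overhauled docs would introduce first).
// read.text yields a DataFrame with a single "value" column (Spark 1.6+):
val lines = sqlContext.read.text("logs.txt")
lines.groupBy("value").count().show()

// Lower-level RDD style (presented later, for specialized contexts):
val counts = sc.textFile("logs.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)
```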
RE: A proposal for Spark 2.0
I am not sure what the best practice for this specific problem is, but it's really worth thinking about in 2.0, as it is a painful issue for lots of users.

By the way, is it also an opportunity to deprecate the RDD API (or internal API only?), since lots of its functionality overlaps with DataFrame or Dataset?

Hao

From: Kostas Sakellis [mailto:kos...@cloudera.com]
Sent: Friday, November 13, 2015 5:27 AM
To: Nicholas Chammas
Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; Reynold Xin
Subject: Re: A proposal for Spark 2.0

I know we want to keep breaking changes to a minimum but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allowing any Spark transitive dependencies to leak into user code. For backwards compatibility we can have a whitelist if we want, but it'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction.

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.
>
> On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn't have to wait for everything that we want to deprecate to be replaced all at once.
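[Editor's note: in configuration terms, the isolation Kostas proposes could be sketched roughly as below. The two userClassPathFirst properties are real Spark settings that exist today (defaulting to false); the whitelist property is purely hypothetical, an assumption about what a compatibility mechanism might look like.]

```
# Existing properties, flipped to true by default under this proposal:
spark.driver.userClassPathFirst     true
spark.executor.userClassPathFirst   true

# Hypothetical: a whitelist of Spark-provided packages that user code
# could still resolve from Spark's classpath for backwards compatibility.
# spark.classpath.sharedPackages    org.apache.hadoop,scala
```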
Re: A proposal for Spark 2.0
I know we want to keep breaking changes to a minimum but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allowing any Spark transitive dependencies to leak into user code. For backwards compatibility we can have a whitelist if we want, but it'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction.

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.
>
> On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn't have to wait for everything that we want to deprecate to be replaced all at once.
>
> Nick
>
> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <alexander.ula...@hpe.com> wrote:
>
>> Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time I would be happy to have that feature.
>>
>> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.
>>
>> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.
Re: A proposal for Spark 2.0
With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn't have to wait for everything that we want to deprecate to be replaced all at once.

Nick

On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander wrote:

> Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time I would be happy to have that feature.
>
> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.
>
> Best regards, Alexander
>
> From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
> Sent: Thursday, November 12, 2015 7:28 AM
> To: wi...@qq.com
> Cc: dev@spark.apache.org
> Subject: Re: A proposal for Spark 2.0
>
> Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?
>
> Best,
>
> --
> Nan Zhu
> http://codingcat.me
>
> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>
> Does anyone have ideas about machine learning? Spark is missing some features for machine learning, for example a parameter server.
RE: A proposal for Spark 2.0
Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time I would be happy to have that feature.

With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

Best regards, Alexander

From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
Sent: Thursday, November 12, 2015 7:28 AM
To: wi...@qq.com
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?

Best,

--
Nan Zhu
http://codingcat.me

On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:

Does anyone have ideas about machine learning? Spark is missing some features for machine learning, for example a parameter server.

On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:

I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on there? I imagine we don't need many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.
Matei

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
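[Editor's note: the MLlib-to-ML move discussed above is essentially a shift from RDD-based to DataFrame-based pipelines. A minimal sketch of the DataFrame-based spark.ml style follows; `training` and `test` are hypothetical DataFrames with "label" and "features" columns, and the parameter values are illustrative.]

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression

// spark.ml estimators operate on DataFrames rather than RDD[LabeledPoint]:
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

val pipeline = new Pipeline().setStages(Array(lr))
val model = pipeline.fit(training)           // training: DataFrame("label", "features")
model.transform(test).select("prediction")   // test: DataFrame with a "features" column
```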
Re: A proposal for Spark 2.0
Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?

Best,

--
Nan Zhu
http://codingcat.me

On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:

> Does anyone have ideas about machine learning? Spark is missing some features for machine learning, for example a parameter server.
Re: A proposal for Spark 2.0
Does anyone have ideas about machine learning? Spark is missing some features for machine learning, for example a parameter server.
Re: A proposal for Spark 2.0
I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on there? I imagine we don't need many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.

Matei

> On Nov 11, 2015, at 4:58 AM, Sean Owen wrote:
>
> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin wrote:
>> to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
>
> Agree with this stance. Generally, a major release might also be a time to replace some big old API or implementation with a new one, but I don't see obvious candidates.
>
> I wouldn't mind turning attention to 2.x sooner than later, unless there's a fairly good reason to continue adding features in 1.x to a 1.7 release. The scope as of 1.6 is already pretty darned big.
>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has been end-of-life.
>
> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be quite stable, and 2.10 will have been EOL for a while. I'd propose dropping 2.10. Otherwise it's supported for 2 more years.
>
>> 2. Remove Hadoop 1 support.
>
> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were sort of 'alpha' and 'beta' releases) and even <2.6.
> I'm sure we'll think of a number of other small things -- shading a bunch of stuff? reviewing and updating dependencies in light of simpler, more recent dependencies to support from Hadoop etc.?
>
> Farming out Tachyon to a module? (I felt like someone proposed this?)
> Pop out any Docker stuff to another repo?
> Continue that same effort for EC2?
> Farming out some of the "external" integrations to another repo (controversial?)
>
> See also anything marked version "2+" in JIRA.
Re: A proposal for Spark 2.0
Resending my earlier message because it wasn't accepted. I would like to add a proposal to upgrade dependency jars when the upgrade does not break APIs and fixes a bug. To be more specific, I would like to see Kryo upgraded from 2.21 to 3.x. Kryo 2.x has a bug (e.g. SPARK-7708) that is blocking its usage in production environments. Other projects like Chill also want to upgrade Kryo to 3.x but are blocked because Spark won't upgrade. I think the OSS community at large will benefit if we can coordinate an upgrade to Kryo 3.x.
Re: A proposal for Spark 2.0
It looks like Chill is willing to upgrade its Kryo to 3.x if Spark and Hive will. As it stands, Spark, Chill, and Hive each ship a Kryo jar, but it really can't be used because Kryo 2 can't serialize/deserialize some classes. Since Spark 2.0 is a major release, it really would be nice if we could resolve the Kryo issue. https://github.com/twitter/chill/pull/230#issuecomment-155845959
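For context, the application-side knobs around Kryo in the 1.x line are configuration only; a minimal sketch is below (`MyRecord` is a hypothetical application class, and note that registration does not fix the underlying Kryo 2.x serialization bug, which is exactly why the jar upgrade matters):

```scala
import org.apache.spark.SparkConf

// `MyRecord` is a hypothetical stand-in for whatever class your job
// needs Kryo to round-trip.
case class MyRecord(id: Long, payload: String)

val conf = new SparkConf()
  .setAppName("kryo-sketch")
  // Kryo is opt-in in the 1.x line; Java serialization is the default:
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registration avoids writing full class names into every record:
  .registerKryoClasses(Array(classOf[MyRecord]))
```

If a class still trips the Kryo 2.x bug, the only real escape hatch is falling back to `JavaSerializer`, which gives up Kryo's speed; hence the interest in coordinating a 3.x upgrade across Spark, Chill, and Hive.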
Re: A proposal for Spark 2.0
If Scala 2.12 will require Java 8 and we want to enable cross-compiling Spark against Scala 2.11 and 2.12, couldn't we just make Java 8 a requirement if you want to use Scala 2.12? On Wed, Nov 11, 2015 at 9:29 AM, Koert Kuipers wrote: > i would drop scala 2.10, but definitely keep java 7 > > cross build for scala 2.12 is great, but i dont know how that works with > java 8 requirement. dont want to make java 8 mandatory. > > and probably stating the obvious, but a lot of apis got polluted due to > binary compatibility requirement. cleaning that up assuming only source > compatibility would be a good idea, right? > > On Tue, Nov 10, 2015 at 6:10 PM, Reynold Xin wrote: > >> I’m starting a new thread since the other one got intermixed with feature >> requests. Please refrain from making feature request in this thread. Not >> that we shouldn’t be adding features, but we can always add features in >> 1.7, 2.1, 2.2, ... >> >> First - I want to propose a premise for how to think about Spark 2.0 and >> major releases in Spark, based on discussion with several members of the >> community: a major release should be low overhead and minimally disruptive >> to the Spark community. A major release should not be very different from a >> minor release and should not be gated based on new features. The main >> purpose of a major release is an opportunity to fix things that are broken >> in the current API and remove certain deprecated APIs (examples follow). >> >> For this reason, I would *not* propose doing major releases to break >> substantial API's or perform large re-architecting that prevent users from >> upgrading. Spark has always had a culture of evolving architecture >> incrementally and making changes - and I don't think we want to change this >> model. In fact, we’ve released many architectural changes on the 1.X line. 
>> >> If the community likes the above model, then to me it seems reasonable to >> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately >> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of >> major releases every 2 years seems doable within the above model. >> >> Under this model, here is a list of example things I would propose doing >> in Spark 2.0, separated into APIs and Operation/Deployment: >> >> >> APIs >> >> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in >> Spark 1.x. >> >> 2. Remove Akka from Spark’s API dependency (in streaming), so user >> applications can use Akka (SPARK-5293). We have gotten a lot of complaints >> about user applications being unable to use Akka due to Spark’s dependency >> on Akka. >> >> 3. Remove Guava from Spark’s public API (JavaRDD Optional). >> >> 4. Better class package structure for low level developer API’s. In >> particular, we have some DeveloperApi (mostly various listener-related >> classes) added over the years. Some packages include only one or two public >> classes but a lot of private classes. A better structure is to have public >> classes isolated to a few public packages, and these public packages should >> have minimal private classes for low level developer APIs. >> >> 5. Consolidate task metric and accumulator API. Although having some >> subtle differences, these two are very similar but have completely >> different code path. >> >> 6. Possibly making Catalyst, Dataset, and DataFrame more general by >> moving them to other package(s). They are already used beyond SQL, e.g. in >> ML pipelines, and will be used by streaming also. >> >> >> Operation/Deployment >> >> 1. Scala 2.11 as the default build. We should still support Scala 2.10, >> but it has been end-of-life. >> >> 2. Remove Hadoop 1 support. >> >> 3. Assembly-free distribution of Spark: don’t require building an >> enormous assembly jar in order to run Spark. >> >> >
Re: A proposal for Spark 2.0
good point about dropping <2.2 for hadoop. you dont want to deal with protobuf 2.4 for example On Wed, Nov 11, 2015 at 4:58 AM, Sean Owen wrote: > On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin wrote: > > to the Spark community. A major release should not be very different > from a > > minor release and should not be gated based on new features. The main > > purpose of a major release is an opportunity to fix things that are > broken > > in the current API and remove certain deprecated APIs (examples follow). > > Agree with this stance. Generally, a major release might also be a > time to replace some big old API or implementation with a new one, but > I don't see obvious candidates. > > I wouldn't mind turning attention to 2.x sooner than later, unless > there's a fairly good reason to continue adding features in 1.x to a > 1.7 release. The scope as of 1.6 is already pretty darned big. > > > > 1. Scala 2.11 as the default build. We should still support Scala 2.10, > but > > it has been end-of-life. > > By the time 2.x rolls around, 2.12 will be the main version, 2.11 will > be quite stable, and 2.10 will have been EOL for a while. I'd propose > dropping 2.10. Otherwise it's supported for 2 more years. > > > > 2. Remove Hadoop 1 support. > > I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were > sort of 'alpha' and 'beta' releases) and even <2.6. > > I'm sure we'll think of a number of other small things -- shading a > bunch of stuff? reviewing and updating dependencies in light of > simpler, more recent dependencies to support from Hadoop etc? > > Farming out Tachyon to a module? (I felt like someone proposed this?) > Pop out any Docker stuff to another repo? > Continue that same effort for EC2? > Farming out some of the "external" integrations to another repo (? > controversial) > > See also anything marked version "2+" in JIRA. 
Re: A proposal for Spark 2.0
i would drop scala 2.10, but definitely keep java 7 cross build for scala 2.12 is great, but i dont know how that works with java 8 requirement. dont want to make java 8 mandatory. and probably stating the obvious, but a lot of apis got polluted due to binary compatibility requirement. cleaning that up assuming only source compatibility would be a good idea, right? On Tue, Nov 10, 2015 at 6:10 PM, Reynold Xin wrote: > I’m starting a new thread since the other one got intermixed with feature > requests. Please refrain from making feature request in this thread. Not > that we shouldn’t be adding features, but we can always add features in > 1.7, 2.1, 2.2, ... > > First - I want to propose a premise for how to think about Spark 2.0 and > major releases in Spark, based on discussion with several members of the > community: a major release should be low overhead and minimally disruptive > to the Spark community. A major release should not be very different from a > minor release and should not be gated based on new features. The main > purpose of a major release is an opportunity to fix things that are broken > in the current API and remove certain deprecated APIs (examples follow). > > For this reason, I would *not* propose doing major releases to break > substantial API's or perform large re-architecting that prevent users from > upgrading. Spark has always had a culture of evolving architecture > incrementally and making changes - and I don't think we want to change this > model. In fact, we’ve released many architectural changes on the 1.X line. > > If the community likes the above model, then to me it seems reasonable to > do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately > after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of > major releases every 2 years seems doable within the above model. 
> > Under this model, here is a list of example things I would propose doing > in Spark 2.0, separated into APIs and Operation/Deployment: > > > APIs > > 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in > Spark 1.x. > > 2. Remove Akka from Spark’s API dependency (in streaming), so user > applications can use Akka (SPARK-5293). We have gotten a lot of complaints > about user applications being unable to use Akka due to Spark’s dependency > on Akka. > > 3. Remove Guava from Spark’s public API (JavaRDD Optional). > > 4. Better class package structure for low level developer API’s. In > particular, we have some DeveloperApi (mostly various listener-related > classes) added over the years. Some packages include only one or two public > classes but a lot of private classes. A better structure is to have public > classes isolated to a few public packages, and these public packages should > have minimal private classes for low level developer APIs. > > 5. Consolidate task metric and accumulator API. Although having some > subtle differences, these two are very similar but have completely > different code path. > > 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving > them to other package(s). They are already used beyond SQL, e.g. in ML > pipelines, and will be used by streaming also. > > > Operation/Deployment > > 1. Scala 2.11 as the default build. We should still support Scala 2.10, > but it has been end-of-life. > > 2. Remove Hadoop 1 support. > > 3. Assembly-free distribution of Spark: don’t require building an enormous > assembly jar in order to run Spark. > >
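The API pollution Koert mentions typically comes from overload ladders kept for binary compatibility. A toy sketch (method names are hypothetical, not actual Spark APIs) of how promising only source compatibility would let such ladders collapse:

```scala
// Binary compatibility means every previously published signature must
// keep existing, so an evolving API accumulates an overload ladder:
class LadderApi {
  def sample(fraction: Double): String = sample(fraction, 0L)
  def sample(fraction: Double, seed: Long): String = s"sample($fraction, $seed)"
}

// With only *source* compatibility promised, the ladder collapses into a
// single method with a default argument; call sites recompile unchanged:
class CollapsedApi {
  def sample(fraction: Double, seed: Long = 0L): String = s"sample($fraction, $seed)"
}
```

Existing sources like `api.sample(0.1)` compile against either shape, but a jar built against `LadderApi` would fail to link against `CollapsedApi` (default arguments are resolved at compile time), which is exactly the trade-off being discussed.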
Re: A proposal for Spark 2.0
Hi, Reconsidering the execution model behind Streaming would be a good candidate here, as Spark will not be able to provide the low latency and sophisticated windowing semantics that more and more use cases will require. Maybe relaxing the strict batch model would help a lot. (Mainly this would hit the shuffling, but the shuffle package suffers from overlapping functionality and a lack of good modularity anyway. Look at how coalesce is implemented, for example; inefficiency also kicks in there.) On Wed, Nov 11, 2015 at 12:48 PM Tim Preece wrote: > Considering Spark 2.x will run for 2 years, would moving up to Scala 2.12 ( > pencilled in for Jan 2016 ) make any sense ? - although that would then > pre-req Java 8.
Re: A proposal for Spark 2.0
Considering Spark 2.x will run for 2 years, would moving up to Scala 2.12 ( pencilled in for Jan 2016 ) make any sense ? - although that would then pre-req Java 8.
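Cross-building is mostly a build-level switch. A build.sbt sketch of what supporting 2.11 and 2.12 side by side could look like (version numbers are illustrative, since Scala 2.12 was not yet final at the time):

```scala
// build.sbt fragment (sketch). `+compile` / `+publish` iterate over
// every version listed in crossScalaVersions.
scalaVersion := "2.11.7"
crossScalaVersions := Seq("2.10.6", "2.11.7", "2.12.0")

// Scala 2.12 targets Java 8 bytecode, so the JDK 8 requirement can be
// scoped to the 2.12 build alone rather than imposed on every user:
scalacOptions ++= {
  if (scalaVersion.value.startsWith("2.12")) Seq("-target:jvm-1.8")
  else Seq("-target:jvm-1.7")
}
```

This is the sense in which Java 8 could be a requirement only "if you want to use Scala 2.12": the 2.10/2.11 artifacts would keep working on Java 7.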
Re: A proposal for Spark 2.0
On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin wrote: > to the Spark community. A major release should not be very different from a > minor release and should not be gated based on new features. The main > purpose of a major release is an opportunity to fix things that are broken > in the current API and remove certain deprecated APIs (examples follow). Agree with this stance. Generally, a major release might also be a time to replace some big old API or implementation with a new one, but I don't see obvious candidates. I wouldn't mind turning attention to 2.x sooner than later, unless there's a fairly good reason to continue adding features in 1.x to a 1.7 release. The scope as of 1.6 is already pretty darned big. > 1. Scala 2.11 as the default build. We should still support Scala 2.10, but > it has been end-of-life. By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be quite stable, and 2.10 will have been EOL for a while. I'd propose dropping 2.10. Otherwise it's supported for 2 more years. > 2. Remove Hadoop 1 support. I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were sort of 'alpha' and 'beta' releases) and even <2.6. I'm sure we'll think of a number of other small things -- shading a bunch of stuff? reviewing and updating dependencies in light of simpler, more recent dependencies to support from Hadoop etc? Farming out Tachyon to a module? (I felt like someone proposed this?) Pop out any Docker stuff to another repo? Continue that same effort for EC2? Farming out some of the "external" integrations to another repo (? controversial) See also anything marked version "2+" in JIRA.
Re: A proposal for Spark 2.0
Hi, I fully agree with that. Actually, I'm working on a PR to add "client" and "exploded" profiles to the Maven build. The client profile creates a spark-client-assembly jar, much more lightweight than the spark-assembly. In our case, we construct jobs that don't require all of the Spark server side. Right now the minimal size of the generated jar is about 120MB, which is painful at spark-submit time. That's why I started to remove unnecessary dependencies from spark-assembly. On the other hand, I'm also working on the "exploded" mode: instead of using a fat monolithic spark-assembly jar file, an exploded layout allows users to view/change the dependencies. For the client profile, I already have something ready, and I will propose the PR very soon (by the end of this week hopefully). For the exploded profile, I need more time. My $0.02 Regards JB On 11/11/2015 12:53 AM, Reynold Xin wrote: On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas wrote: > 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark. Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means. Right now we ship Spark using a single assembly jar, which causes a few different problems: - the total number of classes is limited on some configurations - dependency swapping is harder The proposal is to just avoid a single fat jar. -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
Re: A proposal for Spark 2.0
Agree, it makes sense. Regards JB On 11/11/2015 01:28 AM, Reynold Xin wrote: Echoing Shivaram here. I don't think it makes a lot of sense to add more features to the 1.x line. We should still do critical bug fixes though. On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman mailto:shiva...@eecs.berkeley.edu>> wrote: +1 On a related note I think making it lightweight will ensure that we stay on the current release schedule and don't unnecessarily delay 2.0 to wait for new features / big architectural changes. In terms of fixes to 1.x, I think our current policy of back-porting fixes to older releases would still apply. I don't think developing new features on both 1.x and 2.x makes a lot of sense as we would like users to switch to 2.x. Shivaram On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis mailto:kos...@cloudera.com>> wrote: > +1 on a lightweight 2.0 > > What is the thinking around the 1.x line after Spark 2.0 is released? If not > terminated, how will we determine what goes into each major version line? > Will 1.x only be for stability fixes? > > Thanks, > Kostas > > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell mailto:pwend...@gmail.com>> wrote: >> >> I also feel the same as Reynold. I agree we should minimize API breaks and >> focus on fixing things around the edge that were mistakes (e.g. exposing >> Guava and Akka) rather than any overhaul that could fragment the community. >> Ideally a major release is a lightweight process we can do every couple of >> years, with minimal impact for users. >> >> - Patrick >> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas >> mailto:nicholas.cham...@gmail.com>> wrote: >>> >>> > For this reason, I would *not* propose doing major releases to break >>> > substantial API's or perform large re-architecting that prevent users from >>> > upgrading. Spark has always had a culture of evolving architecture >>> > incrementally and making changes - and I don't think we want to change this >>> > model. >>> >>> +1 for this. 
The Python community went through a lot of turmoil over the >>> Python 2 -> Python 3 transition because the upgrade process was too painful >>> for too long. The Spark community will benefit greatly from our explicitly >>> looking to avoid a similar situation. >>> >>> > 3. Assembly-free distribution of Spark: don’t require building an >>> > enormous assembly jar in order to run Spark. >>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free >>> distribution means. >>> >>> Nick >>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin mailto:r...@databricks.com>> wrote: I’m starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature request in this thread. Not that we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 2.2, ... First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow). For this reason, I would *not* propose doing major releases to break substantial API's or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we’ve released many architectural changes on the 1.X line. If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model. 
Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment: APIs 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x. 2. Remove Akka from Spark’s API dep
Re: A proposal for Spark 2.0
To take a stab at an example of something concrete and anticipatory I can go back to something I mentioned previously. It's not really a good example because I don't mean to imply that I believe that its premises are true, but try to go with it. If we were to decide that real-time, event-based streaming is something that we really think we'll want to do in the 2.x cycle and that the current API (after having deprecations removed and clear mistakes/inadequacies remedied) isn't adequate to support that, would we want to "take our best shot" at defining a new API at the outset of 2.0? Another way of looking at it is whether API changes in 2.0 should be entirely backward-looking, trying to fix problems that we've already identified, or whether there is room for some forward-looking changes that are intended to open new directions for Spark development. On Tue, Nov 10, 2015 at 7:04 PM, Mark Hamstra wrote: > Heh... ok, I was intentionally pushing those bullet points to be extreme > to find where people would start pushing back, and I'll agree that we do > probably want some new features in 2.0 -- but I think we've got good > agreement that new features aren't really the main point of doing a 2.0 > release. > > I don't really have a concrete example of an anticipatory change, and > that's actually kind of the problem with trying to anticipate what we'll > need in the way of new public API and the like: Until what we already have > is clearly inadequate, it's hard to concretely imagine how things really > should be. At this point I don't have anything specific where I can say "I > really want to do __ with Spark in the future, and I think it should be > changed in this way in 2.0 to allow me to do that." 
I'm just wondering > whether we want to even entertain those kinds of change requests if people > have them, or whether we can just delay making those kinds of decisions > until it is really obvious that what we have does't work and that there is > clearly something better that should be done. > > On Tue, Nov 10, 2015 at 6:51 PM, Reynold Xin wrote: > >> Mark, >> >> I think we are in agreement, although I wouldn't go to the extreme and >> say "a release with no new features might even be best." >> >> Can you elaborate "anticipatory changes"? A concrete example or so would >> be helpful. >> >> On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra >> wrote: >> >>> I'm liking the way this is shaping up, and I'd summarize it this way >>> (let me know if I'm misunderstanding or misrepresenting anything): >>> >>>- New features are not at all the focus of Spark 2.0 -- in fact, a >>>release with no new features might even be best. >>>- Remove deprecated API that we agree really should be deprecated. >>>- Fix/change publicly-visible things that anyone who has spent any >>>time looking at already knows are mistakes or should be done better, but >>>that can't be changed within 1.x. >>> >>> Do we want to attempt anticipatory changes at all? In other words, are >>> there things we want to do in 2.x for which we already know that we'll want >>> to make publicly-visible changes or that, if we don't add or change it now, >>> will fall into the "everybody knows it shouldn't be that way" category when >>> it comes time to discuss the Spark 3.0 release? I'd be fine if we don't >>> try at all to anticipate what is needed -- working from the premise that >>> being forced into a 3.x release earlier than we expect would be less >>> painful than trying to back out a mistake made at the outset of 2.0 while >>> trying to guess what we'll need. >>> >>> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin >>> wrote: >>> I’m starting a new thread since the other one got intermixed with feature requests. 
Please refrain from making feature request in this thread. Not that we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 2.2, ... First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow). For this reason, I would *not* propose doing major releases to break substantial API's or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we’ve released many architectural changes on the 1.X line. If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spa
Re: A proposal for Spark 2.0
Heh... ok, I was intentionally pushing those bullet points to be extreme to find where people would start pushing back, and I'll agree that we do probably want some new features in 2.0 -- but I think we've got good agreement that new features aren't really the main point of doing a 2.0 release. I don't really have a concrete example of an anticipatory change, and that's actually kind of the problem with trying to anticipate what we'll need in the way of new public API and the like: Until what we already have is clearly inadequate, it's hard to concretely imagine how things really should be. At this point I don't have anything specific where I can say "I really want to do __ with Spark in the future, and I think it should be changed in this way in 2.0 to allow me to do that." I'm just wondering whether we want to even entertain those kinds of change requests if people have them, or whether we can just delay making those kinds of decisions until it is really obvious that what we have doesn't work and that there is clearly something better that should be done. On Tue, Nov 10, 2015 at 6:51 PM, Reynold Xin wrote: > Mark, > > I think we are in agreement, although I wouldn't go to the extreme and say > "a release with no new features might even be best." > > Can you elaborate "anticipatory changes"? A concrete example or so would > be helpful. > > On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra > wrote: > >> I'm liking the way this is shaping up, and I'd summarize it this way (let >> me know if I'm misunderstanding or misrepresenting anything): >> >>- New features are not at all the focus of Spark 2.0 -- in fact, a >>release with no new features might even be best. >>- Remove deprecated API that we agree really should be deprecated. >>- Fix/change publicly-visible things that anyone who has spent any >>time looking at already knows are mistakes or should be done better, but >>that can't be changed within 1.x. >> >> Do we want to attempt anticipatory changes at all? 
In other words, are >> there things we want to do in 2.x for which we already know that we'll want >> to make publicly-visible changes or that, if we don't add or change it now, >> will fall into the "everybody knows it shouldn't be that way" category when >> it comes time to discuss the Spark 3.0 release? I'd be fine if we don't >> try at all to anticipate what is needed -- working from the premise that >> being forced into a 3.x release earlier than we expect would be less >> painful than trying to back out a mistake made at the outset of 2.0 while >> trying to guess what we'll need. >> >> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin wrote: >> >>> I’m starting a new thread since the other one got intermixed with >>> feature requests. Please refrain from making feature request in this >>> thread. Not that we shouldn’t be adding features, but we can always add >>> features in 1.7, 2.1, 2.2, ... >>> >>> First - I want to propose a premise for how to think about Spark 2.0 and >>> major releases in Spark, based on discussion with several members of the >>> community: a major release should be low overhead and minimally disruptive >>> to the Spark community. A major release should not be very different from a >>> minor release and should not be gated based on new features. The main >>> purpose of a major release is an opportunity to fix things that are broken >>> in the current API and remove certain deprecated APIs (examples follow). >>> >>> For this reason, I would *not* propose doing major releases to break >>> substantial API's or perform large re-architecting that prevent users from >>> upgrading. Spark has always had a culture of evolving architecture >>> incrementally and making changes - and I don't think we want to change this >>> model. In fact, we’ve released many architectural changes on the 1.X line. 
>>> >>> If the community likes the above model, then to me it seems reasonable >>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or >>> immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A >>> cadence of major releases every 2 years seems doable within the above model. >>> >>> Under this model, here is a list of example things I would propose doing >>> in Spark 2.0, separated into APIs and Operation/Deployment: >>> >>> >>> APIs >>> >>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in >>> Spark 1.x. >>> >>> 2. Remove Akka from Spark’s API dependency (in streaming), so user >>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints >>> about user applications being unable to use Akka due to Spark’s dependency >>> on Akka. >>> >>> 3. Remove Guava from Spark’s public API (JavaRDD Optional). >>> >>> 4. Better class package structure for low level developer API’s. In >>> particular, we have some DeveloperApi (mostly various listener-related >>> classes) added over the years. Some packages include only one or two public >>> classes but a lot of pri
Re: A proposal for Spark 2.0
On Tue, Nov 10, 2015 at 6:51 PM, Reynold Xin wrote: > I think we are in agreement, although I wouldn't go to the extreme and say > "a release with no new features might even be best." > > Can you elaborate "anticipatory changes"? A concrete example or so would be > helpful. I don't know if that's what Mark had in mind, but I'd count the "remove Guava Optional from Java API" in that category. It would be nice to have an alternative before that API is removed, although I have no idea how you'd do it nicely, given that they're all in return types (so overloading doesn't really work).
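The point about return types can be made concrete: overloads must differ in their parameter lists, so a Guava `Optional` sitting in a return position cannot be migrated by simply adding an overload. A self-contained sketch (all names here are hypothetical stand-ins, not the real JavaRDD API):

```scala
// Stand-in for com.google.common.base.Optional, so this sketch compiles
// without Guava on the classpath:
final case class GuavaStyleOptional[A](underlying: Option[A])

class PairApi {
  // 1.x-era signature leaking the Guava-style type in the return position:
  def firstValue(xs: Seq[String]): GuavaStyleOptional[String] =
    GuavaStyleOptional(xs.headOption)

  // An overload differing only in return type is rejected by the compiler
  // (double definition), so this cannot be staged as an overload:
  //   def firstValue(xs: Seq[String]): java.util.Optional[String]
  // Any transition path therefore needs a *new* method name; java.util.Optional
  // is used here purely as "some other Optional" and assumes Java 8:
  def firstValueJava(xs: Seq[String]): java.util.Optional[String] =
    xs.headOption.fold(java.util.Optional.empty[String]())(v => java.util.Optional.of(v))
}
```

So the realistic choices are a parallel method name like the hypothetical `firstValueJava` above, or a clean break in 2.0, which is why this lands in the "anticipatory changes" bucket.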
Re: A proposal for Spark 2.0
Mark, I think we are in agreement, although I wouldn't go to the extreme and say "a release with no new features might even be best." Can you elaborate "anticipatory changes"? A concrete example or so would be helpful. On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra wrote: > I'm liking the way this is shaping up, and I'd summarize it this way (let > me know if I'm misunderstanding or misrepresenting anything): > >- New features are not at all the focus of Spark 2.0 -- in fact, a >release with no new features might even be best. >- Remove deprecated API that we agree really should be deprecated. >- Fix/change publicly-visible things that anyone who has spent any >time looking at already knows are mistakes or should be done better, but >that can't be changed within 1.x. > > Do we want to attempt anticipatory changes at all? In other words, are > there things we want to do in 2.x for which we already know that we'll want > to make publicly-visible changes or that, if we don't add or change it now, > will fall into the "everybody knows it shouldn't be that way" category when > it comes time to discuss the Spark 3.0 release? I'd be fine if we don't > try at all to anticipate what is needed -- working from the premise that > being forced into a 3.x release earlier than we expect would be less > painful than trying to back out a mistake made at the outset of 2.0 while > trying to guess what we'll need. > > On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin wrote: > >> I’m starting a new thread since the other one got intermixed with feature >> requests. Please refrain from making feature request in this thread. Not >> that we shouldn’t be adding features, but we can always add features in >> 1.7, 2.1, 2.2, ... >> >> First - I want to propose a premise for how to think about Spark 2.0 and >> major releases in Spark, based on discussion with several members of the >> community: a major release should be low overhead and minimally disruptive >> to the Spark community. 
A major release should not be very different from a >> minor release and should not be gated based on new features. The main >> purpose of a major release is an opportunity to fix things that are broken >> in the current API and remove certain deprecated APIs (examples follow). >> >> For this reason, I would *not* propose doing major releases to break >> substantial APIs or perform large re-architecting that prevents users from >> upgrading. Spark has always had a culture of evolving architecture >> incrementally and making changes - and I don't think we want to change this >> model. In fact, we’ve released many architectural changes on the 1.X line. >> >> If the community likes the above model, then to me it seems reasonable to >> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately >> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of >> major releases every 2 years seems doable within the above model. >> >> Under this model, here is a list of example things I would propose doing >> in Spark 2.0, separated into APIs and Operation/Deployment: >> >> >> APIs >> >> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in >> Spark 1.x. >> >> 2. Remove Akka from Spark’s API dependency (in streaming), so user >> applications can use Akka (SPARK-5293). We have gotten a lot of complaints >> about user applications being unable to use Akka due to Spark’s dependency >> on Akka. >> >> 3. Remove Guava from Spark’s public API (JavaRDD Optional). >> >> 4. Better class package structure for low level developer APIs. In >> particular, we have some DeveloperApi (mostly various listener-related >> classes) added over the years. Some packages include only one or two public >> classes but a lot of private classes. A better structure is to have public >> classes isolated to a few public packages, and these public packages should >> have minimal private classes for low level developer APIs. >> >> 5. 
Consolidate task metric and accumulator API. Although they have some >> subtle differences, these two are very similar but have completely >> different code paths. >> >> 6. Possibly making Catalyst, Dataset, and DataFrame more general by >> moving them to other package(s). They are already used beyond SQL, e.g. in >> ML pipelines, and will be used by streaming also. >> >> >> Operation/Deployment >> >> 1. Scala 2.11 as the default build. We should still support Scala 2.10, >> but it has reached end-of-life. >> >> 2. Remove Hadoop 1 support. >> >> 3. Assembly-free distribution of Spark: don’t require building an >> enormous assembly jar in order to run Spark. >> >> >
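For context on item 3 in the API list above (removing Guava from the public API): the problem is that `JavaRDD`/`JavaPairRDD` methods returned Guava's `com.google.common.base.Optional`, so every user application had to compile against whatever Guava version sat on Spark's classpath, and Spark could neither shade nor upgrade Guava without breaking callers. The sketch below is illustrative only, not Spark code; the class and method names are invented. (Spark 2.0 ultimately replaced Guava's `Optional` with its own `org.apache.spark.api.java.Optional`, since `java.util.Optional` requires Java 8.)

```java
// Illustrative sketch only -- not Spark code. Shows why returning a
// third-party type from a public API couples every user to that library's
// version, and the decoupled alternative: return a type the project
// (or the JDK) owns.
import java.util.Optional;

public class OptionalApiSketch {

    // Before (hypothetical): the signature leaks Guava into user code.
    //   public com.google.common.base.Optional<String> lookup(String key)
    // Any caller must compile against the exact Guava version on the
    // framework's classpath, so the framework can't shade or upgrade Guava.

    // After: the signature uses a JDK-owned (or project-owned) type, and
    // Guava becomes a private implementation detail that can change freely.
    static Optional<String> lookup(String key) {
        return "driver.host".equals(key) ? Optional.of("localhost") : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(lookup("driver.host").orElse("unset")); // prints "localhost"
        System.out.println(lookup("driver.port").orElse("unset")); // prints "unset"
    }
}
```

The same shading argument applies to item 2 (Akka): a dependency that appears in public signatures can never be hidden from user applications.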
Re: A proposal for Spark 2.0
Agree. If it is deprecated, get rid of it in 2.0. If the deprecation was a mistake, let's fix that. Suds Sent from my iPhone On Nov 10, 2015, at 5:04 PM, Reynold Xin wrote: Maybe a better idea is to un-deprecate an API if it is too important to not be removed. I don't think we can drop Java 7 support. It's way too soon. On Tue, Nov 10, 2015 at 4:59 PM, Mark Hamstra wrote: > Really, Sandy? "Extra consideration" even for already-deprecated API? If > we're not going to remove these with a major version change, then just when > will we remove them? > > On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza > wrote: > >> Another +1 to Reynold's proposal. >> >> Maybe this is obvious, but I'd like to advocate against a blanket removal >> of deprecated / developer APIs. Many APIs can likely be removed without >> material impact (e.g. the SparkContext constructor that takes preferred >> node location data), while others likely see heavier usage (e.g. I wouldn't >> be surprised if mapPartitionsWithContext was baked into a number of apps) >> and merit a little extra consideration. >> >> Maybe also obvious, but I think a migration guide with API equivalents and >> the like would be incredibly useful in easing the transition. >> >> -Sandy >> >> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin wrote: >> >>> Echoing Shivaram here. I don't think it makes a lot of sense to add more >>> features to the 1.x line. We should still do critical bug fixes though. >>> >>> >>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman < >>> shiva...@eecs.berkeley.edu> wrote: >>> +1 On a related note I think making it lightweight will ensure that we stay on the current release schedule and don't unnecessarily delay 2.0 to wait for new features / big architectural changes. In terms of fixes to 1.x, I think our current policy of back-porting fixes to older releases would still apply. I don't think developing new features on both 1.x and 2.x makes a lot of sense as we would like users to switch to 2.x. 
Shivaram On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis wrote: > +1 on a lightweight 2.0 > > What is the thinking around the 1.x line after Spark 2.0 is released? If not > terminated, how will we determine what goes into each major version line? > Will 1.x only be for stability fixes? > > Thanks, > Kostas > > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell wrote: >> >> I also feel the same as Reynold. I agree we should minimize API breaks and >> focus on fixing things around the edge that were mistakes (e.g. exposing >> Guava and Akka) rather than any overhaul that could fragment the community. >> Ideally a major release is a lightweight process we can do every couple of >> years, with minimal impact for users. >> >> - Patrick >> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas >> wrote: >>> >>> > For this reason, I would *not* propose doing major releases to break >>> > substantial APIs or perform large re-architecting that prevents users from >>> > upgrading. Spark has always had a culture of evolving architecture >>> > incrementally and making changes - and I don't think we want to change this >>> > model. >>> >>> +1 for this. The Python community went through a lot of turmoil over the >>> Python 2 -> Python 3 transition because the upgrade process was too painful >>> for too long. The Spark community will benefit greatly from our explicitly >>> looking to avoid a similar situation. >>> >>> > 3. Assembly-free distribution of Spark: don’t require building an >>> > enormous assembly jar in order to run Spark. >>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free >>> distribution means. >>> >>> Nick >>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin wrote: ...
Re: A proposal for Spark 2.0
I'm liking the way this is shaping up, and I'd summarize it this way (let me know if I'm misunderstanding or misrepresenting anything):
- New features are not at all the focus of Spark 2.0 -- in fact, a release with no new features might even be best.
- Remove deprecated API that we agree really should be deprecated.
- Fix/change publicly-visible things that anyone who has spent any time looking at already knows are mistakes or should be done better, but that can't be changed within 1.x.
Do we want to attempt anticipatory changes at all? In other words, are there things we want to do in 2.x for which we already know that we'll want to make publicly-visible changes or that, if we don't add or change it now, will fall into the "everybody knows it shouldn't be that way" category when it comes time to discuss the Spark 3.0 release? I'd be fine if we don't try at all to anticipate what is needed -- working from the premise that being forced into a 3.x release earlier than we expect would be less painful than trying to back out a mistake made at the outset of 2.0 while trying to guess what we'll need. On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin wrote: > ...
Re: A proposal for Spark 2.0
Maybe a better idea is to un-deprecate an API if it is too important to not be removed. I don't think we can drop Java 7 support. It's way too soon. On Tue, Nov 10, 2015 at 4:59 PM, Mark Hamstra wrote: > ...
Re: A proposal for Spark 2.0
Really, Sandy? "Extra consideration" even for already-deprecated API? If we're not going to remove these with a major version change, then just when will we remove them? On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza wrote: > ...
Re: A proposal for Spark 2.0
Oh and another question - should Spark 2.0 support Java 7? On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza wrote: > ...
Re: A proposal for Spark 2.0
Another +1 to Reynold's proposal. Maybe this is obvious, but I'd like to advocate against a blanket removal of deprecated / developer APIs. Many APIs can likely be removed without material impact (e.g. the SparkContext constructor that takes preferred node location data), while others likely see heavier usage (e.g. I wouldn't be surprised if mapPartitionsWithContext was baked into a number of apps) and merit a little extra consideration. Maybe also obvious, but I think a migration guide with API equivalents and the like would be incredibly useful in easing the transition. -Sandy On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin wrote: > ...
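The `mapPartitionsWithContext` case mentioned above is exactly where a migration-guide entry earns its keep: when that method was deprecated (in Spark 1.2), the suggested replacement was plain `mapPartitions` combined with `TaskContext.get()`, which exposes the running task's context through a static, thread-local accessor instead of an extra function parameter. Below is a toy, Spark-free sketch of that pattern; all names are illustrative, not Spark's actual classes.

```java
// Toy sketch (not Spark code) of the pattern behind replacing
// mapPartitionsWithContext: instead of the framework passing a context
// object as an argument to the user function, the function asks a static
// thread-local accessor for it -- the shape of TaskContext.get() in Spark.
import java.util.function.Function;

public class TaskContextSketch {
    private static final ThreadLocal<Integer> PARTITION = new ThreadLocal<>();

    // What user code calls from inside its function (cf. TaskContext.get()).
    static Integer get() { return PARTITION.get(); }

    // "Framework" side: publish the context before running the user
    // function, and clear it afterwards so the worker thread can be reused.
    static <T> T runPartition(int partitionId, Function<Integer, T> userFn) {
        PARTITION.set(partitionId);
        try {
            return userFn.apply(partitionId);
        } finally {
            PARTITION.remove();
        }
    }

    public static void main(String[] args) {
        // Old style: context arrives as the function's parameter.
        // New style: the function ignores the parameter and calls get().
        String result = runPartition(3, ignored -> "partition-" + get());
        System.out.println(result); // prints "partition-3"
    }
}
```

The upside of the accessor style is that deprecating the context-taking overload removes a whole family of near-duplicate API methods; the cost, which a migration guide should flag, is that the accessor only works on the thread the framework set it on.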
Re: A proposal for Spark 2.0
Echoing Shivaram here. I don't think it makes a lot of sense to add more features to the 1.x line. We should still do critical bug fixes though. On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > +1 > > On a related note I think making it lightweight will ensure that we > stay on the current release schedule and don't unnecessarily delay 2.0 > to wait for new features / big architectural changes. > > In terms of fixes to 1.x, I think our current policy of back-porting > fixes to older releases would still apply. I don't think developing > new features on both 1.x and 2.x makes a lot of sense as we would like > users to switch to 2.x. > > Shivaram > > On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis > wrote: > > +1 on a lightweight 2.0 > > > > What is the thinking around the 1.x line after Spark 2.0 is released? If > not > > terminated, how will we determine what goes into each major version line? > > Will 1.x only be for stability fixes? > > > > Thanks, > > Kostas > > > > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell > wrote: > >> > >> I also feel the same as Reynold. I agree we should minimize API breaks > and > >> focus on fixing things around the edge that were mistakes (e.g. exposing > >> Guava and Akka) rather than any overhaul that could fragment the > community. > >> Ideally a major release is a lightweight process we can do every couple > of > >> years, with minimal impact for users. > >> > >> - Patrick > >> > >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas > >> wrote: > >>> > >>> > For this reason, I would *not* propose doing major releases to break > >>> > substantial API's or perform large re-architecting that prevent > users from > >>> > upgrading. Spark has always had a culture of evolving architecture > >>> > incrementally and making changes - and I don't think we want to > change this > >>> > model. > >>> > >>> +1 for this. 
>>> The Python community went through a lot of turmoil over the Python 2 -> Python 3 transition because the upgrade process was too painful for too long. The Spark community will benefit greatly from our explicitly looking to avoid a similar situation.
>>>
>>> > 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
>>>
>>> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.
>>>
>>> Nick
Re: A proposal for Spark 2.0
+1 On a related note I think making it lightweight will ensure that we stay on the current release schedule and don't unnecessarily delay 2.0 to wait for new features / big architectural changes. In terms of fixes to 1.x, I think our current policy of back-porting fixes to older releases would still apply. I don't think developing new features on both 1.x and 2.x makes a lot of sense as we would like users to switch to 2.x. Shivaram On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis wrote: > +1 on a lightweight 2.0 > > What is the thinking around the 1.x line after Spark 2.0 is released? If not > terminated, how will we determine what goes into each major version line? > Will 1.x only be for stability fixes? > > Thanks, > Kostas > > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell wrote: >> >> I also feel the same as Reynold. I agree we should minimize API breaks and >> focus on fixing things around the edge that were mistakes (e.g. exposing >> Guava and Akka) rather than any overhaul that could fragment the community. >> Ideally a major release is a lightweight process we can do every couple of >> years, with minimal impact for users. >> >> - Patrick >> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas >> wrote: >>> >>> > For this reason, I would *not* propose doing major releases to break >>> > substantial API's or perform large re-architecting that prevent users from >>> > upgrading. Spark has always had a culture of evolving architecture >>> > incrementally and making changes - and I don't think we want to change >>> > this >>> > model. >>> >>> +1 for this. The Python community went through a lot of turmoil over the >>> Python 2 -> Python 3 transition because the upgrade process was too painful >>> for too long. The Spark community will benefit greatly from our explicitly >>> looking to avoid a similar situation. >>> >>> > 3. Assembly-free distribution of Spark: don’t require building an >>> > enormous assembly jar in order to run Spark. 
>>> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.
>>>
>>> Nick
Re: A proposal for Spark 2.0
It would also be good to fix API breakages introduced as part of 1.0 (where functionality is now missing), overhaul and remove all deprecated configs/features/combinations, and make the changes to the public API that have been deferred during minor releases.

Regards,
Mridul

On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin wrote:
> I’m starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature requests in this thread. Not that we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break substantial APIs or perform large re-architecting that prevents users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we’ve released many architectural changes on the 1.X line.
>
> If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment:
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user applications can use Akka (SPARK-5293). We have gotten a lot of complaints about user applications being unable to use Akka due to Spark’s dependency on Akka.
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>
> 4. Better class package structure for low-level developer APIs. In particular, we have some DeveloperApi (mostly various listener-related classes) added over the years. Some packages include only one or two public classes but a lot of private classes. A better structure is to have public classes isolated to a few public packages, and these public packages should have minimal private classes for low-level developer APIs.
>
> 5. Consolidate the task metric and accumulator APIs. Despite some subtle differences, these two are very similar but have completely different code paths.
>
> 6. Possibly make Catalyst, Dataset, and DataFrame more general by moving them to other package(s). They are already used beyond SQL, e.g. in ML pipelines, and will be used by streaming also.
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has reached end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
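To make the Guava item (APIs item 3) concrete: any public method that returns a third-party type, such as Guava's Optional, couples every caller to that library and its version. A minimal, hypothetical sketch of the alternative using the JDK's own java.util.Optional - the class and method names below are illustrative, not Spark's actual API:

```java
import java.util.Optional;

// Hypothetical stand-in for an API like JavaRDD's lookup-style methods.
// Before (conceptually): the signature exposed com.google.common.base.Optional,
// tying every caller to Spark's Guava version. Returning java.util.Optional
// instead leaks no third-party dependency to user code.
class LookupApi {
    static Optional<String> lookup(String key) {
        return "host".equals(key) ? Optional.of("localhost") : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(lookup("host").orElse("none")); // localhost
        System.out.println(lookup("port").orElse("none")); // none
    }
}
```

The signature change is the whole point: once the Guava type is off the public surface, Spark can upgrade or shade Guava internally without breaking user binaries.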
Re: A proposal for Spark 2.0
+1 on a lightweight 2.0

What is the thinking around the 1.x line after Spark 2.0 is released? If not terminated, how will we determine what goes into each major version line? Will 1.x only be for stability fixes?

Thanks,
Kostas

On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell wrote:
> I also feel the same as Reynold. I agree we should minimize API breaks and focus on fixing things around the edge that were mistakes (e.g. exposing Guava and Akka) rather than any overhaul that could fragment the community. Ideally a major release is a lightweight process we can do every couple of years, with minimal impact for users.
>
> - Patrick
>
> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas wrote:
>> > For this reason, I would *not* propose doing major releases to break substantial API's or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model.
>>
>> +1 for this. The Python community went through a lot of turmoil over the Python 2 -> Python 3 transition because the upgrade process was too painful for too long. The Spark community will benefit greatly from our explicitly looking to avoid a similar situation.
>>
>> > 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
>>
>> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.
>>
>> Nick
Re: A proposal for Spark 2.0
There's a proposal / discussion of the assembly-less distributions at https://github.com/vanzin/spark/pull/2/files / https://issues.apache.org/jira/browse/SPARK-11157.

On Tue, Nov 10, 2015 at 3:53 PM, Reynold Xin wrote:
> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas wrote:
>> > 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
>>
>> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.
>
> Right now we ship Spark using a single assembly jar, which causes a few different problems:
> - total number of classes are limited on some configurations
> - dependency swapping is harder
>
> The proposal is to just avoid a single fat jar.
Re: A proposal for Spark 2.0
On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas wrote:
> > 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
>
> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.

Right now we ship Spark using a single assembly jar, which causes a few different problems:

- the total number of classes is limited on some configurations
- dependency swapping is harder

The proposal is to just avoid a single fat jar.
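The fat-jar contrast above can be sketched in a few lines (all jar names below are made up for illustration): with an assembly, the classpath is one enormous artifact; without it, the launcher composes the classpath from individual module jars, so swapping a dependency becomes a file replacement rather than a rebuild.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

class ClasspathSketch {
    // Assembly-free style: compose the classpath from individual jars, so
    // swapping a dependency means replacing one file, not rebuilding a fat jar.
    static String buildClasspath(List<String> jars) {
        return String.join(File.pathSeparator, jars);
    }

    public static void main(String[] args) {
        // Assembly style: one enormous jar is the entire classpath.
        String assemblyClasspath = "spark-assembly-1.6.0-hadoop2.6.0.jar"; // made-up name
        // Assembly-free style: per-module jars plus dependencies, listed out.
        String classpath = buildClasspath(
            Arrays.asList("spark-core.jar", "spark-sql.jar", "guava-14.0.1.jar"));
        System.out.println(assemblyClasspath);
        System.out.println(classpath);
    }
}
```

This also illustrates the class-count problem: every class in every dependency has to fit inside the single assembly, whereas the jar list has no such bottleneck.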
Re: A proposal for Spark 2.0
I also feel the same as Reynold. I agree we should minimize API breaks and focus on fixing things around the edge that were mistakes (e.g. exposing Guava and Akka) rather than any overhaul that could fragment the community. Ideally a major release is a lightweight process we can do every couple of years, with minimal impact for users.

- Patrick

On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas wrote:
> > For this reason, I would *not* propose doing major releases to break substantial API's or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model.
>
> +1 for this. The Python community went through a lot of turmoil over the Python 2 -> Python 3 transition because the upgrade process was too painful for too long. The Spark community will benefit greatly from our explicitly looking to avoid a similar situation.
>
> > 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
>
> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.
>
> Nick
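Item 5 of the proposal (consolidating the task metric and accumulator APIs) is essentially about having a single code path for "values merged across tasks". A purely illustrative sketch of that idea - no real Spark classes are involved, and all names below are hypothetical:

```java
import java.util.function.LongBinaryOperator;

class MetricsSketch {
    // Toy accumulator: a zero value plus an associative merge function. In a
    // consolidated design, a built-in task metric such as bytesRead could be
    // modeled as just another accumulator instead of living on a separate
    // TaskMetrics code path.
    static class LongAcc {
        private long current;
        private final LongBinaryOperator merge;

        LongAcc(long zero, LongBinaryOperator merge) {
            this.current = zero;
            this.merge = merge;
        }

        void add(long v) { current = merge.applyAsLong(current, v); }

        long value() { return current; }
    }

    public static void main(String[] args) {
        // A user-defined counter and a "task metric" share one implementation.
        LongAcc recordsProcessed = new LongAcc(0L, Long::sum);
        LongAcc bytesRead = new LongAcc(0L, Long::sum);
        recordsProcessed.add(3L);
        bytesRead.add(1024L);
        bytesRead.add(2048L);
        System.out.println("records=" + recordsProcessed.value()
            + " bytes=" + bytesRead.value());
    }
}
```

With one merge abstraction, the subtle differences between the two APIs become configuration (zero value, merge function) rather than parallel class hierarchies.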
Re: A proposal for Spark 2.0
> For this reason, I would *not* propose doing major releases to break substantial API's or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model.

+1 for this. The Python community went through a lot of turmoil over the Python 2 -> Python 3 transition because the upgrade process was too painful for too long. The Spark community will benefit greatly from our explicitly looking to avoid a similar situation.

> 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.

Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.

Nick