Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
I think so ;)

Regards
JB

On 11/10/2017 09:29 AM, Reuven Lax wrote:
> Sounds good. I doubt we will have much opposition from users, in which case
> Beam 2.3.0 can deprecate Spark 1.x.
>
> On Thu, Nov 9, 2017 at 11:54 PM, Jean-Baptiste Onofré wrote:
>> Hi all,
>>
>> thanks a lot for all your feedback.
>>
>> The trend is to upgrade to Spark 2.x and drop Spark 1.x support.
>>
>> However, some of you (especially Reuven and Robert) commented that users
>> have to be pinged as well. It makes perfect sense, and it was my intention.
>>
>> I propose the following action plan:
>> - On the technical front, I currently have two private branches ready: one
>>   with both Spark 1.x & 2.x support (with a common module and three
>>   artifacts), and another with an upgrade to Spark 2.x (dropping 1.x). I
>>   will merge the latter in the PR.
>> - I will forward the vote e-mail to the user mailing list; hopefully we
>>   will get user feedback.
>>
>> Thanks again,
>> Regards
>> JB
>>
>> On 11/08/2017 08:27 AM, Jean-Baptiste Onofré wrote:
>>> Hi all,
>>>
>>> as you might know, we are working on Spark 2.x support in the Spark
>>> runner.
>>>
>>> I'm working on a PR about that:
>>>
>>> https://github.com/apache/beam/pull/3808
>>>
>>> Today, we have something working with both Spark 1.x and 2.x from a code
>>> standpoint, but I still have to deal with dependencies. It's the first
>>> step of the update, as I'm still using RDDs; the second step would be to
>>> support DataFrames (but for that I would need PCollection elements with
>>> schemas, which is another topic that Eugene, Reuven and I are discussing).
>>>
>>> However, as all major distributions now ship Spark 2.x, I don't think
>>> it's required anymore to support Spark 1.x.
>>>
>>> If we agree, I will update and clean up the PR to support and focus on
>>> Spark 2.x only.
>>>
>>> So, that's why I'm calling for a vote:
>>>
>>> [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>>> [ ] 0 (I don't care ;))
>>> [ ] -1, I would like to still support Spark 1.x, and so have support for
>>>     both Spark 1.x and 2.x (please provide a specific comment)
>>>
>>> This vote is open for 48 hours (I have the commits ready, just waiting
>>> for the end of the vote to push to the PR).
>>>
>>> Thanks!
>>> Regards
>>> JB
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
Sounds good. I doubt we will have much opposition from users, in which case Beam 2.3.0 can deprecate Spark 1.x.

On Thu, Nov 9, 2017 at 11:54 PM, Jean-Baptiste Onofré wrote:
> Hi all,
>
> thanks a lot for all your feedback.
Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
Hi all,

thanks a lot for all your feedback.

The trend is to upgrade to Spark 2.x and drop Spark 1.x support.

However, some of you (especially Reuven and Robert) commented that users have to be pinged as well. It makes perfect sense, and it was my intention.

I propose the following action plan:
- On the technical front, I currently have two private branches ready: one with both Spark 1.x & 2.x support (with a common module and three artifacts), and another with an upgrade to Spark 2.x (dropping 1.x). I will merge the latter in the PR.
- I will forward the vote e-mail to the user mailing list; hopefully we will get user feedback.

Thanks again,
Regards
JB

On 11/08/2017 08:27 AM, Jean-Baptiste Onofré wrote:
> Hi all,
>
> as you might know, we are working on Spark 2.x support in the Spark runner.
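[Editor's note: to make the module split JB describes concrete, here is a minimal sketch of what picking a Spark version could look like from a user's pom.xml. The artifact IDs below are hypothetical; the actual names depend on how the PR lands.]

```xml
<!-- Hypothetical sketch: Beam Spark runner split into version-specific
     artifacts that share a common module. Artifact IDs are illustrative. -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <!-- A user on a Spark 1.x cluster would pick the spark1 artifact... -->
  <artifactId>beam-runners-spark1</artifactId>
  <version>2.3.0</version>
</dependency>

<!-- ...while a user on Spark 2.x would declare the spark2 artifact instead:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-spark2</artifactId>
  <version>2.3.0</version>
</dependency>
-->
```

Either way, the pipeline code itself would stay the same; only the declared runner dependency (and the Spark dependencies it pulls in transitively) changes.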
Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
On Thu, Nov 9, 2017 at 11:05 AM, Kenneth Knowles wrote:
> I think it makes sense to communicate with email to users@ and in the
> release notes of 2.2.0.

Totally agree.

> That communication should be specific and indicate whether we are planning
> to merely not work on it anymore or actually remove it in 2.3.0.

There seems to be some ambiguity in this vote about which of these two options we're actually considering. I'm certainly +1 on relegating it to maintenance mode at least.

I don't have a good sense of the burden of keeping it around, nor of the number of potential (current?) users we'd be alienating, which seem to be the driving factors. The fact that all major distributions ship 2.x is a very different question from whether most users have migrated to 2.x.

On Thu, Nov 9, 2017 at 6:35 AM, Amit Sela wrote:
> +1 for dropping Spark 1 support.
Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
+1 from me. However, let's notify users@ first. If we do get a lot of pushback from users (which I doubt we will), we might reconsider dropping Spark 1 support.

On Thu, Nov 9, 2017 at 11:05 AM, Kenneth Knowles wrote:
> +1 from me, with a friendly deprecation process
Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
+1 from me, with a friendly deprecation process.

I am convinced by the following:

- We don't have the resources to make both great, and anyhow it isn't worth it
- People keeping up with Beam releases are likely to be keeping up with Spark as well
- Spark 1 users already have a Spark 1 runner for Beam and can keep using it (and we don't actually lose the ability to update it in a pinch)
- Key features like portability (hence Python) will take some time, so we should definitely not waste effort building that feature with Spark 1 in mind

I think it makes sense to communicate with email to users@ and in the release notes of 2.2.0. That communication should be specific and indicate whether we are planning to merely not work on it anymore or actually remove it in 2.3.0.

Kenn

On Thu, Nov 9, 2017 at 6:35 AM, Amit Sela wrote:
> +1 for dropping Spark 1 support.
Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
+1 for dropping Spark 1 support.

I don't think we have enough users to justify supporting both, and it's been a long time since this idea originally came up (when Spark 2 wasn't stable); now Spark 2 is standard in all Hadoop distros.

As for switching to the DataFrame API: as long as Spark 2 doesn't support scanning through the state periodically (even if there is no data for a key), watermarks won't fire for keys that didn't see updates.

On Thu, Nov 9, 2017 at 9:12 AM Thomas Weise wrote:
> +1 (non-binding) for dropping 1.x support
Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
+1 (non-binding) for dropping 1.x support.

I don't have the impression that there is significant adoption for Beam on Spark 1.x. A stronger Spark runner that works well on 2.x will be better for Beam adoption than a runner that has to compromise due to 1.x baggage. Development efforts can go into improving the runner.

Thanks,
Thomas

On Thu, Nov 9, 2017 at 4:08 AM, Srinivas Reddy wrote:
> +1
Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
+1

--
Srinivas Reddy

http://mrsrinivas.com/

(Sent via gmail web)

On 8 November 2017 at 14:27, Jean-Baptiste Onofré wrote:
> Hi all,
>
> as you might know, we are working on Spark 2.x support in the Spark runner.
Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
+1 for the move to Spark 2, modulo notifying users and deciding on support.

I agree that having compatibility for both versions of Spark is desirable, but I am not sure it is worth the effort. Apart from the reasons mentioned by Holden and Pei, I will add that the burden of simultaneous maintenance could be bigger than the return, and also that most Big Data/Cloud distributions have already moved to Spark 2, so it makes sense to prioritize new users over legacy ones, in particular if we consider that Beam is a 'recent' project.

We can announce the end of support for Spark 1 in the release notes of Beam 2.2 and decide whether we will support it in maintenance mode; in that case we would backport or fix any reported issue related to the Spark 1 runner on the 2.2.x branch, let's say for a year, but we won't add new functionality. Or we can just decide not to support it anymore and encourage users to move to Spark 2.

On Thu, Nov 9, 2017 at 6:59 AM, Pei HE wrote:
> On Thu, Nov 9, 2017 at 1:45 PM, Holden Karau wrote:
>> On Wed, Nov 8, 2017 at 9:30 PM, NerdyNick wrote:
>>> I don't know if ditching Spark 1 outright right now would be a great move,
>>> given that a lot of the main support applications around Spark haven't
>>> yet fully moved to Spark 2, let alone have support for having a cluster
>>> with both. Oozie, for example, is still pre-stable-release for its Spark 1
>>> support and can't handle a cluster with mixed Spark versions. I think
>>> maybe doing as suggested above with the common, spark1, spark2 packaging
>>> might be best during this carry-over phase. Maybe even just flagging
>>> Spark 1 as deprecated and only maintained might be enough.
>>>
>>> On Wed, Nov 8, 2017 at 10:25 PM, Holden Karau wrote:
>>>> Also, upgrading Spark 1 to 2 is generally easier than changing JVM
>>>> versions. For folks using YARN or the hosted environments it's pretty
>>>> much trivial, since you can effectively have distinct Spark clusters
>>>> for each job.
>>>>
>>>> On Wed, Nov 8, 2017 at 9:19 PM, Holden Karau wrote:
>>>>> I'm +1 on dropping Spark 1. There are a lot of exciting improvements
>>>>> in Spark 2, and trying to write efficient code that runs on both
>>>>> Spark 1 and Spark 2 is super painful in the long term. It would be one
>>>>> thing if there were a lot of people available to work on the Spark
>>>>> runners, but it seems like our energy would be better spent focusing
>>>>> on the future.
>>>>>
>>>>> I don't know a lot of folks who are stuck on Spark 1, and the few that
>>>>> I know are planning to migrate in the next few months anyway.
>>>>>
>>>>> Note: this is a non-binding vote as I'm not a committer or PMC member.
>>>>>
>>>>> On Wed, Nov 8, 2017 at 3:43 AM, Ted Yu wrote:
>>>>>> Having both Spark1 and Spark2 modules would benefit a wider user base.
>>>>>>
>>>>>> I would vote for that.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>> Hi Robert,
>>>>>>>
>>>>>>> Thanks for your feedback!
>>>>>>>
>>>>>>> From a user perspective, with the current state of the PR, the same
>>>>>>> pipelines can run on both Spark 1.x and 2.x: the only difference is
>>>>>>> the dependency set.
>>>>>>>
>>>>>>> I'm calling the vote to get such feedback: if we consider that
>>>>>>> Spark 1.x still needs to be supported, no problem, I will improve the
>>>>>>> PR to have three modules (common, spark1, spark2) and let users pick
>>>>>>> the desired version.
>>>>>>>
>>>>>>> Let's wait a bit for other feedback; I will update the PR accordingly.
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
>>>>>>>> I'm generally a -0.5 on this change, or at least doing so hastily.
Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
+1 on moving forward with Spark 2.x only. Spark 1 users can still use the already released Spark runners, and we can support them with minor-version releases for future bug fixes.

I don't see how important it is to make future Beam releases available to Spark 1 users. If they choose not to upgrade their Spark clusters, maybe they don't need the newest Beam releases either.

I think it is more important to 1) be able to leverage new features in Spark 2.x, and 2) extend the user base to Spark 2.
--
Pei

On Thu, Nov 9, 2017 at 1:45 PM, Holden Karau wrote:
> That's a good point about Oozie only supporting Spark 1 or 2 at a time on a
> cluster -- but do we know people using Oozie and Spark 1 that would still be
> using Spark 1 by the time of the next Beam release? The last Spark 1 release
> was a year ago (and the last non-maintenance release almost 20 months ago).
>
> On Wed, Nov 8, 2017 at 9:30 PM, NerdyNick wrote:
>> On Wed, Nov 8, 2017 at 10:25 PM, Holden Karau wrote:
>>> On Wed, Nov 8, 2017 at 9:19 PM, Holden Karau wrote:
>>>> On Wed, Nov 8, 2017 at 3:43 AM, Ted Yu wrote:
>>>>> On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>> On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
>>>>>>> I'm generally a -0.5 on this change, or at least doing so hastily.
>>>>>>>
>>>>>>> As with dropping Java 7 support, I think this should at least be
>>>>>>> announced in the release notes, saying that we're considering
>>>>>>> dropping support in the subsequent release, as this dev list likely
>>>>>>> does not reach a substantial portion of the userbase.
>>>>>>>
>>>>>>> How much work is it to move from a Spark 1.x cluster to a Spark 2.x
>>>>>>> cluster? I get the feeling it's not nearly as transparent as
>>>>>>> upgrading Java versions. Can Spark 1.x pipelines be run on Spark 2.x
>>>>>>> clusters, or is a new cluster (and/or upgrading all pipelines)
>>>>>>> required (e.g. for those who operate Spark clusters shared among
>>>>>>> their many users)?
>>>>>>>
>>>>>>> Looks like the latest release of Spark 1.x was about a year ago,
>>>>>>> overlapping a bit with the 2.x series, which is coming up on 1.5
>>>>>>> years old, so I could see a lot of people still using 1.x even if
>>>>>>> 2.x is clearly the future. But it sure doesn't seem very backwards
>>>>>>> compatible.
>>>>>>>
>>>>>>> Mostly I'm not comfortable with dropping 1.x in the same release as
That's a good point about Oozie only supporting Spark 1 or 2 at a time on a cluster -- but do we know of people using Oozie and Spark 1 who would still be on Spark 1 by the time of the next Beam release? The last Spark 1 release was a year ago (and the last non-maintenance release almost 20 months ago).

On Wed, Nov 8, 2017 at 9:30 PM, NerdyNick wrote:
I don't know if ditching Spark 1 outright right now would be a great move, given that a lot of the main supporting applications around Spark haven't fully moved to Spark 2 yet, let alone support having a cluster with both. Oozie, for example, is still pre-stable-release for its Spark 1 support and can't handle a cluster with mixed Spark versions. I think doing as suggested above, with the common, spark1, spark2 packaging, might be best during this carry-over phase. Maybe even just flagging Spark 1 as deprecated and maintenance-only would be enough.

On Wed, Nov 8, 2017 at 10:25 PM, Holden Karau wrote:
Also, upgrading Spark 1 to 2 is generally easier than changing JVM versions. For folks using YARN or the hosted environments it's pretty much trivial, since you can effectively have distinct Spark clusters for each job.

On Wed, Nov 8, 2017 at 9:19 PM, Holden Karau wrote:
I'm +1 on dropping Spark 1. There are a lot of exciting improvements in Spark 2, and trying to write efficient code that runs on both Spark 1 and Spark 2 is super painful in the long term. It would be one thing if there were a lot of people available to work on the Spark runners, but it seems like our energy would be better spent focusing on the future.

I don't know a lot of folks who are stuck on Spark 1, and the few that I know are planning to migrate in the next few months anyway.

Note: this is a non-binding vote as I'm not a committer or PMC member.

--
Twitter: https://twitter.com/holdenkarau

On Wed, Nov 8, 2017 at 3:43 AM, Ted Yu wrote:
Having both Spark1 and Spark2 modules would benefit a wider user base.

I would vote for that.

Cheers

On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré wrote:
Hi Robert,

Thanks for your feedback!

From a user perspective, with the current state of the PR, the same pipelines can run on both Spark 1.x and 2.x: the only difference is the set of dependencies.

I'm calling the vote to get exactly this kind of feedback: if we consider that Spark 1.x still needs to be supported, no problem, I will improve the PR to have three modules (common, spark1, spark2) and let users pick the desired version.

Let's wait a bit for other feedback; I will update the PR accordingly.

Regards
JB

On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
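Under the three-module option JB describes, a pipeline would pick its Spark major version purely through its build dependencies. A minimal Maven sketch of what that could look like, with artifact IDs that are assumptions for illustration (the thread does not fix the final coordinates):

```xml
<!-- Illustrative only: artifact IDs and versions are hypothetical,
     not the coordinates the Beam project actually ships. -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <!-- Swap for a spark1 artifact to stay on a Spark 1.x cluster. -->
  <artifactId>beam-runners-spark2</artifactId>
  <version>${beam.version}</version>
</dependency>
<dependency>
  <!-- Spark itself stays "provided": the cluster supplies it at runtime. -->
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.2.0</version>
  <scope>provided</scope>
</dependency>
```

The shared `common` module would carry the version-independent runner code, so the pipeline code itself stays unchanged across the two dependency sets.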
I'm generally a -0.5 on this change, or at least on doing so hastily.

As with dropping Java 7 support, I think we should at least announce in the release notes that we're considering dropping support in the subsequent release, as this dev list likely does not reach a substantial portion of the userbase.

How much work is it to move from a Spark 1.x cluster to a Spark 2.x cluster? I get the feeling it's not nearly as transparent as upgrading Java versions. Can Spark 1.x pipelines be run on Spark 2.x clusters, or is a new cluster (and/or upgrading all pipelines) required (e.g. for those who operate Spark clusters shared among their many users)?

Looks like the latest release of Spark 1.x was about a year ago, overlapping a bit with the 2.x series, which is coming up on 1.5 years old, so I could see a lot of people still using 1.x even if 2.x is clearly the future. But it sure doesn't seem very backwards compatible.

Mostly I'm not comfortable with dropping 1.x in the same release as adding support for 2.x, giving no transition period, but I could be convinced if this transition is mostly a no-op or no one's still using 1.x. If there are non-trivial code complexity issues, I would perhaps revisit the idea of having a single Spark runner that chooses the backend implicitly, in favor of simply having two runners which share the code that's easy to share and diverge otherwise (which seems like it would be much simpler both to implement and to explain to users). I would be OK with even letting the Spark 1.x runner be somewhat stagnant (e.g. few or no new features) until we decide we can kill it off.

On Tue, Nov 7, 2017 at 11:27 PM, Jean-Baptiste Onofré wrote: