I'm generally a -0.5 on this change, or at least on doing it hastily. As with dropping Java 7 support, I think we should at least announce in the release notes that we're considering dropping support in a subsequent release, since this dev list likely doesn't reach a substantial portion of the user base.
How much work is it to move from a Spark 1.x cluster to a Spark 2.x cluster? I get the feeling it's not nearly as transparent as upgrading Java versions. Can Spark 1.x pipelines be run on Spark 2.x clusters, or is a new cluster (and/or upgrading all pipelines) required (e.g. for those who operate Spark clusters shared among many users)? It looks like the latest release of Spark 1.x was about a year ago, overlapping a bit with the 2.x series, which is coming up on 1.5 years old, so I could see a lot of people still using 1.x even if 2.x is clearly the future. But it sure doesn't seem very backwards compatible.

Mostly, I'm not comfortable with dropping 1.x support in the same release that adds 2.x support, giving no transition period, but I could be convinced if the transition is mostly a no-op or if no one is still using 1.x.

If there are non-trivial code-complexity issues, I would perhaps revisit the idea of a single Spark runner that chooses the backend implicitly, in favor of simply having two runners that share the code that's easy to share and diverge otherwise (which seems much simpler both to implement and to explain to users). I would be OK with even letting the Spark 1.x runner be somewhat stagnant (e.g. few or no new features) until we decide we can kill it off.

On Tue, Nov 7, 2017 at 11:27 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi all,
>
> As you might know, we are working on Spark 2.x support in the Spark runner.
>
> I'm working on a PR about that:
>
> https://github.com/apache/beam/pull/3808
>
> Today, we have something working with both Spark 1.x and 2.x from a code
> standpoint, but I have to deal with dependencies. It's the first step of
> the update, as I'm still using RDD; the second step would be to support
> dataframe (but for that, I would need PCollection elements with schemas,
> which is another topic that Eugene, Reuven and I are discussing).
>
> However, as all major distributions now ship Spark 2.x, I don't think it's
> required anymore to support Spark 1.x.
>
> If we agree, I will update and clean up the PR to only support and focus
> on Spark 2.x.
>
> So, that's why I'm calling for a vote:
>
> [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
> [ ] 0 (I don't care ;))
> [ ] -1, I would like to still support Spark 1.x, and so have support for
>     both Spark 1.x and 2.x (please provide a specific comment)
>
> This vote is open for 48 hours (I have the commits ready, just waiting for
> the end of the vote to push to the PR).
>
> Thanks!
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com