Having both Spark1 and Spark2 modules would benefit a wider user base. I would vote for that.
Cheers

On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi Robert,
>
> Thanks for your feedback!
>
> From a user perspective, with the current state of the PR, the same
> pipelines can run on both Spark 1.x and 2.x: the only difference is the
> dependency set.
>
> I'm calling the vote to get such kind of feedback: if we consider that
> Spark 1.x still needs to be supported, no problem, I will improve the PR
> to have three modules (common, spark1, spark2) and let users pick the
> desired version.
>
> Let's wait a bit for other feedback, and I will update the PR accordingly.
>
> Regards
> JB
>
>
> On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
>
>> I'm generally a -0.5 on this change, or at least on doing so hastily.
>>
>> As with dropping Java 7 support, I think this should at least be
>> announced in the release notes, saying that we're considering dropping
>> support in the subsequent release, as this dev list likely does not
>> reach a substantial portion of the user base.
>>
>> How much work is it to move from a Spark 1.x cluster to a Spark 2.x
>> cluster? I get the feeling it's not nearly as transparent as upgrading
>> Java versions. Can Spark 1.x pipelines be run on Spark 2.x clusters,
>> or is a new cluster (and/or upgrading all pipelines) required (e.g.
>> for those who operate Spark clusters shared among their many users)?
>>
>> Looks like the latest release of Spark 1.x was about a year ago,
>> overlapping a bit with the 2.x series, which is coming up on 1.5 years
>> old, so I could see a lot of people still using 1.x even if 2.x is
>> clearly the future. But it sure doesn't seem very backwards
>> compatible.
>>
>> Mostly I'm not comfortable with dropping 1.x in the same release as
>> adding support for 2.x, giving no transition period, but could be
>> convinced if this transition is mostly a no-op or no one's still using
>> 1.x. If there are non-trivial code complexity issues, I would perhaps
>> revisit the issue of having a single Spark runner that chooses the
>> backend implicitly, in favor of simply having two runners which share
>> the code that's easy to share and diverge otherwise (which seems it
>> would be much simpler both to implement and to explain to users). I
>> would be OK with even letting the Spark 1.x runner be somewhat
>> stagnant (e.g. few or no new features) until we decide we can kill it
>> off.
>>
>> On Tue, Nov 7, 2017 at 11:27 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>>> Hi all,
>>>
>>> As you might know, we are working on Spark 2.x support in the Spark
>>> runner.
>>>
>>> I'm working on a PR about that:
>>>
>>> https://github.com/apache/beam/pull/3808
>>>
>>> Today, we have something working with both Spark 1.x and 2.x from a
>>> code standpoint, but I have to deal with dependencies. It's the first
>>> step of the update as I'm still using RDDs; the second step would be
>>> to support DataFrames (but for that, I would need PCollection
>>> elements with schemas, which is another topic Eugene, Reuven and I
>>> are discussing).
>>>
>>> However, as all major distributions now ship Spark 2.x, I don't think
>>> it's required anymore to support Spark 1.x.
>>>
>>> If we agree, I will update and clean up the PR to only support and
>>> focus on Spark 2.x.
>>>
>>> So, that's why I'm calling for a vote:
>>>
>>> [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>>> [ ] 0 (I don't care ;))
>>> [ ] -1, I would like to still support Spark 1.x, and so have support
>>> of both Spark 1.x and 2.x (please provide a specific comment)
>>>
>>> This vote is open for 48 hours (I have the commits ready, just
>>> waiting for the end of the vote to push them to the PR).
>>>
>>> Thanks!
>>> Regards
>>> JB
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
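[Editor's note] To illustrate the point JB makes above ("the same pipelines can run on both Spark 1.x and 2.x: the only difference is the dependency set"), here is a minimal sketch of a Beam pipeline submitted to the Spark runner. It assumes the standard Beam Java SDK and Spark runner APIs; the class name and the input/output paths are placeholders. Nothing in this code refers to a Spark major version, so under the approach described in the thread, switching between Spark 1.x and 2.x would come down to which Spark runner dependency is on the classpath, not to changes in this code.

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;

public class CountLinesOnSpark {  // illustrative example class
  public static void main(String[] args) {
    // Select the Spark runner; the pipeline below is unaware of the
    // Spark major version it will eventually run on.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation()
            .as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);

    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("/tmp/input.txt"))   // placeholder path
        .apply("CountPerLine", Count.<String>perElement())
        .apply("Format", MapElements.via(
            new SimpleFunction<KV<String, Long>, String>() {
              @Override
              public String apply(KV<String, Long> kv) {
                return kv.getKey() + ": " + kv.getValue();
              }
            }))
        .apply("WriteCounts", TextIO.write().to("/tmp/counts")); // placeholder path

    p.run().waitUntilFinish();
  }
}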