Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Just want to provide a quick update that we have submitted the "Spark Extras" proposal for review by the Apache board (see the link below for the contents):

https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing

Note that we are on a quest for a project name that does not have "Spark" in it, and we will provide an update here when we find a suitable one. Suggestions are welcome (please send them directly to my inbox to avoid flooding the mailing list).

Thanks

On Sun, Apr 17, 2016 at 9:16 AM, Luciano Resende wrote:

> [...]
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Evan,

As long as you meet the criteria we discussed on this thread, you are welcome to join. Having said that, I have already seen other contributors who are very active on some of the connectors but are not Apache committers yet; I wanted to be fair, and also to avoid using the project as an avenue to bring new committers to Apache.

On Sun, Apr 17, 2016 at 10:07 PM, Evan Chan wrote:

> Hi Luciano,
>
> I see that you are inviting all the Spark committers to this new project. What about the chief maintainers of important Spark ecosystem projects, which are not on the Spark PMC?
>
> For example, I am the chief maintainer of the Spark Job Server, which is one of the most active projects in the larger Spark ecosystem. Would projects like this be part of your vision? If so, it would be a good step of faith to reach out to us that maintain the active ecosystem projects. (I’m not saying you should put me in :) but rather suggesting that if this is your aim, it would be good to reach out beyond just the Spark PMC members.
>
> thanks,
> Evan
>
> On Apr 17, 2016, at 9:16 AM, Luciano Resende wrote:
>
> [...]
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Sat, Apr 16, 2016 at 11:12 PM, Reynold Xin wrote:

> First, really thank you for leading the discussion.
>
> I am concerned that it'd hurt Spark more than it helps. As many others have pointed out, this unnecessarily creates a new tier of connectors or 3rd-party libraries appearing to be endorsed by the Spark PMC or the ASF. We can alleviate this concern by not having "Spark" in the name, and the project proposal and documentation should label clearly that this is not affiliated with Spark.

I really thought we could use the Spark name (e.g. similar to spark-packages), as this project is really aligned with and dedicated to curating extensions to Apache Spark; that's why we were inviting Spark PMC members to join the new project PMC, so that Apache Spark has the necessary oversight and influence on the project direction. I understand folks have concerns with the name, so we will start looking into name alternatives unless there is some way I can address the community's concerns around this.

> Also Luciano - assuming you are interested in creating a project like this and finding a home for the connectors that were removed, I find it surprising that few of the initially proposed PMC members have actually contributed much to the connectors, and people that have contributed a lot were left out. I am sure that is just an oversight.

Reynold, thanks for your concern; we are not leaving anyone out. We used the following criteria to identify the initial PMC/committer list, as described in the first e-mail on this thread:

- Spark committers and Apache Members can request to participate as PMC members
- All active Spark committers (committed in the last year) will have write access to the project (committer access)
- Other committers can request to become committers.
- Non-committers would be added based on meritocracy after the start of the project.
Based on these criteria, all people who have expressed interest in joining the project PMC have been added to it, but I don't feel comfortable adding names to it at will. I have updated the list of committers, and currently we have the following on the draft proposal:

Initial PMC

- Luciano Resende (lresende AT apache DOT org) (Apache Member)
- Chris Mattmann (mattmann AT apache DOT org) (Apache Member, Apache board member)
- Steve Loughran (stevel AT apache DOT org) (Apache Member)
- Jean-Baptiste Onofré (jbonofre AT apache DOT org) (Apache Member)
- Marcelo Masiero Vanzin (vanzin AT apache DOT org) (Apache Spark committer)
- Sean R. Owen (srowen AT apache DOT org) (Apache Member and Spark PMC)
- Mridul Muralidharan (mridulm80 AT apache DOT org) (Apache Spark PMC)

Initial Committers (write access for active Spark committers that have committed in the last year)

- Andy Konwinski (andrew AT apache DOT org) (Apache Spark)
- Andrew Or (andrewor14 AT apache DOT org) (Apache Spark)
- Ankur Dave (ankurdave AT apache DOT org) (Apache Spark)
- Davies Liu (davies AT apache DOT org) (Apache Spark)
- DB Tsai (dbtsai AT apache DOT org) (Apache Spark)
- Haoyuan Li (haoyuan AT apache DOT org) (Apache Spark)
- Ram Sriharsha (harsha AT apache DOT org) (Apache Spark)
- Herman van Hövell (hvanhovell AT apache DOT org) (Apache Spark)
- Imran Rashid (irashid AT apache DOT org) (Apache Spark)
- Joseph Kurata Bradley (jkbradley AT apache DOT org) (Apache Spark)
- Josh Rosen (joshrosen AT apache DOT org) (Apache Spark)
- Kay Ousterhout (kayousterhout AT apache DOT org) (Apache Spark)
- Cheng Lian (lian AT apache DOT org) (Apache Spark)
- Mark Hamstra (markhamstra AT apache DOT org) (Apache Spark)
- Michael Armbrust (marmbrus AT apache DOT org) (Apache Spark)
- Matei Alexandru Zaharia (matei AT apache DOT org) (Apache Spark)
- Xiangrui Meng (meng AT apache DOT org) (Apache Spark)
- Prashant Sharma (prashant AT apache DOT org) (Apache Spark)
- Patrick Wendell (pwendell AT apache DOT org) (Apache Spark)
- Reynold Xin (rxin AT apache DOT org) (Apache Spark)
- Sanford Ryza (sandy AT apache DOT org) (Apache Spark)
- Kousuke Saruta (sarutak AT apache DOT org) (Apache Spark)
- Shivaram Venkataraman (shivaram AT apache DOT org) (Apache Spark)
- Tathagata Das (tdas AT apache DOT org) (Apache Spark)
- Thomas Graves (tgraves AT apache DOT org) (Apache Spark)
- Wenchen Fan (wenchen AT apache DOT org) (Apache Spark)
- Yin Huai (yhuai AT apache DOT org) (Apache Spark)
- Shixiong Zhu (zsxwing AT apache DOT org) (Apache Spark)

BTW, it would be really good to have you on the PMC as well, along with any others who volunteer based on the criteria above. May I add you as a PMC member to the new project proposal?

--
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
First, really thank you for leading the discussion.

I am concerned that it'd hurt Spark more than it helps. As many others have pointed out, this unnecessarily creates a new tier of connectors or 3rd-party libraries appearing to be endorsed by the Spark PMC or the ASF. We can alleviate this concern by not having "Spark" in the name, and the project proposal and documentation should label clearly that this is not affiliated with Spark.

Also Luciano - assuming you are interested in creating a project like this and finding a home for the connectors that were removed, I find it surprising that few of the initially proposed PMC members have actually contributed much to the connectors, and people that have contributed a lot were left out. I am sure that is just an oversight.

On Sat, Apr 16, 2016 at 10:42 PM, Luciano Resende wrote:

> [...]
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Sat, Apr 16, 2016 at 5:38 PM, Evan Chan wrote:

> Hi folks,
>
> Sorry to join the discussion late. I had a look at the design doc earlier in this thread, and it was not mentioned what types of projects are the targets of this new "spark extras" ASF umbrella.
>
> Is the desire to have a maintained set of spark-related projects that keep pace with the main Spark development schedule? Is it just for streaming connectors? What about data sources, and other important projects in the Spark ecosystem?

The proposal draft below has some more details on what types of projects, but in summary, "Spark-Extras" would be a good place for any of the components you mentioned.

https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing

> I'm worried that this would relegate spark-packages to third tier status,

Owen answered a similar question about spark-packages earlier on this thread, but while "Spark-Extras" would be a place at Apache for collaboration on the development of these extensions, they might still be published to spark-packages as the existing streaming connectors are today.

> and the promotion of a select set of committers, and the project itself, to top level ASF status (a la Arrow) would create a further split in the community.

As for the select set of committers, we have invited all Spark committers to be committers on the project, and I have updated the project proposal with the existing set of active Spark committers (that have committed in the last year).

> -Evan
>
> On Sat, Apr 16, 2016 at 4:46 AM, Steve Loughran wrote:
>
> [...]
--
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Hi folks,

Sorry to join the discussion late. I had a look at the design doc earlier in this thread, and it was not mentioned what types of projects are the targets of this new "spark extras" ASF umbrella.

Is the desire to have a maintained set of spark-related projects that keep pace with the main Spark development schedule? Is it just for streaming connectors? What about data sources, and other important projects in the Spark ecosystem?

I'm worried that this would relegate spark-packages to third-tier status, and the promotion of a select set of committers, and the project itself, to top-level ASF status (a la Arrow) would create a further split in the community.

-Evan

On Sat, Apr 16, 2016 at 4:46 AM, Steve Loughran wrote:

> [...]
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On 15/04/2016, 17:41, "Mattmann, Chris A (3980)" wrote:

> Yeah in support of this statement I think that my primary interest in this Spark Extras and the good work by Luciano here is that anytime we take bits out of a code base and “move it to GitHub” I see a bad precedent being set.
>
> Creating this project at the ASF creates a synergy between *Apache Spark* which is *at the ASF*.
>
> We welcome comments and as Luciano said, this is meant to invite and be open to those in the Apache Spark PMC to join and help.
>
> Cheers,
> Chris

As one of the people named, here's my rationale:

Throwing stuff into github creates that world of branches, and it's no longer something that can be managed through the ASF, where "managed" means governance, participation, and a release process that includes auditing dependencies, code sign-off, etc.

As an example, there's a mutant Hive JAR which Spark uses; it's something which has evolved between my repo and Patrick Wendell's, and now that Josh Rosen has taken on the bold task of "trying to move spark and twill to Kryo 3", he's going to own that code, and the reference branch will move somewhere else.

In contrast, if there was an ASF location for this, then it'd be something anyone with commit rights could maintain and publish.

(Actually, I've just realised life is hard here as the Hive JAR is a fork of ASF Hive —really the Spark branch should be a separate branch in Hive's own repo ... But the concept is the same: those bits of the codebase which are core parts of the Spark project should really live in or near it.)

If everyone on the Spark commit list gets write access to this extras repo, moving things is straightforward. Release-wise, things could/should be in sync.

If there's a risk, it's the eternal problem of the contrib/ dir: stuff ends up there that never gets maintained. I don't see that being any worse than if things were thrown to the wind of a thousand github repos: at least now there'd be a central issue-tracking location.
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Friday, April 15, 2016, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote:

> Yeah in support of this statement I think that my primary interest in this Spark Extras and the good work by Luciano here is that anytime we take bits out of a code base and “move it to GitHub” I see a bad precedent being set.

Can't agree more!

> Creating this project at the ASF creates a synergy between *Apache Spark* which is *at the ASF*.

In addition, this will give all the "goodness" of being an Apache project from a user/consumer point of view, compared to a general github project.

> We welcome comments and as Luciano said, this is meant to invite and be open to those in the Apache Spark PMC to join and help.

This would definitely be something worthwhile to explore. +1

Regards,
Mridul

> [...]
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
100% agree with Sean & Reynold's comments on this. Adding this as a TLP would just cause more confusion as to "official" endorsement. On Fri, Apr 15, 2016 at 11:50 AM, Sean Owen wrote: > On Fri, Apr 15, 2016 at 5:34 PM, Luciano Resende wrote: >> I know the name might be confusing, but I also think that the projects have >> a very big synergy, more like sibling projects, where "Spark Extras" extends >> the Spark community and develop/maintain components for, and pretty much >> only for, Apache Spark. Based on your comment above, if making the project >> "Spark-Extras" a more acceptable name, I believe this is ok as well. > > This also grants special status to a third-party project. It's not > clear this should be *the* official unofficial third-party Spark > project over some other one. If something's to be blessed, it should > be in the Spark project. > > And why isn't it in the Spark project? the argument was that these > bits were not used and pretty de minimis as code. It's not up to me or > anyone else to tell you code X isn't useful to you. But arguing X > should be a TLP asserts it is substantial and of broad interest, since > there's non-zero effort for volunteers to deal with it. I am not sure > I've heard anyone argue that -- or did I miss it? because removing > bits of unused code happens all the time and isn't a bad precedent or > even unusual. > > It doesn't actually enable any more cooperation than is already > possible with any other project (like Kafka, Mesos, etc). You can run > the same governance model anywhere you like. I realize literally being > operated under the ASF banner is something different. > > What I hear here is a proposal to make an unofficial official Spark > project as a TLP, that begins with these fairly inconsequential > extras. I question the value of that on its face. Example: what goes > into this project? deleted Spark code only? 
or is this a glorified > "contrib" folder with a lower and somehow different bar determined by > different people? > > And at that stage... is it really helping to give that special status? - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Yeah, in support of this statement: my primary interest in Spark Extras, and the good work by Luciano here, is that anytime we take bits out of a code base and “move them to GitHub” I see a bad precedent being set. Creating this project at the ASF creates synergy with *Apache Spark*, which is also *at the ASF*. We welcome comments, and as Luciano said, this is meant to invite and be open to those on the Apache Spark PMC to join and help. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++ On 4/15/16, 9:39 AM, "Luciano Resende" wrote: > > >On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger > wrote: > >Given that not all of the connectors were removed, I think this >creates a weird / confusing three tier system > >1. connectors in the official project's spark/extras or spark/external >2. connectors in "Spark Extras" >3. connectors in some random organization's github > > >Agree Cody, and I think this is one of the goals of "Spark Extras", centralize >the development of these connectors under one central place at Apache, and >that's why one of our asks is to invite the Spark PMC to continue developing >the remaining connectors > that stayed in Spark proper, in "Spark Extras". We will also discuss some > process policies on enabling lowering the bar to allow proposal of these > other github extensions to be part of "Spark Extras" while also considering a > way to move code to a maintenance > mode location. > > >-- >Luciano Resende >http://twitter.com/lresende1975 >http://lresende.blogspot.com/
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Hey Reynold, Thanks. Getting to the heart of this, I think this project would be successful if the Apache Spark PMC decided to participate and there was some overlap. As much as I think it would be great to stand up another project, the goal here from Luciano and crew (myself included) is to suggest that it’s just as easy to start an Apache Incubator project to manage “extra” pieces of Apache Spark code outside of the release cycle, and to address the other stated reasons for moving this code out of the code base. This isn’t a competing effort to some code on GitHub that was moved out of Apache source control from Apache Spark - it’s meant to be an enabler, to suggest that code could be managed here just as easily (see the difference?). Let me know what you think. Thanks, Reynold. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++ On 4/15/16, 9:47 AM, "Reynold Xin" wrote: > > > >Anybody is free and welcomed to create another ASF project, but I don't think >"Spark extras" is a good name. It unnecessarily creates another tier of code >that ASF is "endorsing". >On Friday, April 15, 2016, Mattmann, Chris A (3980) > wrote: > >Yeah in support of this statement I think that my primary interest in >this Spark Extras and the good work by Luciano here is that anytime we >take bits out of a code base and “move it to GitHub” I see a bad precedent >being set. > >Creating this project at the ASF creates a synergy between *Apache Spark* >which is *at the ASF*.
> >We welcome comments and as Luciano said, this is meant to invite and be >open to those in the Apache Spark PMC to join and help. > >Cheers, >Chris > >++ >Chris Mattmann, Ph.D. >Chief Architect >Instrument Software and Science Data Systems Section (398) >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA >Office: 168-519, Mailstop: 168-527 >Email: >chris.a.mattm...@nasa.gov >WWW: http://sunset.usc.edu/~mattmann/ >++ >Director, Information Retrieval and Data Science Group (IRDS) >Adjunct Associate Professor, Computer Science Department >University of Southern California, Los Angeles, CA 90089 USA >WWW: http://irds.usc.edu/ >++ > > > > > > > > > > >On 4/15/16, 9:39 AM, "Luciano Resende" > >wrote: > >> >> >>On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger >>> wrote: >> >>Given that not all of the connectors were removed, I think this >>creates a weird / confusing three tier system >> >>1. connectors in the official project's spark/extras or spark/external >>2. connectors in "Spark Extras" >>3. connectors in some random organization's github >> >> >> >> >> >> >> >>Agree Cody, and I think this is one of the goals of "Spark Extras", >>centralize the development of these connectors under one central place at >>Apache, and that's why one of our asks is to invite the Spark PMC to continue >>developing the remaining connectors >> that stayed in Spark proper, in "Spark Extras". We will also discuss some >> process policies on enabling lowering the bar to allow proposal of these >> other github extensions to be part of "Spark Extras" while also considering >> a way to move code to a maintenance >> mode location. >> >> >> >> >>-- >>Luciano Resende >>http://twitter.com/lresende1975 >>http://lresende.blogspot.com/ >> >> >> >> > > >
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Yeah, so it’s the *Apache Spark* project. Just to clarify. Not once did you say Apache Spark below. On 4/15/16, 9:50 AM, "Sean Owen" wrote: >On Fri, Apr 15, 2016 at 5:34 PM, Luciano Resende wrote: >> I know the name might be confusing, but I also think that the projects have >> a very big synergy, more like sibling projects, where "Spark Extras" extends >> the Spark community and develop/maintain components for, and pretty much >> only for, Apache Spark. Based on your comment above, if making the project >> "Spark-Extras" a more acceptable name, I believe this is ok as well. > >This also grants special status to a third-party project. It's not >clear this should be *the* official unofficial third-party Spark >project over some other one. If something's to be blessed, it should >be in the Spark project. > >And why isn't it in the Spark project? the argument was that these >bits were not used and pretty de minimis as code. It's not up to me or >anyone else to tell you code X isn't useful to you. But arguing X >should be a TLP asserts it is substantial and of broad interest, since >there's non-zero effort for volunteers to deal with it. I am not sure >I've heard anyone argue that -- or did I miss it? because removing >bits of unused code happens all the time and isn't a bad precedent or >even unusual. > >It doesn't actually enable any more cooperation than is already >possible with any other project (like Kafka, Mesos, etc). You can run >the same governance model anywhere you like. I realize literally being >operated under the ASF banner is something different. > >What I hear here is a proposal to make an unofficial official Spark >project as a TLP, that begins with these fairly inconsequential >extras. I question the value of that on its face. Example: what goes >into this project? deleted Spark code only? or is this a glorified >"contrib" folder with a lower and somehow different bar determined by >different people? > >And at that stage... 
is it really helping to give that special status?
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
+1 Regards JB On 04/15/2016 06:41 PM, Mattmann, Chris A (3980) wrote: Yeah in support of this statement I think that my primary interest in this Spark Extras and the good work by Luciano here is that anytime we take bits out of a code base and “move it to GitHub” I see a bad precedent being set. Creating this project at the ASF creates a synergy between *Apache Spark* which is *at the ASF*. We welcome comments and as Luciano said, this is meant to invite and be open to those in the Apache Spark PMC to join and help. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++ On 4/15/16, 9:39 AM, "Luciano Resende" wrote: On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger wrote: Given that not all of the connectors were removed, I think this creates a weird / confusing three tier system 1. connectors in the official project's spark/extras or spark/external 2. connectors in "Spark Extras" 3. connectors in some random organization's github Agree Cody, and I think this is one of the goals of "Spark Extras", centralize the development of these connectors under one central place at Apache, and that's why one of our asks is to invite the Spark PMC to continue developing the remaining connectors that stayed in Spark proper, in "Spark Extras". We will also discuss some process policies on enabling lowering the bar to allow proposal of these other github extensions to be part of "Spark Extras" while also considering a way to move code to a maintenance mode location. 
-- Luciano Resende http://twitter.com/lresende1975 http://lresende.blogspot.com/ -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Fri, Apr 15, 2016 at 5:34 PM, Luciano Resende wrote: > I know the name might be confusing, but I also think that the projects have > a very big synergy, more like sibling projects, where "Spark Extras" extends > the Spark community and develop/maintain components for, and pretty much > only for, Apache Spark. Based on your comment above, if making the project > "Spark-Extras" a more acceptable name, I believe this is ok as well. This also grants special status to a third-party project. It's not clear this should be *the* official unofficial third-party Spark project over some other one. If something's to be blessed, it should be in the Spark project. And why isn't it in the Spark project? The argument was that these bits were not used and pretty de minimis as code. It's not up to me or anyone else to tell you code X isn't useful to you. But arguing X should be a TLP asserts it is substantial and of broad interest, since there's non-zero effort for volunteers to deal with it. I am not sure I've heard anyone argue that -- or did I miss it? Because removing bits of unused code happens all the time and isn't a bad precedent or even unusual. It doesn't actually enable any more cooperation than is already possible with any other project (like Kafka, Mesos, etc). You can run the same governance model anywhere you like. I realize literally being operated under the ASF banner is something different. What I hear here is a proposal to make an unofficial official Spark project as a TLP, that begins with these fairly inconsequential extras. I question the value of that on its face. Example: what goes into this project? Deleted Spark code only? Or is this a glorified "contrib" folder with a lower and somehow different bar determined by different people? And at that stage... is it really helping to give that special status?
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Anybody is free and welcomed to create another ASF project, but I don't think "Spark extras" is a good name. It unnecessarily creates another tier of code that ASF is "endorsing". On Friday, April 15, 2016, Mattmann, Chris A (3980) < chris.a.mattm...@jpl.nasa.gov> wrote: > Yeah in support of this statement I think that my primary interest in > this Spark Extras and the good work by Luciano here is that anytime we > take bits out of a code base and “move it to GitHub” I see a bad precedent > being set. > > Creating this project at the ASF creates a synergy between *Apache Spark* > which is *at the ASF*. > > We welcome comments and as Luciano said, this is meant to invite and be > open to those in the Apache Spark PMC to join and help. > > Cheers, > Chris > > ++ > Chris Mattmann, Ph.D. > Chief Architect > Instrument Software and Science Data Systems Section (398) > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA > Office: 168-519, Mailstop: 168-527 > Email: chris.a.mattm...@nasa.gov > WWW: http://sunset.usc.edu/~mattmann/ > ++ > Director, Information Retrieval and Data Science Group (IRDS) > Adjunct Associate Professor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA > WWW: http://irds.usc.edu/ > ++ > > > > > > > > > > > On 4/15/16, 9:39 AM, "Luciano Resende" > wrote: > > > > > > >On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger > >> wrote: > > > >Given that not all of the connectors were removed, I think this > >creates a weird / confusing three tier system > > > >1. connectors in the official project's spark/extras or spark/external > >2. connectors in "Spark Extras" > >3. 
connectors in some random organization's github > > > > > > > > > > > > > > > >Agree Cody, and I think this is one of the goals of "Spark Extras", > centralize the development of these connectors under one central place at > Apache, and that's why one of our asks is to invite the Spark PMC to > continue developing the remaining connectors > > that stayed in Spark proper, in "Spark Extras". We will also discuss > some process policies on enabling lowering the bar to allow proposal of > these other github extensions to be part of "Spark Extras" while also > considering a way to move code to a maintenance > > mode location. > > > > > > > > > >-- > >Luciano Resende > >http://twitter.com/lresende1975 > >http://lresende.blogspot.com/ > > > > > > > > >
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
I think this is meant to be understood as a community site, and as a directory listing pointers to third-party projects. It's not a project of its own, not part of Spark itself, and has no special status. At least, I think that's how it should be presented, and it pretty much seems to come across that way. On Fri, Apr 15, 2016 at 5:33 PM, Chris Fregly wrote: > and how does this all relate to the existing 1-and-a-half-class citizen > known as spark-packages.org? > > support for this citizen is buried deep in the Spark source (which was > always a bit odd, in my opinion): > > https://github.com/apache/spark/search?utf8=%E2%9C%93&q=spark-packages
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger wrote: > Given that not all of the connectors were removed, I think this > creates a weird / confusing three tier system > > 1. connectors in the official project's spark/extras or spark/external > 2. connectors in "Spark Extras" > 3. connectors in some random organization's github > > Agree, Cody, and I think this is one of the goals of "Spark Extras": to centralize the development of these connectors in one place at Apache. That's why one of our asks is to invite the Spark PMC to continue developing, in "Spark Extras", the remaining connectors that stayed in Spark proper. We will also discuss process policies for lowering the bar to allow these other GitHub extensions to be proposed as part of "Spark Extras", while also considering a way to move code to a maintenance-mode location. -- Luciano Resende http://twitter.com/lresende1975 http://lresende.blogspot.com/
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Fri, Apr 15, 2016 at 9:18 AM, Sean Owen wrote: > Why would this need to be an ASF project of its own? I don't think > it's possible to have a yet another separate "Spark Extras" TLP (?) > > There is already a project to manage these bits of code on Github. How > about all of the interested parties manage the code there, under the > same process, under the same license, etc? > This whole discussion started when some of the connectors were moved from Apache to GitHub, which makes a statement that the "Spark governance" of the bits is something highly valued by the community, consumers, and other companies that are consuming open source code. Being an Apache project also allows it to use the shared Apache infrastructure to run the project. > > I'm not against calling it Spark Extras myself but I wonder if that > needlessly confuses the situation. They aren't part of the Spark TLP > on purpose, so trying to give it some special middle-ground status > might just be confusing. The thing that comes to mind immediately is > "Connectors for Apache Spark", spark-connectors, etc. > > I know the name might be confusing, but I also think that the projects have a very big synergy, more like sibling projects, where "Spark Extras" extends the Spark community and develops/maintains components for, and pretty much only for, Apache Spark. Based on your comment above, if "Spark-Extras" makes for a more acceptable project name, I believe that is ok as well. I also understand that the Spark PMC might have concerns with branding, and that's why we are inviting all members of the Spark PMC to join and help oversee and manage the project.
> > On Fri, Apr 15, 2016 at 5:01 PM, Luciano Resende > wrote: > > After some collaboration with other community members, we have created a > > initial draft for Spark Extras which is available for review at > > > > > https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing > > > > We would like to invite other community members to participate in the > > project, particularly the Spark Committers and PMC (feel free to express > > interest and I will update the proposal). Another option here is just to > > give ALL Spark committers write access to "Spark Extras". > > > > > > We also have couple asks from the Spark PMC : > > > > - Permission to use "Spark Extras" as the project name. We already > checked > > this with Apache Brand Management, and the recommendation was to discuss > and > > reach consensus with the Spark PMC. > > > > - We would also want to check with the Spark PMC that, in case of > > successfully creation of "Spark Extras", if the PMC would be willing to > > continue the development of the remaining connectors that stayed in Spark > > 2.0 codebase in the "Spark Extras" project. > > > > > > Thanks in advance, and we welcome any feedback around this proposal > before > > we present to the Apache Board for consideration. > > > > > > > > On Sat, Mar 26, 2016 at 10:07 AM, Luciano Resende > > wrote: > >> > >> I believe some of this has been resolved in the context of some parts > that > >> had interest in one extra connector, but we still have a few removed, > and as > >> you mentioned, we still don't have a simple way or willingness to > manage and > >> be current on new packages like kafka. And based on the fact that this > >> thread is still alive, I believe that other community members might have > >> other concerns as well. 
> >> > >> After some thought, I believe having a separate project (what was > >> mentioned here as Spark Extras) to handle Spark Connectors and Spark > add-ons > >> in general could be very beneficial to Spark and the overall Spark > >> community, which would have a central place in Apache to collaborate > around > >> related Spark components. > >> > >> Some of the benefits on this approach > >> > >> - Enables maintaining the connectors inside Apache, following the Apache > >> governance and release rules, while allowing Spark proper to focus on > the > >> core runtime. > >> - Provides more flexibility in controlling the direction (currency) of > the > >> existing connectors (e.g. willing to find a solution and maintain > multiple > >> versions of same connectors like kafka 0.8x and 0.9x) > >> - Becomes a home for other types of Spark related connectors helping > >> expanding the community around Spark (e.g. Zeppelin see most of it's > current > >> contribution around new/enhanced connectors) > >> > >> What are some requirements for Spark Extras to be successful: > >> > >> - Be up to date with Spark Trunk APIs (based on daily CIs against > >> SNAPSHOT) > >> - Adhere to Spark release cycles (have a very little window compared to > >> Spark release) > >> - Be more open and flexible to the set of connectors it will accept and > >> maintain (e.g. also handle multiple versions like the kafka 0.9 issue we > >> have today) > >> > >> Where to start Spark Extras > >> > >> Depending on the inter
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Given that not all of the connectors were removed, I think this creates a weird / confusing three tier system 1. connectors in the official project's spark/extras or spark/external 2. connectors in "Spark Extras" 3. connectors in some random organization's github On Fri, Apr 15, 2016 at 11:18 AM, Sean Owen wrote: > Why would this need to be an ASF project of its own? I don't think > it's possible to have a yet another separate "Spark Extras" TLP (?) > > There is already a project to manage these bits of code on Github. How > about all of the interested parties manage the code there, under the > same process, under the same license, etc? > > I'm not against calling it Spark Extras myself but I wonder if that > needlessly confuses the situation. They aren't part of the Spark TLP > on purpose, so trying to give it some special middle-ground status > might just be confusing. The thing that comes to mind immediately is > "Connectors for Apache Spark", spark-connectors, etc. > > > On Fri, Apr 15, 2016 at 5:01 PM, Luciano Resende wrote: >> After some collaboration with other community members, we have created a >> initial draft for Spark Extras which is available for review at >> >> https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing >> >> We would like to invite other community members to participate in the >> project, particularly the Spark Committers and PMC (feel free to express >> interest and I will update the proposal). Another option here is just to >> give ALL Spark committers write access to "Spark Extras". >> >> >> We also have couple asks from the Spark PMC : >> >> - Permission to use "Spark Extras" as the project name. We already checked >> this with Apache Brand Management, and the recommendation was to discuss and >> reach consensus with the Spark PMC. 
>> >> - We would also want to check with the Spark PMC that, in case of >> successfully creation of "Spark Extras", if the PMC would be willing to >> continue the development of the remaining connectors that stayed in Spark >> 2.0 codebase in the "Spark Extras" project. >> >> >> Thanks in advance, and we welcome any feedback around this proposal before >> we present to the Apache Board for consideration. >> >> >> >> On Sat, Mar 26, 2016 at 10:07 AM, Luciano Resende >> wrote: >>> >>> I believe some of this has been resolved in the context of some parts that >>> had interest in one extra connector, but we still have a few removed, and as >>> you mentioned, we still don't have a simple way or willingness to manage and >>> be current on new packages like kafka. And based on the fact that this >>> thread is still alive, I believe that other community members might have >>> other concerns as well. >>> >>> After some thought, I believe having a separate project (what was >>> mentioned here as Spark Extras) to handle Spark Connectors and Spark add-ons >>> in general could be very beneficial to Spark and the overall Spark >>> community, which would have a central place in Apache to collaborate around >>> related Spark components. >>> >>> Some of the benefits on this approach >>> >>> - Enables maintaining the connectors inside Apache, following the Apache >>> governance and release rules, while allowing Spark proper to focus on the >>> core runtime. >>> - Provides more flexibility in controlling the direction (currency) of the >>> existing connectors (e.g. willing to find a solution and maintain multiple >>> versions of same connectors like kafka 0.8x and 0.9x) >>> - Becomes a home for other types of Spark related connectors helping >>> expanding the community around Spark (e.g. 
Zeppelin see most of it's current >>> contribution around new/enhanced connectors) >>> >>> What are some requirements for Spark Extras to be successful: >>> >>> - Be up to date with Spark Trunk APIs (based on daily CIs against >>> SNAPSHOT) >>> - Adhere to Spark release cycles (have a very little window compared to >>> Spark release) >>> - Be more open and flexible to the set of connectors it will accept and >>> maintain (e.g. also handle multiple versions like the kafka 0.9 issue we >>> have today) >>> >>> Where to start Spark Extras >>> >>> Depending on the interest here, we could follow the steps of (Apache >>> Arrow) and start this directly as a TLP, or start as an incubator project. I >>> would consider the first option first. >>> >>> Who would participate >>> >>> Have thought about this for a bit, and if we go to the direction of TLP, I >>> would say Spark Committers and Apache Members can request to participate as >>> PMC members, while other committers can request to become committers. Non >>> committers would be added based on meritocracy after the start of the >>> project. >>> >>> Project Name >>> >>> It would be ideal if we could have a project name that shows close ties to >>> Spark (e.g. Spark Extras or Spark Connectors) but we will need permission >>> and support from whoever is going to evaluate the project proposal (e
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
and how does this all relate to the existing 1-and-a-half-class citizen known as spark-packages.org? support for this citizen is buried deep in the Spark source (which was always a bit odd, in my opinion): https://github.com/apache/spark/search?utf8=%E2%9C%93&q=spark-packages On Fri, Apr 15, 2016 at 12:18 PM, Sean Owen wrote: > Why would this need to be an ASF project of its own? I don't think > it's possible to have a yet another separate "Spark Extras" TLP (?) > > There is already a project to manage these bits of code on Github. How > about all of the interested parties manage the code there, under the > same process, under the same license, etc? > > I'm not against calling it Spark Extras myself but I wonder if that > needlessly confuses the situation. They aren't part of the Spark TLP > on purpose, so trying to give it some special middle-ground status > might just be confusing. The thing that comes to mind immediately is > "Connectors for Apache Spark", spark-connectors, etc. > > > On Fri, Apr 15, 2016 at 5:01 PM, Luciano Resende > wrote: > > After some collaboration with other community members, we have created a > > initial draft for Spark Extras which is available for review at > > > > > https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing > > > > We would like to invite other community members to participate in the > > project, particularly the Spark Committers and PMC (feel free to express > > interest and I will update the proposal). Another option here is just to > > give ALL Spark committers write access to "Spark Extras". > > > > > > We also have couple asks from the Spark PMC : > > > > - Permission to use "Spark Extras" as the project name. We already > checked > > this with Apache Brand Management, and the recommendation was to discuss > and > > reach consensus with the Spark PMC. 
> > > > - We would also want to check with the Spark PMC that, in case of > > successfully creation of "Spark Extras", if the PMC would be willing to > > continue the development of the remaining connectors that stayed in Spark > > 2.0 codebase in the "Spark Extras" project. > > > > > > Thanks in advance, and we welcome any feedback around this proposal > before > > we present to the Apache Board for consideration. > > > > > > > > On Sat, Mar 26, 2016 at 10:07 AM, Luciano Resende > > wrote: > >> > >> I believe some of this has been resolved in the context of some parts > that > >> had interest in one extra connector, but we still have a few removed, > and as > >> you mentioned, we still don't have a simple way or willingness to > manage and > >> be current on new packages like kafka. And based on the fact that this > >> thread is still alive, I believe that other community members might have > >> other concerns as well. > >> > >> After some thought, I believe having a separate project (what was > >> mentioned here as Spark Extras) to handle Spark Connectors and Spark > add-ons > >> in general could be very beneficial to Spark and the overall Spark > >> community, which would have a central place in Apache to collaborate > around > >> related Spark components. > >> > >> Some of the benefits on this approach > >> > >> - Enables maintaining the connectors inside Apache, following the Apache > >> governance and release rules, while allowing Spark proper to focus on > the > >> core runtime. > >> - Provides more flexibility in controlling the direction (currency) of > the > >> existing connectors (e.g. willing to find a solution and maintain > multiple > >> versions of same connectors like kafka 0.8x and 0.9x) > >> - Becomes a home for other types of Spark related connectors helping > >> expanding the community around Spark (e.g. 
Zeppelin see most of it's > current > >> contribution around new/enhanced connectors) > >> > >> What are some requirements for Spark Extras to be successful: > >> > >> - Be up to date with Spark Trunk APIs (based on daily CIs against > >> SNAPSHOT) > >> - Adhere to Spark release cycles (have a very little window compared to > >> Spark release) > >> - Be more open and flexible to the set of connectors it will accept and > >> maintain (e.g. also handle multiple versions like the kafka 0.9 issue we > >> have today) > >> > >> Where to start Spark Extras > >> > >> Depending on the interest here, we could follow the steps of (Apache > >> Arrow) and start this directly as a TLP, or start as an incubator > project. I > >> would consider the first option first. > >> > >> Who would participate > >> > >> Have thought about this for a bit, and if we go to the direction of > TLP, I > >> would say Spark Committers and Apache Members can request to > participate as > >> PMC members, while other committers can request to become committers. > Non > >> committers would be added based on meritocracy after the start of the > >> project. > >> > >> Project Name > >> > >> It would be ideal if we could have a project name that shows close ties > to > >> Spark (e.g.
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Why would this need to be an ASF project of its own? I don't think it's possible to have yet another separate "Spark Extras" TLP (?) There is already a project to manage these bits of code on Github. How about all of the interested parties manage the code there, under the same process, under the same license, etc.? I'm not against calling it Spark Extras myself, but I wonder if that needlessly confuses the situation. They aren't part of the Spark TLP on purpose, so trying to give it some special middle-ground status might just be confusing. The thing that comes to mind immediately is "Connectors for Apache Spark", spark-connectors, etc.
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
After some collaboration with other community members, we have created an initial draft for Spark Extras, which is available for review at https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing We would like to invite other community members to participate in the project, particularly the Spark committers and PMC (feel free to express interest and I will update the proposal). Another option here is just to give ALL Spark committers write access to "Spark Extras". We also have a couple of asks for the Spark PMC: - Permission to use "Spark Extras" as the project name. We already checked this with Apache Brand Management, and the recommendation was to discuss and reach consensus with the Spark PMC. - We would also like to check whether, in the event of the successful creation of "Spark Extras", the PMC would be willing to continue the development of the remaining connectors that stayed in the Spark 2.0 codebase in the "Spark Extras" project. Thanks in advance, and we welcome any feedback on this proposal before we present it to the Apache Board for consideration. -- Luciano Resende http://twitter.com/lresende1975 http://lresende.blogspot.com/
Re: SPARK-13843 and future of streaming backends
Are you talking about the group/identifier name, or the contained classes? Because there are plenty of org.apache.* classes distributed via Maven with non-Apache group/identifiers. On Fri, Mar 25, 2016 at 6:54 PM, David Nalley wrote: > >> As far as group / artifact name compatibility, at least in the case of >> Kafka we need different artifact names anyway, and people are going to >> have to make changes to their build files for spark 2.0 anyway. As >> far as keeping the actual classes in org.apache.spark to not break >> code despite the group name being different, I don't know whether that >> would be enforced by maven central, just looked at as poor taste, or >> ASF suing for trademark violation :) > > > Sonatype has strict instructions to only permit org.apache.* to originate > from repository.apache.org. Exceptions to that must be approved by VP, > Infrastructure. > -- > Sent via Pony Mail for dev@spark.apache.org. > View this email online at: > https://pony-poc.apache.org/list.html?dev@spark.apache.org > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org
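[Editor's note: the distinction being discussed can be sketched in build-tool terms. The restriction David describes applies to the Maven groupId an artifact is published under, not to the package names of the classes inside the jar. The coordinates below are invented purely for illustration and are not real published artifacts; trademark and policy questions are a separate matter.]

```scala
// build.sbt fragment -- a sketch with hypothetical coordinates.
// The groupId here ("org.spark-extras-project") is made up; only groupIds
// under org.apache are restricted to releases from repository.apache.org.
libraryDependencies += "org.spark-extras-project" %% "spark-streaming-mqtt" % "2.0.0"

// A jar published this way could, in principle, still contain classes in
// the old package, so existing user code would keep compiling unchanged:
// import org.apache.spark.streaming.mqtt.MQTTUtils
```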
Re: SPARK-13843 and future of streaming backends
On Saturday, March 26, 2016, Sean Owen wrote: > This has been resolved; see the JIRA and related PRs but also > > http://apache-spark-developers-list.1001551.n3.nabble.com/SPARK-13843-Next-steps-td16783.html > > This change happened subsequent to the current thread (thanks Marcelo) and could just as well have gone unnoticed until the release vote. > This is not a scenario where a [VOTE] needs to take place, and code > changes don't proceed through PMC votes. From the project perspective, > code was deleted/retired for lack of interest, and this is controlled > by the normal lazy consensus protocol which wasn't vetoed. I have not seen Apache-owned artifacts moved out of its governance without discussion - this was not refactoring or cleanup (as was suggested disingenuously) but a migration of submodules/functionality (though from Reynold's clarification, it looks like it was for good enough reasons). A vote might or might not have been required, but a discussion must happen - at least going forward, it will help us not to miss things (the artifact and project namespace, license, ownership, release cycle, version compatibility, etc. of the sub-project could be of interest to users and developers). Regards Mridul > The subsequent discussion was in part about whether other modules > should go, or whether one should come back, which it did. The latter > suggests that change could have been left open for some discussion > longer. Ideally, you would have commented before the initial change > happened, but it sounds like several people would have liked more > time. I don't think I'd call that "improper conduct" though, no. It > was reversed via the same normal code management process. > > The rest of the question concerned what becomes of the code that was > removed. It was revived outside the project for anyone who cares to > continue collaborating. There seemed to be no disagreement about that, > mostly because the code in question was of minimal interest. PMC > doesn't need to rule on anything.
There may still be some loose ends > there like namespace changes. I'll add to the other thread about this. > > > > On Sat, Mar 26, 2016 at 1:17 PM, Jacek Laskowski > wrote: > > Hi, > > > > Although I'm not that much experienced member of ASF, I share your > > concerns. I haven't looked at the issue from this point of view, but > > after having read the thread I think PMC should've signed off the > > migration of ASF-owned code to a non-ASF repo. At least a vote is > > required (and this discussion is a sign that the process has not been > > conducted properly as people have concerns, me including). > > > > Thanks Mridul! > > > > Pozdrawiam, > > Jacek Laskowski > > > > https://medium.com/@jaceklaskowski/ > > Mastering Apache Spark http://bit.ly/mastering-apache-spark > > Follow me at https://twitter.com/jaceklaskowski > > > > > > On Thu, Mar 17, 2016 at 9:13 PM, Mridul Muralidharan > wrote: > >> I am not referring to code edits - but to migrating submodules and > >> code currently in Apache Spark to 'outside' of it. > >> If I understand correctly, assets from Apache Spark are being moved > >> out of it into thirdparty external repositories - not owned by Apache. > >> > >> At a minimum, dev@ discussion (like this one) should be initiated. > >> As PMC is responsible for the project assets (including code), signoff > >> is required for it IMO. > >> > >> More experienced Apache members might be opine better in case I got it > wrong ! > >> > >> > >> Regards, > >> Mridul > >> > >> > >> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger > wrote: > >>> Why would a PMC vote be necessary on every code deletion? > >>> > >>> There was a Jira and pull request discussion about the submodules that > >>> have been removed so far. 
> >>> > >>> https://issues.apache.org/jira/browse/SPARK-13843 > >>> > >>> There's another ongoing one about Kafka specifically > >>> > >>> https://issues.apache.org/jira/browse/SPARK-13877 > >>> > >>> > >>> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan > wrote: > > I was not aware of a discussion in Dev list about this - agree with > most of > the observations. > In addition, I did not see PMC signoff on moving (sub-)modules out. > > Regards > Mridul > > > > On Thursday, March 17, 2016, Marcelo Vanzin > wrote: > > > > Hello all, > > > > Recently a lot of the streaming backends were moved to a separate > > project on github and removed from the main Spark repo. > > > > While I think the idea is great, I'm a little worried about the > > execution. Some concerns were already raised on the bug mentioned > > above, but I'd like to have a more explicit discussion about this so > > things don't fall through the cracks. > > > > Mainly I have three concerns. > > > > i. Ownership > > > > That code used to be run by the ASF, but now it's hosted in a github > > repo owned not by the ASF. That sounds a lit
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Hi Luciano, I didn't mean Spark proper, but more something like you proposed. Regards JB
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Sat, Mar 26, 2016 at 10:20 AM, Jean-Baptiste Onofré wrote: > Hi Luciano, > > If we take the "pure" technical vision, there's pros and cons of having > spark-extra (or whatever the name we give) still as an Apache project: > > Pro: > - Governance & Quality Insurance: we follow the Apache rules, meaning > that a release has to be staged and voted by the PMC. It's a form of > governance of the project and quality (as the releases are reviewed). > - Software origin: users know where the connector comes from, and they > have the guarantee in term of licensing, etc. > - IP/ICLA: We know the committers of this project, and we know they agree > with the ICL agreement. > > Cons: > - Third licenses support. As an Apache project, the "connectors" will be > allowed to use only Apache or Category B licensed dependencies. For > instance, if I would like to create a Spark connector for couchbase, I > can't do it at Apache. > Yes, this is not solving the incompatible license problems > - Release cycle. As an Apache project, it means we have to follow the > rules, meaning that the release cycle can appear strict and long due to the > staging and vote process. For me, it's a huge benefit but some can see as > too strict ;) > IMHO, This is the small price we pay for all the good stuff you mentioned in pro > > Maybe, we can imagine both, as we have in ServiceMix or Camel: > - all modules/connectors matching the Apache rule (especially in term of > licensing) should be in the Apache Spark-Modules (or Spark-Extensions, or > whatever). It's like the ServiceMix Bundles. > If you are talking here about Spark proper, then we are currently seeing that this is going to be hard. If there was a way to have more flexibility to host these directly into Spark proper, I would never be creating this thread as we would have all the pros you mentioned hosting them directly into Spark. 
> - all modules/connectors that can't fit into the Apache rule (due to > licensing issue) can go into GitHub Spark-Extra (or Spark-Package). It's > like the ServiceMix Extra or Camel Extra on github. > > We could look into this, but it might be a "Spark Extra discussion" on how we can help foster a community around the non-compatible licensed connectors. > My $0.01. > > Regards > JB
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Hi Luciano, If we take the "pure" technical vision, there are pros and cons to having spark-extra (or whatever name we give it) still as an Apache project: Pro: - Governance & quality assurance: we follow the Apache rules, meaning that a release has to be staged and voted on by the PMC. It's a form of governance of the project and of quality (as the releases are reviewed). - Software origin: users know where the connector comes from, and they have guarantees in terms of licensing, etc. - IP/ICLA: We know the committers of this project, and we know they have agreed to the ICLA. Cons: - Third-party license support. As an Apache project, the "connectors" will be allowed to use only Apache or Category B licensed dependencies. For instance, if I would like to create a Spark connector for Couchbase, I can't do it at Apache. - Release cycle. As an Apache project, it means we have to follow the rules, meaning that the release cycle can appear strict and long due to the staging and vote process. For me, it's a huge benefit, but some can see it as too strict ;) Maybe we can imagine both, as we have in ServiceMix or Camel: - all modules/connectors matching the Apache rules (especially in terms of licensing) should be in the Apache Spark-Modules (or Spark-Extensions, or whatever). It's like the ServiceMix Bundles. - all modules/connectors that can't fit the Apache rules (due to licensing issues) can go into GitHub Spark-Extra (or Spark-Package). It's like the ServiceMix Extra or Camel Extra on GitHub. My $0.01. Regards JB
Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
I believe some of this has been resolved in the context of some parties that had interest in one extra connector, but we still have a few removed, and as you mentioned, we still don't have a simple way or the willingness to manage and stay current on new packages like kafka. And based on the fact that this thread is still alive, I believe that other community members might have other concerns as well. After some thought, I believe having a separate project (what was mentioned here as Spark Extras) to handle Spark connectors and Spark add-ons in general could be very beneficial to Spark and the overall Spark community, which would have a central place in Apache to collaborate around related Spark components. Some of the benefits of this approach: - Enables maintaining the connectors inside Apache, following the Apache governance and release rules, while allowing Spark proper to focus on the core runtime. - Provides more flexibility in controlling the direction (currency) of the existing connectors (e.g. being willing to find a solution and maintain multiple versions of the same connector, like kafka 0.8.x and 0.9.x) - Becomes a home for other types of Spark-related connectors, helping expand the community around Spark (e.g. Zeppelin sees most of its current contributions around new/enhanced connectors) Some requirements for Spark Extras to be successful: - Be up to date with Spark trunk APIs (based on daily CIs against SNAPSHOT) - Adhere to Spark release cycles (with only a very small window after each Spark release) - Be more open and flexible about the set of connectors it will accept and maintain (e.g. also handle multiple versions, like the kafka 0.9 issue we have today) Where to start Spark Extras: Depending on the interest here, we could follow the steps of Apache Arrow and start this directly as a TLP, or start as an incubator project. I would consider the first option first. Who would participate: Having thought about this for a bit, if we go in the direction of a TLP, I would say Spark committers and Apache members can request to participate as PMC members, while other committers can request to become committers. Non-committers would be added based on meritocracy after the start of the project. Project name: It would be ideal if we could have a project name that shows close ties to Spark (e.g. Spark Extras or Spark Connectors), but we will need permission and support from whoever is going to evaluate the project proposal (e.g. the Apache Board). Thoughts? Does anyone have any big disagreement or objection to moving in this direction? Otherwise, who would be interested in joining the project, so I can start working on some concrete proposal?
Re: SPARK-13843 and future of streaming backends
This has been resolved; see the JIRA and related PRs but also http://apache-spark-developers-list.1001551.n3.nabble.com/SPARK-13843-Next-steps-td16783.html This is not a scenario where a [VOTE] needs to take place, and code changes don't proceed through PMC votes. From the project perspective, code was deleted/retired for lack of interest, and this is controlled by the normal lazy consensus protocol which wasn't vetoed. The subsequent discussion was in part about whether other modules should go, or whether one should come back, which it did. The latter suggests that change could have been left open for some discussion longer. Ideally, you would have commented before the initial change happened, but it sounds like several people would have liked more time. I don't think I'd call that "improper conduct" though, no. It was reversed via the same normal code management process. The rest of the question concerned what becomes of the code that was removed. It was revived outside the project for anyone who cares to continue collaborating. There seemed to be no disagreement about that, mostly because the code in question was of minimal interest. PMC doesn't need to rule on anything. There may still be some loose ends there like namespace changes. I'll add to the other thread about this. On Sat, Mar 26, 2016 at 1:17 PM, Jacek Laskowski wrote: > Hi, > > Although I'm not that much experienced member of ASF, I share your > concerns. I haven't looked at the issue from this point of view, but > after having read the thread I think PMC should've signed off the > migration of ASF-owned code to a non-ASF repo. At least a vote is > required (and this discussion is a sign that the process has not been > conducted properly as people have concerns, me including). > > Thanks Mridul! 
> > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Thu, Mar 17, 2016 at 9:13 PM, Mridul Muralidharan wrote: >> I am not referring to code edits - but to migrating submodules and >> code currently in Apache Spark to 'outside' of it. >> If I understand correctly, assets from Apache Spark are being moved >> out of it into thirdparty external repositories - not owned by Apache. >> >> At a minimum, dev@ discussion (like this one) should be initiated. >> As PMC is responsible for the project assets (including code), signoff >> is required for it IMO. >> >> More experienced Apache members might be opine better in case I got it wrong >> ! >> >> >> Regards, >> Mridul >> >> >> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger wrote: >>> Why would a PMC vote be necessary on every code deletion? >>> >>> There was a Jira and pull request discussion about the submodules that >>> have been removed so far. >>> >>> https://issues.apache.org/jira/browse/SPARK-13843 >>> >>> There's another ongoing one about Kafka specifically >>> >>> https://issues.apache.org/jira/browse/SPARK-13877 >>> >>> >>> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan >>> wrote: I was not aware of a discussion in Dev list about this - agree with most of the observations. In addition, I did not see PMC signoff on moving (sub-)modules out. Regards Mridul On Thursday, March 17, 2016, Marcelo Vanzin wrote: > > Hello all, > > Recently a lot of the streaming backends were moved to a separate > project on github and removed from the main Spark repo. > > While I think the idea is great, I'm a little worried about the > execution. Some concerns were already raised on the bug mentioned > above, but I'd like to have a more explicit discussion about this so > things don't fall through the cracks. > > Mainly I have three concerns. > > i. 
Ownership > > That code used to be run by the ASF, but now it's hosted in a github > repo owned not by the ASF. That sounds a little sub-optimal, if not > problematic. > > ii. Governance > > Similar to the above; who has commit access to the above repos? Will > all the Spark committers, present and future, have commit access to > all of those repos? Are they still going to be considered part of > Spark and have release management done through the Spark community? > > > For both of the questions above, why are they not turned into > sub-projects of Spark and hosted on the ASF repos? I believe there is > a mechanism to do that, without the need to keep the code in the main > Spark repo, right? > > iii. Usability > > This is another thing I don't see discussed. For Scala-based code > things don't change much, I guess, if the artifact names don't change > (another reason to keep things in the ASF?), but what about python? > How are pyspark users expected to get that code going forward, since > it's not in Spa
Re: SPARK-13843 and future of streaming backends
Hi, Although I'm not that much experienced member of ASF, I share your concerns. I haven't looked at the issue from this point of view, but after having read the thread I think PMC should've signed off the migration of ASF-owned code to a non-ASF repo. At least a vote is required (and this discussion is a sign that the process has not been conducted properly as people have concerns, me including). Thanks Mridul! Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Thu, Mar 17, 2016 at 9:13 PM, Mridul Muralidharan wrote: > I am not referring to code edits - but to migrating submodules and > code currently in Apache Spark to 'outside' of it. > If I understand correctly, assets from Apache Spark are being moved > out of it into thirdparty external repositories - not owned by Apache. > > At a minimum, dev@ discussion (like this one) should be initiated. > As PMC is responsible for the project assets (including code), signoff > is required for it IMO. > > More experienced Apache members might be opine better in case I got it wrong ! > > > Regards, > Mridul > > > On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger wrote: >> Why would a PMC vote be necessary on every code deletion? >> >> There was a Jira and pull request discussion about the submodules that >> have been removed so far. >> >> https://issues.apache.org/jira/browse/SPARK-13843 >> >> There's another ongoing one about Kafka specifically >> >> https://issues.apache.org/jira/browse/SPARK-13877 >> >> >> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan >> wrote: >>> >>> I was not aware of a discussion in Dev list about this - agree with most of >>> the observations. >>> In addition, I did not see PMC signoff on moving (sub-)modules out. 
>>> >>> Regards >>> Mridul >>> >>> >>> >>> On Thursday, March 17, 2016, Marcelo Vanzin wrote: Hello all, Recently a lot of the streaming backends were moved to a separate project on github and removed from the main Spark repo. While I think the idea is great, I'm a little worried about the execution. Some concerns were already raised on the bug mentioned above, but I'd like to have a more explicit discussion about this so things don't fall through the cracks. Mainly I have three concerns. i. Ownership That code used to be run by the ASF, but now it's hosted in a github repo owned not by the ASF. That sounds a little sub-optimal, if not problematic. ii. Governance Similar to the above; who has commit access to the above repos? Will all the Spark committers, present and future, have commit access to all of those repos? Are they still going to be considered part of Spark and have release management done through the Spark community? For both of the questions above, why are they not turned into sub-projects of Spark and hosted on the ASF repos? I believe there is a mechanism to do that, without the need to keep the code in the main Spark repo, right? iii. Usability This is another thing I don't see discussed. For Scala-based code things don't change much, I guess, if the artifact names don't change (another reason to keep things in the ASF?), but what about python? How are pyspark users expected to get that code going forward, since it's not in Spark's pyspark.zip anymore? Is there an easy way of keeping these things within the ASF Spark project? I think that would be better for everybody. -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org >>> > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SPARK-13843 and future of streaming backends
> As far as group / artifact name compatibility, at least in the case of > Kafka we need different artifact names anyway, and people are going to > have to make changes to their build files for spark 2.0 anyway. As > far as keeping the actual classes in org.apache.spark to not break > code despite the group name being different, I don't know whether that > would be enforced by maven central, just looked at as poor taste, or > ASF suing for trademark violation :) Sonatype has strict instructions to only permit org.apache.* to originate from repository.apache.org. Exceptions to that must be approved by the VP, Infrastructure.
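To make that policy concrete, here is a small illustrative check; the function and its name are hypothetical shorthand for the rule as described in this thread, not any real Sonatype or ASF tooling:

```python
# Hypothetical sketch of the publishing rule described above: artifacts whose
# Maven groupId falls under org.apache.* may only reach Maven Central by
# mirroring from repository.apache.org, not by direct upload.

def may_publish_directly(coordinate: str) -> bool:
    """True if a group:artifact:version coordinate may bypass the ASF repo."""
    group_id = coordinate.split(":", 1)[0]
    return not group_id.startswith("org.apache.")

print(may_publish_directly("org.apache.spark:spark-streaming-kafka_2.10:1.6.1"))  # False
print(may_publish_directly("org.spark-project:spark-streaming-mqtt_2.10:1.6.1"))  # True
```

This is why moving the code out of the ASF while keeping the org.apache.spark groupId is a non-starter: the external project could not publish under that group at all.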
Re: SPARK-13843 and future of streaming backends
Hi Reynold, thanks for the info. On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin wrote: > If one really feels strongly that we should go through all the overhead to > setup an ASF subproject for these modules that won't work with the new > structured streaming, and want to spearhead to setup separate repos > (preferably one subproject per connector), CI, separate JIRA, governance, > READMEs, voting, we can discuss that. Until then, I'd keep the github option > open because IMHO it is what works the best for end users (including > discoverability, issue tracking, release publishing, ...). For those of us who are not exactly familiar with the inner workings of administrating ASF projects, would you mind explaining in more detail what this overhead is? From my naive point of view, when I say "sub project" I assume that it's as simple as having a separate git repo for it, tied to the same parent project. Everything else - JIRA, committers, bylaws, etc - remains the same. And since the projects we're talking about are very small, CI should be very simple (Travis?) and, assuming sporadic releases, things overall should not be that expensive to maintain. -- Marcelo
Re: SPARK-13843 and future of streaming backends
i. An ASF project can clearly decide that some of its code is no longer worth maintaining and delete it. This isn't really any different. It's still apache licensed so ultimately whoever wants the code can get it. ii. I think part of the rationale is to not tie release management to Spark, so it can proceed on a schedule that makes sense. I'm fine with helping out with release management for the Kafka subproject, for instance. I agree that practical governance questions need to be worked out. iii. How is this any different from how python users get access to any other third party Spark package? On Thu, Mar 17, 2016 at 1:14 PM, Marcelo Vanzin wrote: > Hello all, > > Recently a lot of the streaming backends were moved to a separate > project on github and removed from the main Spark repo. > > While I think the idea is great, I'm a little worried about the > execution. Some concerns were already raised on the bug mentioned > above, but I'd like to have a more explicit discussion about this so > things don't fall through the cracks. > > Mainly I have three concerns. > > i. Ownership > > That code used to be run by the ASF, but now it's hosted in a github > repo owned not by the ASF. That sounds a little sub-optimal, if not > problematic. > > ii. Governance > > Similar to the above; who has commit access to the above repos? Will > all the Spark committers, present and future, have commit access to > all of those repos? Are they still going to be considered part of > Spark and have release management done through the Spark community? > > > For both of the questions above, why are they not turned into > sub-projects of Spark and hosted on the ASF repos? I believe there is > a mechanism to do that, without the need to keep the code in the main > Spark repo, right? > > iii. Usability > > This is another thing I don't see discussed. 
For Scala-based code > things don't change much, I guess, if the artifact names don't change > (another reason to keep things in the ASF?), but what about python? > How are pyspark users expected to get that code going forward, since > it's not in Spark's pyspark.zip anymore? > > > Is there an easy way of keeping these things within the ASF Spark > project? I think that would be better for everybody. > > -- > Marcelo
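On point (iii), for context: third-party Spark packages are normally pulled in at submit time with the --packages flag. A sketch, with an illustrative coordinate and script name (not a statement of where the removed backends actually ended up):

```shell
# Compose a submit command that resolves a connector by Maven coordinate.
# Coordinate and script name are examples only.
PACKAGES="org.apache.spark:spark-streaming-kafka_2.10:1.6.1"
CMD="spark-submit --packages $PACKAGES my_streaming_job.py"
echo "$CMD"   # in real use you would run the command itself
```

The same flag works for the pyspark shell (pyspark --packages ...), which is roughly how Python users would obtain code that is no longer shipped inside pyspark.zip.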
Re: SPARK-13843 and future of streaming backends
Spark has hit one of the eternal problems of OSS projects, one hit by: ant, maven, hadoop, ... anything with a plugin model. Take in the plugin: you're in control, but also down for maintenance. Leave out the plugin: other people can maintain it, be more agile, etc. But you've lost control, and you can't even manage the links. Here I think maven suffered the most by keeping stuff in codehaus; migrating off there is still hard - not only did they lose the links: they lost the JIRA. Maven's relationship with codehaus was very tightly coupled, lots of committers on both; I don't know how that relationship was handled at a higher level. On 17 Mar 2016, at 20:51, Hari Shreedharan <hshreedha...@cloudera.com> wrote: I have worked with various ASF projects for 4+ years now. Sure, ASF projects can delete code as they feel fit. But this is the first time I have really seen code being "moved out" of a project without discussion. I am sure you can do this without violating ASF policy, but the explanation for that would be convoluted (someone decided to make a copy and then the ASF project deleted it?). +1 for discussion. Dev changes should go to the dev list; PMC for process in general. Don't think the ASF will overlook stuff like that. Might want to raise this issue on the next board report. FWIW, it may be better to just see if you can have committers to work on these projects: recruit the people and say "please, only work in this area - for now". That gets developers on your team, which is generally considered a metric of health in a project. Or, as Cody Koeninger suggests, having a spark-extras project in the ASF with a focus on extras with their own support channel. Also, moving the code out would break compatibility. AFAIK, there is no way to push org.apache.* artifacts directly to maven central. That happens via mirroring from the ASF maven repos.
Even if you could somehow directly push the artifacts to mvn, you really can push to org.apache.* groups only if you are part of the repo and acting as an agent of that project (which in this case would be Apache Spark). Once you move the code out, even a committer/PMC member would not be representing the ASF when pushing the code. I am not sure if there is a way to fix this issue. This topic has cropped up in the general context of third-party repos publishing artifacts with org.apache names but vendor-specific suffixes (e.g. org.apache.hadoop/hadoop-common.5.3-cdh.jar). Some people were pretty unhappy about this, but the conclusion reached was "maven doesn't let you do anything else and still let downstream people use it". Furthermore, as all ASF releases are nominally the source releases *not the binaries*, you can look at the POMs and say "we've released source code designed to publish artifacts to repos - this is 'use as intended'". People are also free to cut their own full project distributions, etc, etc. For example, I stick up the binaries of Windows builds independent of the ASF releases; these were originally just those from HDP on Windows installs; now I check out the commit of the specific ASF release on a Windows 2012 VM, do the build, and copy the binaries. Free for all to use. But I do suspect that the ASF legal protections get a bit blurred here. These aren't ASF binaries, but binaries built directly from unmodified ASF releases. In contrast to sticking stuff into a github repo, the moved artifacts cannot be published as org.apache artifacts on maven central. That's non-negotiable as far as the ASF is concerned. The process for releasing ASF artifacts there goes downstream of the ASF public release process: you stage the artifacts, they are part of the vote process, and everything with org.apache goes through it.
That said: there is nothing to stop a set of shell org.apache artifacts being written which do nothing but contain transitive dependencies on artifacts in different groups, such as org.spark-project. The shells would be released by the ASF; they pull in the new stuff. And, therefore, it'd be possible to build a spark-assembly with the files. (I'm ignoring a loop in the build DAG here; playing with git submodules would let someone eliminate this by adding the removed libraries under a modified project.) I think there might be some issues related to package names; you could make a case for having public APIs with the original names - they're the API, after all, and that's exactly what Apache Harmony did with the java.* packages. Thanks, Hari On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan <mri...@gmail.com> wrote: I am not referring to code edits - but to migrating submodules and code currently in Apache Spark to 'outside' of it. If I understand correctly, assets from Apache Spark are being moved out of it into thirdparty external repositories - not owned by Apache. At a minimum, dev@ discussion (like this one) should be initiated. As PMC is responsible for the project assets (in
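For what it's worth, the "shell artifact" idea above could look roughly like the following POM fragment; every coordinate here is hypothetical, chosen only to illustrate the shape:

```xml
<!-- Hypothetical ASF-released shell artifact: it contains no code of its
     own, only a transitive dependency on the connector now published
     outside the ASF under a different groupId. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-mqtt-shell_2.11</artifactId>
  <version>2.0.0</version>
  <packaging>pom</packaging>
  <dependencies>
    <dependency>
      <!-- the real code, released under a non-ASF group -->
      <groupId>org.spark-project</groupId>
      <artifactId>spark-streaming-mqtt_2.11</artifactId>
      <version>2.0.0</version>
    </dependency>
  </dependencies>
</project>
```

Because the shell itself would be voted on and released by the ASF, it could legitimately carry the org.apache groupId while the actual implementation lives elsewhere.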
Re: SPARK-13843 and future of streaming backends
I have worked with various ASF projects for 4+ years now. Sure, ASF projects can delete code as they feel fit. But this is the first time I have really seen code being "moved out" of a project without discussion. I am sure you can do this without violating ASF policy, but the explanation for that would be convoluted (someone decided to make a copy and then the ASF project deleted it?). Also, moving the code out would break compatibility. AFAIK, there is no way to push org.apache.* artifacts directly to maven central. That happens via mirroring from the ASF maven repos. Even if you could somehow directly push the artifacts to mvn, you really can push to org.apache.* groups only if you are part of the repo and acting as an agent of that project (which in this case would be Apache Spark). Once you move the code out, even a committer/PMC member would not be representing the ASF when pushing the code. I am not sure if there is a way to fix this issue. Thanks, Hari On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan wrote: > I am not referring to code edits - but to migrating submodules and > code currently in Apache Spark to 'outside' of it. > If I understand correctly, assets from Apache Spark are being moved > out of it into thirdparty external repositories - not owned by Apache. > > At a minimum, dev@ discussion (like this one) should be initiated. > As PMC is responsible for the project assets (including code), signoff > is required for it IMO. > > More experienced Apache members might be opine better in case I got it > wrong ! > > > Regards, > Mridul > > > On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger > wrote: > > Why would a PMC vote be necessary on every code deletion? > > > > There was a Jira and pull request discussion about the submodules that > > have been removed so far.
> > > > https://issues.apache.org/jira/browse/SPARK-13843 > > > > There's another ongoing one about Kafka specifically > > > > https://issues.apache.org/jira/browse/SPARK-13877 > > > > > > On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan > wrote: > >> > >> I was not aware of a discussion in Dev list about this - agree with > most of > >> the observations. > >> In addition, I did not see PMC signoff on moving (sub-)modules out. > >> > >> Regards > >> Mridul > >> > >> > >> > >> On Thursday, March 17, 2016, Marcelo Vanzin > wrote: > >>> > >>> Hello all, > >>> > >>> Recently a lot of the streaming backends were moved to a separate > >>> project on github and removed from the main Spark repo. > >>> > >>> While I think the idea is great, I'm a little worried about the > >>> execution. Some concerns were already raised on the bug mentioned > >>> above, but I'd like to have a more explicit discussion about this so > >>> things don't fall through the cracks. > >>> > >>> Mainly I have three concerns. > >>> > >>> i. Ownership > >>> > >>> That code used to be run by the ASF, but now it's hosted in a github > >>> repo owned not by the ASF. That sounds a little sub-optimal, if not > >>> problematic. > >>> > >>> ii. Governance > >>> > >>> Similar to the above; who has commit access to the above repos? Will > >>> all the Spark committers, present and future, have commit access to > >>> all of those repos? Are they still going to be considered part of > >>> Spark and have release management done through the Spark community? > >>> > >>> > >>> For both of the questions above, why are they not turned into > >>> sub-projects of Spark and hosted on the ASF repos? I believe there is > >>> a mechanism to do that, without the need to keep the code in the main > >>> Spark repo, right? > >>> > >>> iii. Usability > >>> > >>> This is another thing I don't see discussed. 
For Scala-based code > >>> things don't change much, I guess, if the artifact names don't change > >>> (another reason to keep things in the ASF?), but what about python? > >>> How are pyspark users expected to get that code going forward, since > >>> it's not in Spark's pyspark.zip anymore? > >>> > >>> > >>> Is there an easy way of keeping these things within the ASF Spark > >>> project? I think that would be better for everybody. > >>> > >>> -- > >>> Marcelo
Re: SPARK-13843 and future of streaming backends
On Thu, Mar 17, 2016 at 2:55 PM, Cody Koeninger wrote: > Why would a PMC vote be necessary on every code deletion? > Certainly PMC votes are not necessary on *every* code deletion. I don't think there is a very clear rule on when such discussion is warranted, just a soft expectation that committers understand which changes require more discussion before getting merged. I believe the only formal requirement for a PMC vote is when there is a release. But I think as a community we'd much rather deal with these issues ahead of time, rather than having contentious discussions around releases because some are strongly opposed to changes that have already been merged. I'm all for the idea of removing these modules in general (for all of the reasons already mentioned), but it seems that there are important questions about how the new packages get distributed and how they are managed that merit further discussion. I'm somewhat torn on the question of sub-project vs. independent, and how it's governed. I think Steve has summarized the tradeoffs very well. I do want to emphasize, though, that if they are entirely external from the ASF, the artifact ids and the package names must change at the very least.
Re: SPARK-13843 and future of streaming backends
There's a difference between "without discussion" and "without as much discussion as I would have liked to have a chance to notice it". There are plenty of PRs that got merged before I noticed them that I would rather have not gotten merged. As far as group / artifact name compatibility, at least in the case of Kafka we need different artifact names anyway, and people are going to have to make changes to their build files for spark 2.0 anyway. As far as keeping the actual classes in org.apache.spark to not break code despite the group name being different, I don't know whether that would be enforced by maven central, just looked at as poor taste, or ASF suing for trademark violation :) For people who would rather the problem be solved with official asf subprojects, which committers are volunteering to help do that work? Reynold already said he doesn't want to mess with that overhead. I'm fine with continuing to help work on the Kafka integration wherever it ends up, I'd just like the color of the bikeshed to get decided so we can build a decent bike... On Thu, Mar 17, 2016 at 3:51 PM, Hari Shreedharan wrote: > I have worked with various ASF projects for 4+ years now. Sure, ASF projects > can delete code as they feel fit. But this is the first time I have really > seen code being "moved out" of a project without discussion. I am sure you > can do this without violating ASF policy, but the explanation for that would > be convoluted (someone decided to make a copy and then the ASF project > deleted it?). > > Also, moving the code out would break compatibility. AFAIK, there is no way > to push org.apache.* artifacts directly to maven central. That happens via > mirroring from the ASF maven repos. Even if it you could somehow directly > push the artifacts to mvn, you really can push to org.apache.* groups only > if you are part of the repo and acting as an agent of that project (which in > this case would be Apache Spark). 
Once you move the code out, even a > committer/PMC member would not be representing the ASF when pushing the > code. I am not sure if there is a way to fix this issue. > > > Thanks, > Hari > > On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan > wrote: >> >> I am not referring to code edits - but to migrating submodules and >> code currently in Apache Spark to 'outside' of it. >> If I understand correctly, assets from Apache Spark are being moved >> out of it into thirdparty external repositories - not owned by Apache. >> >> At a minimum, dev@ discussion (like this one) should be initiated. >> As PMC is responsible for the project assets (including code), signoff >> is required for it IMO. >> >> More experienced Apache members might be opine better in case I got it >> wrong ! >> >> >> Regards, >> Mridul >> >> >> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger >> wrote: >> > Why would a PMC vote be necessary on every code deletion? >> > >> > There was a Jira and pull request discussion about the submodules that >> > have been removed so far. >> > >> > https://issues.apache.org/jira/browse/SPARK-13843 >> > >> > There's another ongoing one about Kafka specifically >> > >> > https://issues.apache.org/jira/browse/SPARK-13877 >> > >> > >> > On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan >> > wrote: >> >> >> >> I was not aware of a discussion in Dev list about this - agree with >> >> most of >> >> the observations. >> >> In addition, I did not see PMC signoff on moving (sub-)modules out. >> >> >> >> Regards >> >> Mridul >> >> >> >> >> >> >> >> On Thursday, March 17, 2016, Marcelo Vanzin >> >> wrote: >> >>> >> >>> Hello all, >> >>> >> >>> Recently a lot of the streaming backends were moved to a separate >> >>> project on github and removed from the main Spark repo. >> >>> >> >>> While I think the idea is great, I'm a little worried about the >> >>> execution. 
Some concerns were already raised on the bug mentioned >> >>> above, but I'd like to have a more explicit discussion about this so >> >>> things don't fall through the cracks. >> >>> >> >>> Mainly I have three concerns. >> >>> >> >>> i. Ownership >> >>> >> >>> That code used to be run by the ASF, but now it's hosted in a github >> >>> repo owned not by the ASF. That sounds a little sub-optimal, if not >> >>> problematic. >> >>> >> >>> ii. Governance >> >>> >> >>> Similar to the above; who has commit access to the above repos? Will >> >>> all the Spark committers, present and future, have commit access to >> >>> all of those repos? Are they still going to be considered part of >> >>> Spark and have release management done through the Spark community? >> >>> >> >>> >> >>> For both of the questions above, why are they not turned into >> >>> sub-projects of Spark and hosted on the ASF repos? I believe there is >> >>> a mechanism to do that, without the need to keep the code in the main >> >>> Spark repo, right? >> >>> >> >>> iii. Usability >> >>> >> >>> This is another thing I don't see discussed. For Scala-based code >> >>> thi
Re: SPARK-13843 and future of streaming backends
On Thu, Mar 17, 2016 at 12:01 PM, Cody Koeninger wrote: > i. An ASF project can clearly decide that some of its code is no > longer worth maintaining and delete it. This isn't really any > different. It's still apache licensed so ultimately whoever wants the > code can get it. Absolutely. But I don't remember this being discussed either way. Was the intention, as you mention later, just to decouple the release of those components from the main Spark release, or to completely disown that code? If the latter, is the ASF ok with it still retaining the current package and artifact names? Changing those would break backwards compatibility. Which is why I believe that keeping them as a sub-project, even if their release cadence is much slower, would be a better solution for both developers and users. > ii. I think part of the rationale is to not tie release management to > Spark, so it can proceed on a schedule that makes sense. I'm fine > with helping out with release management for the Kafka subproject, for > instance. I agree that practical governance questions need to be > worked out. > > iii. How is this any different from how python users get access to > any other third party Spark package? True, but that requires the modules to be published somewhere, not just to live as a bunch of .py files in a github repo. Basically, I'm worried that there's work to be done to keep those modules working in this new environment - how to build, test, and publish things, remove potential uses of internal Spark APIs, just to cite a couple of things. -- Marcelo
Re: SPARK-13843 and future of streaming backends
On Fri, Mar 18, 2016 at 10:09 AM, Jean-Baptiste Onofré wrote:
> a project can have multiple repos: it's what we have in ServiceMix, in
> Karaf.
> For the *-extra on github, if the code has been in the ASF, the PMC members
> have to vote to move the code on *-extra.

That's good to know. To me that sounds like the best solution.

I've heard that top-level projects have some requirements with regard to having active development, and these components probably will not see that much activity. And top-level does sound like too much bureaucracy for this.

--
Marcelo
SPARK-13843 and future of streaming backends
Hello all,

Recently a lot of the streaming backends were moved to a separate project on github and removed from the main Spark repo.

While I think the idea is great, I'm a little worried about the execution. Some concerns were already raised on the bug mentioned above, but I'd like to have a more explicit discussion about this so things don't fall through the cracks.

Mainly I have three concerns.

i. Ownership

That code used to be run by the ASF, but now it's hosted in a github repo owned not by the ASF. That sounds a little sub-optimal, if not problematic.

ii. Governance

Similar to the above; who has commit access to the above repos? Will all the Spark committers, present and future, have commit access to all of those repos? Are they still going to be considered part of Spark and have release management done through the Spark community?

For both of the questions above, why are they not turned into sub-projects of Spark and hosted on the ASF repos? I believe there is a mechanism to do that, without the need to keep the code in the main Spark repo, right?

iii. Usability

This is another thing I don't see discussed. For Scala-based code things don't change much, I guess, if the artifact names don't change (another reason to keep things in the ASF?), but what about python? How are pyspark users expected to get that code going forward, since it's not in Spark's pyspark.zip anymore?

Is there an easy way of keeping these things within the ASF Spark project? I think that would be better for everybody.

--
Marcelo
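[Editor's note: Marcelo's usability question has a concrete shape for JVM users: an externally hosted connector would presumably be pulled in at submit time as a regular Maven dependency rather than shipping inside Spark. The coordinates below are purely hypothetical -- the new home and group id had not been decided at this point in the thread -- so this is a sketch of the mechanism, not the actual published artifacts.]

```shell
# Hypothetical sketch: pulling an externally published streaming connector
# at submit time via Ivy/Maven resolution, instead of relying on it being
# bundled with Spark. The group id, artifact name, and version here are
# illustrative placeholders only.
./bin/spark-submit \
  --packages org.example.spark-extras:spark-streaming-flume_2.11:1.0.0 \
  my_streaming_app.py
```

Note that this only covers the JVM artifact side; as Marcelo points out, any pure-Python pieces formerly shipped in pyspark.zip would additionally need to be published somewhere installable (e.g. PyPI) for pyspark users.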
Re: SPARK-13843 and future of streaming backends
I was not aware of a discussion in Dev list about this - agree with most of the observations. In addition, I did not see PMC signoff on moving (sub-)modules out. Regards Mridul On Thursday, March 17, 2016, Marcelo Vanzin wrote: > Hello all, > > Recently a lot of the streaming backends were moved to a separate > project on github and removed from the main Spark repo. > > While I think the idea is great, I'm a little worried about the > execution. Some concerns were already raised on the bug mentioned > above, but I'd like to have a more explicit discussion about this so > things don't fall through the cracks. > > Mainly I have three concerns. > > i. Ownership > > That code used to be run by the ASF, but now it's hosted in a github > repo owned not by the ASF. That sounds a little sub-optimal, if not > problematic. > > ii. Governance > > Similar to the above; who has commit access to the above repos? Will > all the Spark committers, present and future, have commit access to > all of those repos? Are they still going to be considered part of > Spark and have release management done through the Spark community? > > > For both of the questions above, why are they not turned into > sub-projects of Spark and hosted on the ASF repos? I believe there is > a mechanism to do that, without the need to keep the code in the main > Spark repo, right? > > iii. Usability > > This is another thing I don't see discussed. For Scala-based code > things don't change much, I guess, if the artifact names don't change > (another reason to keep things in the ASF?), but what about python? > How are pyspark users expected to get that code going forward, since > it's not in Spark's pyspark.zip anymore? > > > Is there an easy way of keeping these things within the ASF Spark > project? I think that would be better for everybody. > > -- > Marcelo > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >
Re: SPARK-13843 and future of streaming backends
> On Mar 19, 2016, at 8:32 AM, Steve Loughran wrote: > > >> On 18 Mar 2016, at 17:07, Marcelo Vanzin wrote: >> >> Hi Steve, thanks for the write up. >> >> On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran >> wrote: >>> If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs >>> to go through incubation. While normally its the incubator PMC which >>> sponsors/oversees the incubating project, it doesn't have to be the case: >>> the spark project can do it. >>> >>> Also Apache Arrow managed to make it straight to toplevel without that >>> process. Given that the spark extras are already ASF source files, you >>> could try the same thing, add all the existing committers, then look for >>> volunteers to keep things. >> >> Am I to understand from your reply that it's not possible for a single >> project to have multiple repos? >> > > > I don't know. there's generally a 1 project -> 1x issue, 1x JIRA. > > but: hadoop core has 3x JIRA, 1x repo, and one set of write permissions to > that repo, with the special exception of branches (encryption, ipv6) that > have their own committers. > > oh, and I know that hadoop site is on SVN, as are other projects, just to > integrate with asf site publishing, so you can certainly have 1x git + 1 x svn > > ASF won't normally let you have 1 repo with different bits of the tree having > different access rights, so you couldn't open up spark-extras to people with > less permissions/rights than others. > > A separate repo will, separate issue tracking helps you isolate stuff Multiple repositories per project are certainly allowed without incurring the overhead of a subproject; Cordova and CouchDB are two projects that have taken this approach: https://github.com/apache?utf8=✓&query=cordova- https://github.com/apache?utf8=✓&query=couchdb- I believe Cordova also generates independent release artifacts in different cycles (e.g. cordova-ios releases independently from cordova-android). 
If the goal is to enable a divergent set of committers to spark-extras then an independent project makes sense. If you’re just looking to streamline the main repo and decouple some of these other streaming “backends” from the normal release cycle then there are low impact ways to accomplish this inside a single Apache Spark project. Cheers, Adam - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SPARK-13843 and future of streaming backends
Marcelo Vanzin wrote earlier: > Recently a lot of the streaming backends were moved to a separate > project on github and removed from the main Spark repo. Question: why was the code removed from the Spark repo? What's the harm in keeping it available here? The ASF is perfectly happy if anyone wants to fork our code - that's one of the core tenets of the Apache license. You just can't take the name or trademarks, so you may need to change some package names or the like. So it's fine if some people want to work on the code outside the project. But it's puzzling as to why the Spark PMC shouldn't keep the code in the project as well, even if it might not have the same release cycles or whatnot. - Shane - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SPARK-13843 and future of streaming backends
Hi Marcelo,

a project can have multiple repos: it's what we have in ServiceMix, in Karaf.

For the *-extra on github, if the code has been in the ASF, the PMC members have to vote to move the code on *-extra.

Regards
JB

On 03/18/2016 06:07 PM, Marcelo Vanzin wrote:
> Hi Steve, thanks for the write up.
>
> On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran wrote:
>> If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs
>> to go through incubation. While normally its the incubator PMC which
>> sponsors/oversees the incubating project, it doesn't have to be the case:
>> the spark project can do it.
>>
>> Also Apache Arrow managed to make it straight to toplevel without that
>> process. Given that the spark extras are already ASF source files, you
>> could try the same thing, add all the existing committers, then look for
>> volunteers to keep things.
>
> Am I to understand from your reply that it's not possible for a single
> project to have multiple repos?

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: SPARK-13843 and future of streaming backends
> On 18 Mar 2016, at 22:24, Marcelo Vanzin wrote:
>
> On Fri, Mar 18, 2016 at 2:12 PM, chrismattmann wrote:
>> So, my comment here is that any code *cannot* be removed from an Apache
>> project if there is a VETO issued which so far I haven't seen, though maybe
>> Marcelo can clarify that.
>
> No, my intention was not to veto the change. I'm actually for the
> removal of components if the community thinks they don't add much to
> the project. (I'm also not sure I can even veto things, not being a
> PMC member.)
>
> I mainly wanted to know what was the path forward for those components
> because, with Cloudera's hat on, we care about one of them (streaming
> integration with flume), and we'd prefer if that code remained under
> the ASF umbrella in some way.

I'd be supportive of a spark-extras project; it'd actually be a place to keep stuff I've worked on:
- the yarn ATS 1/1.5 integration
- that mutant hive JAR which has the consistent kryo dependency and different shadings
- ... etc

There's also the fact that the twitter streaming is a common example to play with, and flume is popular in places too.

If you want to set up a new incubator project with a goal of graduating fast, I'd help. As a key metric of getting out of the incubator is active development, you just need to "recruit" contributors and keep them engaged.
Re: SPARK-13843 and future of streaming backends
> On 18 Mar 2016, at 17:07, Marcelo Vanzin wrote: > > Hi Steve, thanks for the write up. > > On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran > wrote: >> If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs >> to go through incubation. While normally its the incubator PMC which >> sponsors/oversees the incubating project, it doesn't have to be the case: >> the spark project can do it. >> >> Also Apache Arrow managed to make it straight to toplevel without that >> process. Given that the spark extras are already ASF source files, you could >> try the same thing, add all the existing committers, then look for >> volunteers to keep things. > > Am I to understand from your reply that it's not possible for a single > project to have multiple repos? > I don't know. there's generally a 1 project -> 1x issue, 1x JIRA. but: hadoop core has 3x JIRA, 1x repo, and one set of write permissions to that repo, with the special exception of branches (encryption, ipv6) that have their own committers. oh, and I know that hadoop site is on SVN, as are other projects, just to integrate with asf site publishing, so you can certainly have 1x git + 1 x svn ASF won't normally let you have 1 repo with different bits of the tree having different access rights, so you couldn't open up spark-extras to people with less permissions/rights than others. A separate repo will, separate issue tracking helps you isolate stuff - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SPARK-13843 and future of streaming backends
> On 17 Mar 2016, at 21:33, Marcelo Vanzin wrote:
>
> Hi Reynold, thanks for the info.
>
> On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin wrote:
>> If one really feels strongly that we should go through all the overhead to
>> setup an ASF subproject for these modules that won't work with the new
>> structured streaming, and want to spearhead to setup separate repos
>> (preferably one subproject per connector), CI, separate JIRA, governance,
>> READMEs, voting, we can discuss that. Until then, I'd keep the github option
>> open because IMHO it is what works the best for end users (including
>> discoverability, issue tracking, release publishing, ...).
>
> For those of us who are not exactly familiar with the inner workings
> of administrating ASF projects, would you mind explaining in more
> detail what this overhead is?
>
> From my naive point of view, when I say "sub project" I assume that
> it's as simple as having a separate git repo for it, tied to the same
> parent project. Everything else - JIRA, committers, bylaws, etc -
> remains the same. And since the projects we're talking about are very
> small, CI should be very simple (Travis?) and, assuming sporadic
> releases, things overall should not be that expensive to maintain.

If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs to go through incubation. While normally it's the incubator PMC which sponsors/oversees the incubating project, it doesn't have to be the case: the spark project can do it.

Also Apache Arrow managed to make it straight to toplevel without that process. Given that the spark extras are already ASF source files, you could try the same thing, add all the existing committers, then look for volunteers to keep things.

You'd get:
- a JIRA entry of your own, easy to reassign bugs from SPARK to SPARK-EXTRAS
- a bit of git
- the ability to set up builds on ASF Jenkins. Regression testing against spark nightlies would be invaluable here.
- the ability to stage and publish through ASF Nexus

-Steve
Re: SPARK-13843 and future of streaming backends
On Fri, Mar 18, 2016 at 7:58 AM, Cody Koeninger wrote:
>> Or, as Cody Koeninger suggests, having a spark-extras project in the ASF
>> with a focus on extras with their own support channel.
>
> To be clear, I didn't suggest that and don't think that's the best
> solution. I said to the people who want things done that way, which
> committer is going to step up and do that organizational work?

I am currently not a committer, but if we are willing to go in the direction of having another project as spark-extras, I can help drive the bureaucratic work to make this a reality.

--
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
Re: SPARK-13843 and future of streaming backends
Code can be removed from an ASF project. That code can live on elsewhere (in accordance with the license). It can't be presented as part of the official ASF project; it's like any other 3rd party project. The package name certainly must change from org.apache.spark. I don't know of a protocol, but common sense dictates a good-faith effort to offer equivalent access to the code (e.g. interested committers should probably be repo owners too).

This differs from "any other code deletion" in that there's an intent to keep working on the code, but outside the project. More discussion -- like this one -- would have been useful beforehand, but nothing's undoable.

Backwards compatibility is not a good reason for things, because we're talking about Spark 2.x, and we're already talking about distributing the code differently.

Is the reason for this change decoupling releases? or changing governance? Seems like the former, but we don't actually need the latter to achieve that. There's an argument for a new repo, but this is not an argument for moving X out of the project per se. I'm sure doing this in the ASF is more overhead, but if changing governance is a non-goal, there's no choice. Convenience can't trump that.

Kafka integration is clearly more important than the others. It seems to need to stay within the project. However this still leaves a packaging problem to solve, which might need a new repo. This is orthogonal.

Here's what I think:

1. Leave the moved modules outside the project entirely (why not Kinesis though? that one was not made clear)
2. Change package names and make sure it's clearly presented as external
3. Add any committers that want to be repo owners as owners
4. Keep Kafka within the project
5. Add some subproject within the current project as needed to accomplish distribution goals

On Thu, Mar 17, 2016 at 6:14 PM, Marcelo Vanzin wrote:
> Hello all,
>
> Recently a lot of the streaming backends were moved to a separate
> project on github and removed from the main Spark repo.
>
> While I think the idea is great, I'm a little worried about the
> execution. Some concerns were already raised on the bug mentioned
> above, but I'd like to have a more explicit discussion about this so
> things don't fall through the cracks.
>
> Mainly I have three concerns.
>
> i. Ownership
>
> That code used to be run by the ASF, but now it's hosted in a github
> repo owned not by the ASF. That sounds a little sub-optimal, if not
> problematic.
>
> ii. Governance
>
> Similar to the above; who has commit access to the above repos? Will
> all the Spark committers, present and future, have commit access to
> all of those repos? Are they still going to be considered part of
> Spark and have release management done through the Spark community?
>
> For both of the questions above, why are they not turned into
> sub-projects of Spark and hosted on the ASF repos? I believe there is
> a mechanism to do that, without the need to keep the code in the main
> Spark repo, right?
>
> iii. Usability
>
> This is another thing I don't see discussed. For Scala-based code
> things don't change much, I guess, if the artifact names don't change
> (another reason to keep things in the ASF?), but what about python?
> How are pyspark users expected to get that code going forward, since
> it's not in Spark's pyspark.zip anymore?
>
> Is there an easy way of keeping these things within the ASF Spark
> project? I think that would be better for everybody.
>
> --
> Marcelo
Re: SPARK-13843 and future of streaming backends
Note the non-kafka bug was filed right before the change was pushed. So there really wasn't any discussion before the decision was made to remove that code. I'm just trying to merge both discussions here in the list where it's a little bit more dynamic than bug updates that end up getting lost in the noise. On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger wrote: > Why would a PMC vote be necessary on every code deletion? > > There was a Jira and pull request discussion about the submodules that > have been removed so far. > > https://issues.apache.org/jira/browse/SPARK-13843 > > There's another ongoing one about Kafka specifically > > https://issues.apache.org/jira/browse/SPARK-13877 > > > On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan wrote: >> >> I was not aware of a discussion in Dev list about this - agree with most of >> the observations. >> In addition, I did not see PMC signoff on moving (sub-)modules out. >> >> Regards >> Mridul >> >> >> >> On Thursday, March 17, 2016, Marcelo Vanzin wrote: >>> >>> Hello all, >>> >>> Recently a lot of the streaming backends were moved to a separate >>> project on github and removed from the main Spark repo. >>> >>> While I think the idea is great, I'm a little worried about the >>> execution. Some concerns were already raised on the bug mentioned >>> above, but I'd like to have a more explicit discussion about this so >>> things don't fall through the cracks. >>> >>> Mainly I have three concerns. >>> >>> i. Ownership >>> >>> That code used to be run by the ASF, but now it's hosted in a github >>> repo owned not by the ASF. That sounds a little sub-optimal, if not >>> problematic. >>> >>> ii. Governance >>> >>> Similar to the above; who has commit access to the above repos? Will >>> all the Spark committers, present and future, have commit access to >>> all of those repos? Are they still going to be considered part of >>> Spark and have release management done through the Spark community? 
>>> >>> >>> For both of the questions above, why are they not turned into >>> sub-projects of Spark and hosted on the ASF repos? I believe there is >>> a mechanism to do that, without the need to keep the code in the main >>> Spark repo, right? >>> >>> iii. Usability >>> >>> This is another thing I don't see discussed. For Scala-based code >>> things don't change much, I guess, if the artifact names don't change >>> (another reason to keep things in the ASF?), but what about python? >>> How are pyspark users expected to get that code going forward, since >>> it's not in Spark's pyspark.zip anymore? >>> >>> >>> Is there an easy way of keeping these things within the ASF Spark >>> project? I think that would be better for everybody. >>> >>> -- >>> Marcelo >>> >>> - >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >>> For additional commands, e-mail: dev-h...@spark.apache.org >>> >> -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SPARK-13843 and future of streaming backends
Also, just wanted to point out something: On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin wrote: > Thanks for initiating this discussion. I merged the pull request because it > was unblocking another major piece of work for Spark 2.0: not requiring > assembly jars While I do agree that's more important, the streaming assemblies weren't really blocking that work. The fact that there are still streaming assemblies in the build kinda proves that point. :-) I even filed a task to look at getting rid of the streaming assemblies (SPARK-13575; just the assemblies though, not the code) but while working on it found it would be more complicated than expected, and decided against it given that it didn't really affect work on the other assemblies. -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SPARK-13843 and future of streaming backends
Anyone can fork apache licensed code. Committers can approve pull requests that delete code from asf repos. Because those two things happen near each other in time, it's somehow a process violation? I think the discussion would be better served by concentrating on how we're going to solve the problem and move forward. On Thu, Mar 17, 2016 at 3:13 PM, Mridul Muralidharan wrote: > I am not referring to code edits - but to migrating submodules and > code currently in Apache Spark to 'outside' of it. > If I understand correctly, assets from Apache Spark are being moved > out of it into thirdparty external repositories - not owned by Apache. > > At a minimum, dev@ discussion (like this one) should be initiated. > As PMC is responsible for the project assets (including code), signoff > is required for it IMO. > > More experienced Apache members might be opine better in case I got it wrong ! > > > Regards, > Mridul > > > On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger wrote: >> Why would a PMC vote be necessary on every code deletion? >> >> There was a Jira and pull request discussion about the submodules that >> have been removed so far. >> >> https://issues.apache.org/jira/browse/SPARK-13843 >> >> There's another ongoing one about Kafka specifically >> >> https://issues.apache.org/jira/browse/SPARK-13877 >> >> >> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan >> wrote: >>> >>> I was not aware of a discussion in Dev list about this - agree with most of >>> the observations. >>> In addition, I did not see PMC signoff on moving (sub-)modules out. >>> >>> Regards >>> Mridul >>> >>> >>> >>> On Thursday, March 17, 2016, Marcelo Vanzin wrote: Hello all, Recently a lot of the streaming backends were moved to a separate project on github and removed from the main Spark repo. While I think the idea is great, I'm a little worried about the execution. 
Some concerns were already raised on the bug mentioned above, but I'd like to have a more explicit discussion about this so things don't fall through the cracks. Mainly I have three concerns. i. Ownership That code used to be run by the ASF, but now it's hosted in a github repo owned not by the ASF. That sounds a little sub-optimal, if not problematic. ii. Governance Similar to the above; who has commit access to the above repos? Will all the Spark committers, present and future, have commit access to all of those repos? Are they still going to be considered part of Spark and have release management done through the Spark community? For both of the questions above, why are they not turned into sub-projects of Spark and hosted on the ASF repos? I believe there is a mechanism to do that, without the need to keep the code in the main Spark repo, right? iii. Usability This is another thing I don't see discussed. For Scala-based code things don't change much, I guess, if the artifact names don't change (another reason to keep things in the ASF?), but what about python? How are pyspark users expected to get that code going forward, since it's not in Spark's pyspark.zip anymore? Is there an easy way of keeping these things within the ASF Spark project? I think that would be better for everybody. -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org >>> - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SPARK-13843 and future of streaming backends
Thanks for initiating this discussion.

I merged the pull request because it was unblocking another major piece of work for Spark 2.0: not requiring assembly jars, which is arguably a lot more important than sources that are less frequently used. I take full responsibility for that.

I think it's inaccurate to call them "backends" because it makes these things sound a lot more serious, when in reality they are a bunch of connectors to less frequently used streaming data sources (e.g. mqtt, flume). But that's not that important here.

Another important factor is that over time, with the development of structured streaming, we'd provide a new API for streaming sources that unifies the way to connect arbitrary sources, and as a result all of these sources need to be rewritten anyway. This is similar to the RDD -> DataFrame transition for data sources, which, although initially painful, in the long run provides a much better experience for end-users, because they only need to learn a single API for all sources and it becomes trivial to transition from one source to another without actually impacting business logic. So the truth is that in the long run, the existing connectors will be replaced by new ones, and they have been causing minor issues here and there in the code base.

Now issues like these are never black and white. By moving them out, we'd require users to at least change the maven coordinate in their build file (although things can still be made binary and source compatible). So I made the call and asked the contributor to keep Kafka and Kinesis in, because those are the most widely used (and could be more contentious), and move everything else out.

I have personally done enough data sources or 3rd party packages for Spark on github that I can set up a github repo with CI and maven publishing in just under an hour. I do not expect a lot of changes to these packages because the APIs have been fairly stable.
So the thing I was optimizing for was to minimize the time we need to spend on these packages, given the (expected) low activity and the shift in focus to structured streaming, and also to minimize the chance of breaking user apps, to provide the best user experience. A github repo seems the simplest choice to me.

I also made another decision to provide separate repos (and thus issue trackers) on github for these packages. The reason is that these connectors have very disjoint communities. For example, the community that cares about mqtt is likely very different from the community that cares about akka. It is much easier to track all of these.

Logistics wise -- things are still in flux. I think it'd make a lot of sense to give existing Spark committers (or at least the ones that have contributed to streaming) write access to the github repos. IMHO, it is not in any of the major Spark contributing organizations' strategic interest to "own" these projects, especially considering most of the activity will switch to structured streaming.

If one really feels strongly that we should go through all the overhead to setup an ASF subproject for these modules that won't work with the new structured streaming, and want to spearhead to setup separate repos (preferably one subproject per connector), CI, separate JIRA, governance, READMEs, voting, we can discuss that. Until then, I'd keep the github option open because IMHO it is what works the best for end users (including discoverability, issue tracking, release publishing, ...).

On Thu, Mar 17, 2016 at 1:50 PM, Cody Koeninger wrote:
> Anyone can fork apache licensed code. Committers can approve pull
> requests that delete code from asf repos. Because those two things
> happen near each other in time, it's somehow a process violation?
>
> I think the discussion would be better served by concentrating on how
> we're going to solve the problem and move forward.
> > On Thu, Mar 17, 2016 at 3:13 PM, Mridul Muralidharan > wrote: > > I am not referring to code edits - but to migrating submodules and > > code currently in Apache Spark to 'outside' of it. > > If I understand correctly, assets from Apache Spark are being moved > > out of it into thirdparty external repositories - not owned by Apache. > > > > At a minimum, dev@ discussion (like this one) should be initiated. > > As PMC is responsible for the project assets (including code), signoff > > is required for it IMO. > > > > More experienced Apache members might be opine better in case I got it > wrong ! > > > > > > Regards, > > Mridul > > > > > > On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger > wrote: > >> Why would a PMC vote be necessary on every code deletion? > >> > >> There was a Jira and pull request discussion about the submodules that > >> have been removed so far. > >> > >> https://issues.apache.org/jira/browse/SPARK-13843 > >> > >> There's another ongoing one about Kafka specifically > >> > >> https://issues.apache.org/jira/browse/SPARK-13877 > >> > >> > >> On Thu, Mar 17, 2016 at 2:49 PM, Mr
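[Editor's note: to make Reynold's point about migration cost concrete, for a JVM user the visible change would be roughly a one-line edit to the build file, swapping the maven coordinate. The "after" group id and version below are hypothetical placeholders, since at this point in the thread the new home for the connectors had not been settled; this is a sketch of the shape of the change, not the actual published coordinates.]

```scala
// build.sbt -- sketch of the coordinate change Reynold describes.
// "Before" is the connector as released with Spark 1.6; the "after"
// group id and version are hypothetical placeholders.

// Before: connector released as part of Apache Spark
libraryDependencies += "org.apache.spark" %% "spark-streaming-mqtt" % "1.6.1"

// After: same connector published from an external repo under a new group id
libraryDependencies += "org.example.spark-extras" %% "spark-streaming-mqtt" % "1.0.0"
```

If the package names inside the jar stay the same, user code would compile unchanged; only the resolution coordinate moves, which is the "binary and source compatible" part of Reynold's argument.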
Re: SPARK-13843 and future of streaming backends
I am not referring to code edits - but to migrating submodules and code currently in Apache Spark to 'outside' of it. If I understand correctly, assets from Apache Spark are being moved out of it into thirdparty external repositories - not owned by Apache. At a minimum, dev@ discussion (like this one) should be initiated. As the PMC is responsible for the project assets (including code), signoff is required for it IMO. More experienced Apache members might opine better in case I got it wrong! Regards, Mridul On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger wrote: > Why would a PMC vote be necessary on every code deletion? > > There was a Jira and pull request discussion about the submodules that > have been removed so far. > > https://issues.apache.org/jira/browse/SPARK-13843 > > There's another ongoing one about Kafka specifically > > https://issues.apache.org/jira/browse/SPARK-13877 > > > On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan wrote: >> >> I was not aware of a discussion in Dev list about this - agree with most of >> the observations. >> In addition, I did not see PMC signoff on moving (sub-)modules out. >> >> Regards >> Mridul >> >> >> >> On Thursday, March 17, 2016, Marcelo Vanzin wrote: >>> >>> Hello all, >>> >>> Recently a lot of the streaming backends were moved to a separate >>> project on github and removed from the main Spark repo. >>> >>> While I think the idea is great, I'm a little worried about the >>> execution. Some concerns were already raised on the bug mentioned >>> above, but I'd like to have a more explicit discussion about this so >>> things don't fall through the cracks. >>> >>> Mainly I have three concerns. >>> >>> i. Ownership >>> >>> That code used to be run by the ASF, but now it's hosted in a github >>> repo owned not by the ASF. That sounds a little sub-optimal, if not >>> problematic. >>> >>> ii. Governance >>> >>> Similar to the above; who has commit access to the above repos?
Will >>> all the Spark committers, present and future, have commit access to >>> all of those repos? Are they still going to be considered part of >>> Spark and have release management done through the Spark community? >>> >>> >>> For both of the questions above, why are they not turned into >>> sub-projects of Spark and hosted on the ASF repos? I believe there is >>> a mechanism to do that, without the need to keep the code in the main >>> Spark repo, right? >>> >>> iii. Usability >>> >>> This is another thing I don't see discussed. For Scala-based code >>> things don't change much, I guess, if the artifact names don't change >>> (another reason to keep things in the ASF?), but what about python? >>> How are pyspark users expected to get that code going forward, since >>> it's not in Spark's pyspark.zip anymore? >>> >>> >>> Is there an easy way of keeping these things within the ASF Spark >>> project? I think that would be better for everybody. >>> >>> -- >>> Marcelo >>> >>> - >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >>> For additional commands, e-mail: dev-h...@spark.apache.org >>>
Re: SPARK-13843 and future of streaming backends
So, my comment here is that code *cannot* be removed from an Apache project if a VETO has been issued, which so far I haven't seen, though maybe Marcelo can clarify that. However, if a VETO was issued, then the code cannot be removed and must be put back. Anyone can fork anything; our license allows that. But the community itself must steward the code, and part of that is hearing everyone's voice within that community before acting. Cheers, Chris
Re: SPARK-13843 and future of streaming backends
If the intention is to actually decouple and give a life of its own to these connectors, I would have expected that they would still be hosted as different git repositories inside Apache, even though users will not really see much difference, as they would still be mirrored on GitHub. This makes it much easier on the legal departments of the upstream consumers and customers as well, because the code still follows the well-received and trusted Apache Governance and Apache Release Policies. As for implementation details, we can have multiple repositories if we see a lot of fragmented releases, or a single "connectors" repository, which on our side would make administration easier. On Thu, Mar 17, 2016 at 2:33 PM, Marcelo Vanzin wrote: > Hi Reynold, thanks for the info. > > On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin wrote: > > If one really feels strongly that we should go through all the overhead > to > > setup an ASF subproject for these modules that won't work with the new > > structured streaming, and want to spearhead to setup separate repos > > (preferably one subproject per connector), CI, separate JIRA, governance, > > READMEs, voting, we can discuss that. Until then, I'd keep the github > option > > open because IMHO it is what works the best for end users (including > > discoverability, issue tracking, release publishing, ...). > Agree that there might be a little overhead, but there are ways to minimize this, and I am sure there are volunteers willing to help in favor of having a more unifying project. Breaking things into multiple projects, and having to manage the matrix of supported versions, would be a far worse overhead. > > For those of us who are not exactly familiar with the inner workings > of administrating ASF projects, would you mind explaining in more > detail what this overhead is? > > From my naive point of view, when I say "sub project" I assume that > it's as simple as having a separate git repo for it, tied to the same > parent project.
Everything else - JIRA, committers, bylaws, etc - > remains the same. And since the projects we're talking about are very > small, CI should be very simple (Travis?) and, assuming sporadic > releases, things overall should not be that expensive to maintain. > > Subprojects, or even sending this back to the incubator as a "connectors" project, would be better than one public GitHub repo per package, in my opinion. Now, if this move is signaling to customers that the 1.x Streaming API is going away in favor of the new structured streaming APIs, then I guess this is a completely different discussion. -- Luciano Resende http://people.apache.org/~lresende http://twitter.com/lresende1975 http://lresende.blogspot.com/
Re: SPARK-13843 and future of streaming backends
Hi Steve, thanks for the write-up. On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran wrote: > If you want a separate project, e.g. SPARK-EXTRAS, then it *generally* needs > to go through incubation. While normally it's the incubator PMC which > sponsors/oversees the incubating project, it doesn't have to be the case: the > spark project can do it. > > Also Apache Arrow managed to make it straight to toplevel without that > process. Given that the spark extras are already ASF source files, you could > try the same thing, add all the existing committers, then look for volunteers > to keep things. Am I to understand from your reply that it's not possible for a single project to have multiple repos? -- Marcelo
Re: SPARK-13843 and future of streaming backends
On Fri, Mar 18, 2016 at 3:15 PM, Shane Curcuru wrote: > Question: why was the code removed from the Spark repo? What's the harm > in keeping it available here? Assuming the Spark PMC has no plans to release the code, why would we keep it in our codebase? It only makes the codebase harder to navigate, and it would be easy for someone to stumble on that code and expect it to be part of the release. It seems like a general code-hygiene practice, e.g. just like not leaving a giant commented-out block of old code. (But as you can see from the rest of the thread, I think the discussion on whether it should still be part of Apache Spark is ongoing...)
Re: SPARK-13843 and future of streaming backends
Hi Marcelo, I quickly discussed this with Reynold this morning. I share your concerns. I fully understand that it's painful for users to wait for a Spark release to include a fix in a streaming backend, as it's not really related. It makes sense to provide backends "outside" of the ASF, especially for legal issues: it's what we do at Camel with Camel-Extra. Don't you think it could be interesting to have another ASF git repo dedicated to streaming backends, where each backend can manage its release cycle following the ASF "rules" (staging, vote, ...)? Regards JB On 03/17/2016 07:14 PM, Marcelo Vanzin wrote: > [...] -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
Re: SPARK-13843 and future of streaming backends
Hi Marcelo, Thanks for your reply. As a committer on the project, you *can* VETO code. For sure. Unfortunately you don’t have a binding vote on adding new PMC members/committers, and/or on releasing the software, but do have the ability to VETO. That said, if that’s not your intent, sorry for misreading your intent. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++ -Original Message- From: Marcelo Vanzin Date: Friday, March 18, 2016 at 3:24 PM To: jpluser Cc: "dev@spark.apache.org" Subject: Re: SPARK-13843 and future of streaming backends >On Fri, Mar 18, 2016 at 2:12 PM, chrismattmann >wrote: >> So, my comment here is that any code *cannot* be removed from an Apache >> project if there is a VETO issued which so far I haven't seen, though >>maybe >> Marcelo can clarify that. > >No, my intention was not to veto the change. I'm actually for the >removal of components if the community thinks they don't add much to >the project. (I'm also not sure I can even veto things, not being a >PMC member.) > >I mainly wanted to know what was the path forward for those components >because, with Cloudera's hat on, we care about one of them (streaming >integration with flume), and we'd prefer if that code remained under >the ASF umbrella in some way. > >-- >Marcelo
Re: SPARK-13843 and future of streaming backends
On Fri, Mar 18, 2016 at 10:07 AM, Marcelo Vanzin wrote: > Hi Steve, thanks for the write-up. > > On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran > wrote: > > If you want a separate project, e.g. SPARK-EXTRAS, then it *generally* > needs to go through incubation. While normally it's the incubator PMC which > sponsors/oversees the incubating project, it doesn't have to be the case: > the spark project can do it. > > > > Also Apache Arrow managed to make it straight to toplevel without that > process. Given that the spark extras are already ASF source files, you > could try the same thing, add all the existing committers, then look for > volunteers to keep things. > > Am I to understand from your reply that it's not possible for a single > project to have multiple repos? > > It can have multiple repos, but this still brings maintenance overhead onto the PMC, which was brought up previously on this thread, and it might not be the direction the PMC wants to take (but I might be mistaken). Another approach is to make this extras project just a subproject, with its own set of committers etc., which puts less burden on the Spark PMC. Anyway, my main issue here is not who manages it or how, but that it continues under Apache governance. -- Luciano Resende http://people.apache.org/~lresende http://twitter.com/lresende1975 http://lresende.blogspot.com/
Re: SPARK-13843 and future of streaming backends
Why would a PMC vote be necessary on every code deletion? There was a Jira and pull request discussion about the submodules that have been removed so far. https://issues.apache.org/jira/browse/SPARK-13843 There's another ongoing one about Kafka specifically https://issues.apache.org/jira/browse/SPARK-13877 On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan wrote: > > I was not aware of a discussion in Dev list about this - agree with most of > the observations. > In addition, I did not see PMC signoff on moving (sub-)modules out. > > Regards > Mridul > > > > On Thursday, March 17, 2016, Marcelo Vanzin wrote: >> >> Hello all, >> >> Recently a lot of the streaming backends were moved to a separate >> project on github and removed from the main Spark repo. >> >> While I think the idea is great, I'm a little worried about the >> execution. Some concerns were already raised on the bug mentioned >> above, but I'd like to have a more explicit discussion about this so >> things don't fall through the cracks. >> >> Mainly I have three concerns. >> >> i. Ownership >> >> That code used to be run by the ASF, but now it's hosted in a github >> repo owned not by the ASF. That sounds a little sub-optimal, if not >> problematic. >> >> ii. Governance >> >> Similar to the above; who has commit access to the above repos? Will >> all the Spark committers, present and future, have commit access to >> all of those repos? Are they still going to be considered part of >> Spark and have release management done through the Spark community? >> >> >> For both of the questions above, why are they not turned into >> sub-projects of Spark and hosted on the ASF repos? I believe there is >> a mechanism to do that, without the need to keep the code in the main >> Spark repo, right? >> >> iii. Usability >> >> This is another thing I don't see discussed. 
For Scala-based code >> things don't change much, I guess, if the artifact names don't change >> (another reason to keep things in the ASF?), but what about python? >> How are pyspark users expected to get that code going forward, since >> it's not in Spark's pyspark.zip anymore? >> >> >> Is there an easy way of keeping these things within the ASF Spark >> project? I think that would be better for everybody. >> >> -- >> Marcelo >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> For additional commands, e-mail: dev-h...@spark.apache.org >> >
Re: SPARK-13843 and future of streaming backends
> Or, as Cody Koeninger suggests, having a spark-extras project in the ASF with > a focus on extras with their own support channel. To be clear, I didn't suggest that and don't think that's the best solution. I said, to the people who want things done that way: which committer is going to step up and do that organizational work? I think there are advantages to moving everything currently in extras/ and external/ out of the spark project, but the current Kafka packaging issue can be solved straightforwardly by just adding another artifact and code tree under external/. On Fri, Mar 18, 2016 at 5:04 AM, Steve Loughran wrote: > > Spark has hit one of the eternal problems of OSS projects, one hit by: ant, > maven, hadoop, ... anything with a plugin model. > > Take in the plugin: you're in control, but also down for maintenance. > > Leave out the plugin: other people can maintain it, be more agile, etc. > > But you've lost control, and you can't even manage the links. Here I think > maven suffered the most by keeping stuff in codehaus; migrating off there is > still hard: not only did they lose the links, they lost the JIRA. > > Maven's relationship with codehaus was very tightly coupled, lots of > committers on both; I don't know how that relationship was handled at a > higher level. > > > On 17 Mar 2016, at 20:51, Hari Shreedharan > wrote: > > I have worked with various ASF projects for 4+ years now. Sure, ASF projects > can delete code as they see fit. But this is the first time I have really > seen code being "moved out" of a project without discussion. I am sure you > can do this without violating ASF policy, but the explanation for that would > be convoluted (someone decided to make a copy and then the ASF project > deleted it?). > > > +1 for discussion. Dev changes should go to the dev list; PMC for process in > general. Don't think the ASF will overlook stuff like that.
> > Might want to raise this issue on the next board report > > > FWIW, it may be better to just see if you can have committers to work on > these projects: recruit the people and say 'please, only work in this area, > for now'. That gets developers on your team, which is generally considered > a metric of health in a project. > > Or, as Cody Koeninger suggests, having a spark-extras project in the ASF with > a focus on extras with their own support channel. > > > Also, moving the code out would break compatibility. AFAIK, there is no way > to push org.apache.* artifacts directly to maven central. That happens via > mirroring from the ASF maven repos. Even if you could somehow directly > push the artifacts to mvn, you really can push to org.apache.* groups only > if you are part of the repo and acting as an agent of that project (which in > this case would be Apache Spark). Once you move the code out, even a > committer/PMC member would not be representing the ASF when pushing the > code. I am not sure if there is a way to fix this issue. > > > > > This topic has cropped up in the general context of third-party repos > publishing artifacts with org.apache names but vendor-specific suffixes (e.g. > org.apache.hadoop/hadoop-common.5.3-cdh.jar). > > Some people were pretty unhappy about this, but the conclusion reached was > "maven doesn't let you do anything else and still let downstream people use > it". Furthermore, as all ASF releases are nominally the source releases, *not > the binaries*, you can look at the POMs and say "we've released source code > designed to publish artifacts to repos; this is 'use as intended'". > > People are also free to cut their own full project distributions, etc, etc. > For example, I stick up the binaries of Windows builds independent of the > ASF releases; these were originally just those from HDP on Windows installs; > now I check out the commit of the specific ASF release on a Windows 2012 VM, > do the build, copy the binaries.
Free for all to use. But I do suspect that > the ASF legal protections get a bit blurred here. These aren't ASF binaries, > but binaries built directly from unmodified ASF releases. > > In contrast to sticking stuff into a github repo, the moved artifacts cannot > be published as org.apache artifacts on maven central. That's non-negotiable > as far as the ASF is concerned. The process for releasing ASF artifacts > there goes downstream of the ASF public release process: you stage the > artifacts, they are part of the vote process, everything with org.apache > goes through it. > > That said: there is nothing to stop a set of shell org.apache artifacts > being written which do nothing but contain transitive dependencies on > artifacts in different groups, such as org.spark-project. The shells would > be released by the ASF; they pull in the new stuff. And, therefore, it'd be > possible to build a spark-assembly with the files. (I'm ignoring a loop in > the build DAG here; playing with git submodules would let someone eliminate > this by adding the removed libraries under a modified proje
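The "shell artifact" idea Steve describes above can be sketched as a minimal Maven POM. This is only an illustration of the mechanism, not an existing artifact: the group IDs, artifact IDs, and versions below are all hypothetical assumptions.

```xml
<!-- Hypothetical "shell" POM that would be voted on and released by the ASF.
     It ships no code of its own (packaging is "pom"); it exists solely to
     pull in an externally hosted connector as a transitive dependency. -->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.spark</groupId>
  <!-- illustrative name, not a real artifact -->
  <artifactId>spark-streaming-flume-shell_2.11</artifactId>
  <version>2.0.0</version>
  <packaging>pom</packaging>

  <dependencies>
    <!-- The actual code lives under a non-ASF group (illustrative coordinates) -->
    <dependency>
      <groupId>org.spark-project</groupId>
      <artifactId>spark-streaming-flume_2.11</artifactId>
      <version>2.0.0</version>
    </dependency>
  </dependencies>
</project>
```

A downstream build depending on the org.apache shell would transitively resolve the external artifact, which is what would make a spark-assembly build possible without the code living in an ASF repo.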
Re: SPARK-13843 and future of streaming backends
On Fri, Mar 18, 2016 at 2:12 PM, chrismattmann wrote: > So, my comment here is that any code *cannot* be removed from an Apache > project if there is a VETO issued which so far I haven't seen, though maybe > Marcelo can clarify that. No, my intention was not to veto the change. I'm actually for the removal of components if the community thinks they don't add much to the project. (I'm also not sure I can even veto things, not being a PMC member.) I mainly wanted to know what was the path forward for those components because, with Cloudera's hat on, we care about one of them (streaming integration with flume), and we'd prefer if that code remained under the ASF umbrella in some way. -- Marcelo