Re: [PROPOSAL] Apache Spark for the Incubator
That makes sense. Thanks for the update - I am still catching up on my emails backed up because of the Hadoop summit. Cos On Tue, Jun 04, 2013 at 01:44AM, Mattmann, Chris A (398J) wrote: Dear Konstantin, Thanks! The incoming Spark project is excited about the relationship with Bigtop that could happen here. As for new committers, after conferring with the Spark project members, we would like to adopt a simple policy of having all new committers not add themselves to the wiki as of yet, but simply join the project mailing lists when they are created, and then from there, contribute. I and other mentors, and the Spark community are committed to being inclusive, so hopefully won't take too long for anybody to become a PPMC member/committer on the project after some demonstrated contributions. Thanks for your interest and again for your kind words. Cheers! Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Konstantin Boudnik c...@apache.org Reply-To: general@incubator.apache.org general@incubator.apache.org Date: Friday, May 31, 2013 12:29 PM To: general@incubator.apache.org general@incubator.apache.org Subject: Re: [PROPOSAL] Apache Spark for the Incubator Great news! Definitely +1 (non-binding, I guess) on adding Spark to the family of ASF project! I also express the interest to contribute to the project and move it forward to the graduation! Bigtop has been packaging and providing Spark as a part of Hadoop 1.x software stacks for some time; and hopefully would be able to offer it as a part of Hadoop 2.x line in the coming days. Dr. 
Konstantin Boudnik Hadoop committer BigTop PMC On Fri, May 31, 2013 at 06:03PM, Mattmann, Chris A (398J) wrote: Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal The work originates from the Berkeley AMPLab and through a number of industry participants, and other institutions. Spark is a framework for large-scale data analysis on clusters, with a particular focus on low latency operations. The source code is written in Scala, and provides a number of APIs and bindings in various programming languages. The proposal text is copied to the bottom of this email. I'm going to leave this thread open for the next week for discussion. Once it's died down, I'll call an official VOTE. Suresh, Ross G. -- heads up -- this project may be of interest to you both and would welcome you guys as additional mentors. We currently have 3 mentors committed to the project, but would love to have more. People interested in contributing should declare their interest here on the general@incubator thread and those potential contributors will be discussed by the incoming Spark community. Questions -- let's hear em'! :) Cheers, Chris (Champion, incoming Apache Spark) === Abstract === Spark is an open source system for large-scale data analysis on clusters. === Proposal === Spark is an open source system for fast and flexible large-scale data analysis. Spark provides a general purpose runtime that supports low-latency execution in several forms. These include interactive exploration of very large datasets, near real-time stream processing, and ad-hoc SQL analytics (through higher layer extensions). Spark interfaces with HDFS, HBase, Cassandra and several other storage layers, and exposes APIs in Scala, Java and Python. === Background === Spark started as a U.C. Berkeley research project, designed to efficiently run machine learning algorithms on large datasets.
Over time, it has evolved into a general computing engine as outlined above. Spark's developer community has also grown to include additional institutions, such as universities, research labs, and corporations. Funding has been provided by various institutions including the U.S. National Science Foundation, DARPA, and a number of industry sponsors. See: https://amplab.cs.berkeley.edu/sponsors/ for full details. === Rationale === As the number of contributors to Spark has grown, we have sought for a long-term home for the project, and we believe the Apache foundation would be a great fit. Spark is a natural fit for the Apache foundation: Spark already interoperates with several existing Apache projects (HDFS, HBase, Hive
Re: [PROPOSAL] Apache Spark for the Incubator
Yes. d...@spark.incubator.apache.org On Wednesday, June 26, 2013, karthik tunga wrote: Hi, Is the mailing list setup ? Cheers, Karthik On 20 June 2013 02:38, Matei Zaharia ma...@eecs.berkeley.edu wrote: Thanks Chris! We'll get started on all the required steps. Matei On Jun 20, 2013, at 4:35 AM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, This VOTE has passed with the following tallies: +1 Chris Mattmann* Konstantin Boudnik Henry Saputra* Reynold Xin Pei Chen Roman Shaposhnik* Suresh Marru* Scott Deboy Ted Dunning* Hitesh Shah Paul Ramirez* Ralph Goers* Alan Cabrera* Thilina Gunarathne Marcel Offermans* Alex Karasulu* Chris Douglas* Andrew Hart* Deepal jayasinghe Ashish Joe Brockmeier* Mohammad Nour El-Din* Arun C Murthy* Tim Williams* Arvind Prabhakar* Matt Franklin* Matei Zaharia Andy Konwinski +0.9 Marvin Humphrey * -indicates IPMC I'll go ahead and get the JIRA tickets filed for email/issue tracking/Git, and then work with the community to get them moving on' over. Thanks for VOTE'ing! Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Mattmann, jpluser chris.a.mattm...@jpl.nasa.gov Reply-To: general@incubator.apache.org general@incubator.apache.org Date: Friday, June 7, 2013 10:34 PM To: general@incubator.apache.org general@incubator.apache.org Subject: [VOTE] Apache Spark for the Incubator Hi Folks, OK discussion has died down, time to VOTE to accept Spark into the Apache Incubator. I'll let the VOTE run for at least a week. 
So far I've heard +1s from the following folks, so no need for them to VOTE again unless they want to change their VOTE: +1 Chris Mattmann* Konstantin Boudnik Henry Saputra* Reynold Xin Pei Chen Roman Shaposhnik* Suresh Marru* * -indicates IPMC [ ] +1 Accept Spark into the Apache Incubator. [ ] +0 Don't care. [ ] -1 Don't accept Spark into the Apache Incubator because.. Proposal text is below. === Abstract === Spark is an open source system for large-scale data analysis on clusters.
Re: [PROPOSAL] Apache Spark for the Incubator
Hi Marvin On Sun, Jun 9, 2013 at 5:15 AM, Marvin Humphrey mar...@rectangular.com wrote: On Sat, Jun 8, 2013 at 4:55 PM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: Note: we discussed adding Roman before the VOTE and it was fine with the incoming Spark community, so Roman is now on the wiki for the proposal. In case this changes anyone's VOTE on the VOTE thread, feel free to speak up or change your VOTE. Otherwise, nothing else to see here folks. +1 for the original proposal. +0.9 for the new proposal. Yes, I expect you to tally my vote that way. :) Next time, please be more careful when starting a VOTE and please don't change the proposal text in the middle of a vote. Personnel issues in proposals have caused significant problems in the past. That's unlikely to happen in this case, but I want to register my protest now because it might save us hundreds or thousands of emails in the future. This is *not* a [VOTE] yet, this is a [PROPOSAL] in which case the proposal can be updated and enhanced if required. So allow me to disagree about what you replied regarding *not to make changes to the proposal in such phase* Good luck, Spark! Marvin Humphrey - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org -- Thanks - Mohammad Nour Life is like riding a bicycle. To keep your balance you must keep moving - Albert Einstein
Re: [PROPOSAL] Apache Spark for the Incubator
Note: we discussed adding Roman before the VOTE and it was fine with the incoming Spark community, so Roman is now on the wiki for the proposal. In case this changes anyone's VOTE on the VOTE thread, feel free to speak up or change your VOTE. Otherwise, nothing else to see here folks. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Roman Shaposhnik r...@apache.org Date: Saturday, June 8, 2013 3:03 PM To: jpluser chris.a.mattm...@jpl.nasa.gov Subject: Re: [PROPOSAL] Apache Spark for the Incubator On Mon, Jun 3, 2013 at 6:40 PM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Roman, I've conferred with the incoming Spark community and we are happy to have you as a mentor for the project. Feel free to add yourself to the wiki proposal. Great news! Done. Thanks, Roman. - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: [PROPOSAL] Apache Spark for the Incubator
On Sat, Jun 8, 2013 at 4:55 PM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: Note: we discussed adding Roman before the VOTE and it was fine with the incoming Spark community, so Roman is now on the wiki for the proposal. In case this changes anyone's VOTE on the VOTE thread, feel free to speak up or change your VOTE. Otherwise, nothing else to see here folks. +1 for the original proposal. +0.9 for the new proposal. Yes, I expect you to tally my vote that way. :) Next time, please be more careful when starting a VOTE and please don't change the proposal text in the middle of a vote. Personnel issues in proposals have caused significant problems in the past. That's unlikely to happen in this case, but I want to register my protest now because it might save us hundreds or thousands of emails in the future. Good luck, Spark! Marvin Humphrey - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: [PROPOSAL] Apache Spark for the Incubator
Hi Henry, Thanks for your support! I will leave it up to Matei and the incoming Spark community to decide if they would like to add you (or anyone else) to the wiki as a contributor on the project. Thanks! Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Henry Saputra henry.sapu...@gmail.com Reply-To: general@incubator.apache.org general@incubator.apache.org Date: Friday, May 31, 2013 12:38 PM To: general@incubator.apache.org general@incubator.apache.org Subject: Re: [PROPOSAL] Apache Spark for the Incubator Wow! I have been using Shark, which runs on top of Spark, with Mesos in our prototype for API analytics for a while and would LOVE to help as a mentor and initial contributor. - Henry On Fri, May 31, 2013 at 11:03 AM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal The work originates from the Berkeley AMPLab and through a number of industry participants, and other institutions. Spark is a framework for large-scale data analysis on clusters, with a particular focus on low latency operations. The source code is written in Scala, and provides a number of APIs and bindings in various programming languages. The proposal text is copied to the bottom of this email. I'm going to leave this thread open for the next week for discussion. Once it's died down, I'll call an official VOTE. Suresh, Ross G. -- heads up -- this project may be of interest to you both and would welcome you guys as additional mentors. We currently have 3 mentors committed to the project, but would love to have more.
People interested in contributing should declare their interest here on the general@incubator thread and those potential contributors will be discussed by the incoming Spark community. Questions -- let's hear em'! :) Cheers, Chris (Champion, incoming Apache Spark) === Abstract === Spark is an open source system for large-scale data analysis on clusters. === Proposal === Spark is an open source system for fast and flexible large-scale data analysis. Spark provides a general purpose runtime that supports low-latency execution in several forms. These include interactive exploration of very large datasets, near real-time stream processing, and ad-hoc SQL analytics (through higher layer extensions). Spark interfaces with HDFS, HBase, Cassandra and several other storage layers, and exposes APIs in Scala, Java and Python. === Background === Spark started as a U.C. Berkeley research project, designed to efficiently run machine learning algorithms on large datasets. Over time, it has evolved into a general computing engine as outlined above. Spark's developer community has also grown to include additional institutions, such as universities, research labs, and corporations. Funding has been provided by various institutions including the U.S. National Science Foundation, DARPA, and a number of industry sponsors. See: https://amplab.cs.berkeley.edu/sponsors/ for full details. === Rationale === As the number of contributors to Spark has grown, we have sought a long-term home for the project, and we believe the Apache foundation would be a great fit. Spark is a natural fit for the Apache foundation: Spark already interoperates with several existing Apache projects (HDFS, HBase, Hive, Cassandra, Avro and Flume to name a few). The Spark team is familiar with the Apache process and subscribes to the Apache mission - the team includes multiple Apache committers already.
Finally, joining Apache will help coordinate the development effort of the growing number of organizations which contribute to Spark. == Initial Goals == The initial goals will most likely be to move the existing codebase to Apache and integrate with the Apache development process. Furthermore, we plan for incremental development, and releases along with the Apache guidelines. === Current Status === == Meritocracy == The Spark project already operates on meritocratic principles. Today, Spark has several developers and has accepted multiple major patches from outside of U.C. Berkeley. While this process has remained mostly informal (we do not have an official committer list), an implicit organization exists in which individuals who contribute major components act as maintainers for those modules
Re: [PROPOSAL] Apache Spark for the Incubator
Thanks for the support, Pei. I think the questions you had about frameworks/etc., hopefully were answered. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Chen, Pei pei.c...@childrens.harvard.edu Reply-To: general@incubator.apache.org general@incubator.apache.org Date: Friday, May 31, 2013 11:45 AM To: general@incubator.apache.org general@incubator.apache.org Subject: RE: [PROPOSAL] Apache Spark for the Incubator +1 (non-binding) This seems like a really interesting project. Q- Is Spark just a framework/API or does it also have some tools implemented for data analytics? --Pei -Original Message- From: Mattmann, Chris A (398J) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Friday, May 31, 2013 2:04 PM To: general@incubator.apache.org Subject: [PROPOSAL] Apache Spark for the Incubator Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal The work originates from the Berkeley AMPLab and through a number of industry participants, and other institutions. Spark is a framework for large- scale data analysis on clusters, with a particular focus on low latency operations. The source code is written in Scala, and provides a number of APIs and bindings in various programming languages. The proposal text is copied to the bottom of this email. I'm going to leave this thread open for the next week for discussion. Once it's died down, I'll call an official VOTE. Suresh, Ross G. -- heads up -- this project may be of interest to you both and would welcome you guys as additional mentors. We currently have 3 mentors committed to the project, but would love to have more. 
Re: [PROPOSAL] Apache Spark for the Incubator
Thanks for the support Roman! I will leave it up to the incoming Spark community members to decide if they need more mentors and we'll be in touch. Thank you again. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Roman Shaposhnik r...@apache.org Reply-To: general@incubator.apache.org general@incubator.apache.org Date: Friday, May 31, 2013 3:25 PM To: general@incubator.apache.org general@incubator.apache.org Subject: Re: [PROPOSAL] Apache Spark for the Incubator Extremely enthusiastic +1!!! If you ever need help with mentorship -- please let me know. Also, looking forward to seeing this in Bigtop! Thanks, Roman. On Fri, May 31, 2013 at 11:03 AM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal The work originates from the Berkeley AMPLab and through a number of industry participants, and other institutions. Spark is a framework for large-scale data analysis on clusters, with a particular focus on low latency operations. The source code is written in Scala, and provides a number of APIs and bindings in various programming languages. The proposal text is copied to the bottom of this email. I'm going to leave this thread open for the next week for discussion. Once it's died down, I'll call an official VOTE. Suresh, Ross G. -- heads up -- this project may be of interest to you both and would welcome you guys as additional mentors. We currently have 3 mentors committed to the project, but would love to have more. 
Re: [PROPOSAL] Apache Spark for the Incubator
Thanks Suresh, after conferring with the incoming Spark community members, I will add you as a mentor on the wiki. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Suresh Marru sma...@apache.org Reply-To: general@incubator.apache.org general@incubator.apache.org Date: Saturday, June 1, 2013 9:12 PM To: general@incubator.apache.org general@incubator.apache.org Subject: Re: [PROPOSAL] Apache Spark for the Incubator On May 31, 2013, at 2:03 PM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal The work originates from the Berkeley AMPLab and through a number of industry participants, and other institutions. Spark is a framework for large-scale data analysis on clusters, with a particular focus on low latency operations. The source code is written in Scala, and provides a number of APIs and bindings in various programming languages. The proposal text is copied to the bottom of this email. I'm going to leave this thread open for the next week for discussion. Once it's died down, I'll call an official VOTE. Suresh, Ross G. -- heads up -- this project may be of interest to you both and would welcome you guys as additional mentors. We currently have 3 mentors committed to the project, but would love to have more. Thanks Chris for the alert. Great proposal indeed, if the podling needs help I am in. Suresh People interested in contributing should declare their interest here on the general@incubator thread and those potential contributors will be discussed by the incoming Spark community. 
Questions -- let's hear em'! :) Cheers, Chris (Champion, incoming Apache Spark) === Abstract === Spark is an open source system for large-scale data analysis on clusters. === Proposal === Spark is an open source system for fast and flexible large-scale data analysis. Spark provides a general purpose runtime that supports low-latency execution in several forms. These include interactive exploration of very large datasets, near real-time stream processing, and ad-hoc SQL analytics (through higher layer extensions). Spark interfaces with HDFS, HBase, Cassandra and several other storage layers, and exposes APIs in Scala, Java and Python. === Background === Spark started as a U.C. Berkeley research project, designed to efficiently run machine learning algorithms on large datasets. Over time, it has evolved into a general computing engine as outlined above. Spark's developer community has also grown to include additional institutions, such as universities, research labs, and corporations. Funding has been provided by various institutions including the U.S. National Science Foundation, DARPA, and a number of industry sponsors. See: https://amplab.cs.berkeley.edu/sponsors/ for full details. === Rationale === As the number of contributors to Spark has grown, we have sought a long-term home for the project, and we believe the Apache foundation would be a great fit. Spark is a natural fit for the Apache foundation: Spark already interoperates with several existing Apache projects (HDFS, HBase, Hive, Cassandra, Avro and Flume to name a few). The Spark team is familiar with the Apache process and subscribes to the Apache mission - the team includes multiple Apache committers already. Finally, joining Apache will help coordinate the development effort of the growing number of organizations which contribute to Spark. == Initial Goals == The initial goals will most likely be to move the existing codebase to Apache and integrate with the Apache development process.
Furthermore, we plan for incremental development, and releases along with the Apache guidelines. === Current Status === == Meritocracy == The Spark project already operates on meritocratic principles. Today, Spark has several developers and has accepted multiple major patches from outside of U.C. Berkeley. While this process has remained mostly informal (we do not have an official committer list), an implicit organization exists in which individuals who contribute major components act as maintainers for those modules. If accepted, the Spark project would include several of these participants as committers from the onset. We will work to identify all committers and PPMC members for the project
Re: [PROPOSAL] Apache Spark for the Incubator
Hi Konstantin, Thanks for your kind words and expressed interest. I will leave it to Matei and the incoming Spark community members to comment on adding you (or anyone else) as a contributor to the wiki. If they are OK with it, then I am very much too. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Konstantin Boudnik c...@apache.org Reply-To: general@incubator.apache.org general@incubator.apache.org Date: Friday, May 31, 2013 12:29 PM To: general@incubator.apache.org general@incubator.apache.org Subject: Re: [PROPOSAL] Apache Spark for the Incubator Great news! Definitely +1 (non-binding, I guess) on adding Spark to the family of ASF project! I also express the interest to contribute to the project and move it forward to the graduation! Bigtop has been packaging and providing Spark as a part of Hadoop 1.x software stacks for some time; and hopefully would be able to offer it as a part of Hadoop 2.x line in the coming days. Dr. Konstantin Boudnik Hadoop committer BigTop PMC On Fri, May 31, 2013 at 06:03PM, Mattmann, Chris A (398J) wrote: Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal The work originates from the Berkeley AMPLab and through a number of industry participants, and other institutions. Spark is a framework for large-scale data analysis on clusters, with a particular focus on low latency operations. The source code is written in Scala, and provides a number of APIs and bindings in various programming languages. The proposal text is copied to the bottom of this email. 
Re: [PROPOSAL] Apache Spark for the Incubator
Hi Henry, I've conferred with the incoming Spark community and we are very happy to have you as a mentor on the project. Please feel free to add yourself to the wiki. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Henry Saputra henry.sapu...@gmail.com Reply-To: general@incubator.apache.org general@incubator.apache.org Date: Friday, May 31, 2013 12:38 PM To: general@incubator.apache.org general@incubator.apache.org Subject: Re: [PROPOSAL] Apache Spark for the Incubator Wow! I have been using Shark, which runs on top of Spark, with Mesos in our prototype for API analytics for a while and would LOVE to help as a mentor and initial contributor. - Henry On Fri, May 31, 2013 at 11:03 AM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal The work originates from the Berkeley AMPLab and through a number of industry participants, and other institutions. Spark is a framework for large-scale data analysis on clusters, with a particular focus on low latency operations. The source code is written in Scala, and provides a number of APIs and bindings in various programming languages. The proposal text is copied to the bottom of this email. I'm going to leave this thread open for the next week for discussion. Once it's died down, I'll call an official VOTE. Suresh, Ross G. -- heads up -- this project may be of interest to you both and would welcome you guys as additional mentors. We currently have 3 mentors committed to the project, but would love to have more.
Re: [PROPOSAL] Apache Spark for the Incubator
Hi Roman, I've conferred with the incoming Spark community and we are happy to have you as a mentor for the project. Feel free to add yourself to the wiki proposal. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Roman Shaposhnik r...@apache.org Reply-To: general@incubator.apache.org general@incubator.apache.org Date: Friday, May 31, 2013 3:25 PM To: general@incubator.apache.org general@incubator.apache.org Subject: Re: [PROPOSAL] Apache Spark for the Incubator Extremely enthusiastic +1!!! If you ever need help with mentorship -- please let me know. Also, looking forward to seeing this in Bigtop! Thanks, Roman. On Fri, May 31, 2013 at 11:03 AM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal The work originates from the Berkeley AMPLab and through a number of industry participants, and other institutions. Spark is a framework for large-scale data analysis on clusters, with a particular focus on low latency operations. The source code is written in Scala, and provides a number of APIs and bindings in various programming languages. The proposal text is copied to the bottom of this email. I'm going to leave this thread open for the next week for discussion. Once it's died down, I'll call an official VOTE. Suresh, Ross G. -- heads up -- this project may be of interest to you both and would welcome you guys as additional mentors. We currently have 3 mentors committed to the project, but would love to have more. 
Re: [PROPOSAL] Apache Spark for the Incubator
Dear Konstantin, Thanks! The incoming Spark project is excited about the relationship with Bigtop that could happen here. As for new committers, after conferring with the Spark project members, we would like to adopt a simple policy of having all new committers not add themselves to the wiki as of yet, but simply join the project mailing lists when they are created, and then from there, contribute. I and other mentors, and the Spark community are committed to being inclusive, so hopefully won't take too long for anybody to become a PPMC member/committer on the project after some demonstrated contributions. Thanks for your interest and again for your kind words. Cheers! Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Konstantin Boudnik c...@apache.org Reply-To: general@incubator.apache.org general@incubator.apache.org Date: Friday, May 31, 2013 12:29 PM To: general@incubator.apache.org general@incubator.apache.org Subject: Re: [PROPOSAL] Apache Spark for the Incubator Great news! Definitely +1 (non-binding, I guess) on adding Spark to the family of ASF project! I also express the interest to contribute to the project and move it forward to the graduation! Bigtop has been packaging and providing Spark as a part of Hadoop 1.x software stacks for some time; and hopefully would be able to offer it as a part of Hadoop 2.x line in the coming days. Dr. 
Konstantin Boudnik Hadoop committer BigTop PMC On Fri, May 31, 2013 at 06:03PM, Mattmann, Chris A (398J) wrote: Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal The work originates from the Berkeley AMPLab and through a number of industry participants, and other institutions. Spark is a framework for large-scale data analysis on clusters, with a particular focus on low latency operations. The source code is written in Scala, and provides a number of APIs and bindings in various programming languages. The proposal text is copied to the bottom of this email. I'm going to leave this thread open for the next week for discussion. Once it's died down, I'll call an official VOTE. Suresh, Ross G. -- heads up -- this project may be of interest to you both and would welcome you guys as additional mentors. We currently have 3 mentors committed to the project, but would love to have more. People interested in contributing should declare their interest here on the general@incubator thread and those potential contributors will be discussed by the incoming Spark community. Questions -- let's hear em'! :) Cheers, Chris (Champion, incoming Apache Spark)
Re: [PROPOSAL] Apache Spark for the Incubator
Thanks Chris, looking forward to this project being part of the ASF family. I have added my name as a mentor in the proposal. - Henry On Mon, Jun 3, 2013 at 6:41 PM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Henry, I've conferred with the incoming Spark community and we are very happy to have you as a mentor on the project. Please feel free to add yourself to the wiki. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Henry Saputra henry.sapu...@gmail.com Reply-To: general@incubator.apache.org general@incubator.apache.org Date: Friday, May 31, 2013 12:38 PM To: general@incubator.apache.org general@incubator.apache.org Subject: Re: [PROPOSAL] Apache Spark for the Incubator Wow! I have been using Shark, which runs on top of Spark, with Mesos in our prototype for API analytics for a while and would LOVE to help as a mentor and initial contributor. - Henry On Fri, May 31, 2013 at 11:03 AM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal The work originates from the Berkeley AMPLab and through a number of industry participants, and other institutions. Spark is a framework for large-scale data analysis on clusters, with a particular focus on low latency operations. The source code is written in Scala, and provides a number of APIs and bindings in various programming languages. The proposal text is copied to the bottom of this email. I'm going to leave this thread open for the next week for discussion. Once it's died down, I'll call an official VOTE. Suresh, Ross G.
-- heads up -- this project may be of interest to you both and would welcome you guys as additional mentors. We currently have 3 mentors committed to the project, but would love to have more.
Re: [PROPOSAL] Apache Spark for the Incubator
On May 31, 2013, at 2:03 PM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal The work originates from the Berkeley AMPLab and through a number of industry participants, and other institutions. Spark is a framework for large-scale data analysis on clusters, with a particular focus on low latency operations. The source code is written in Scala, and provides a number of APIs and bindings in various programming languages. The proposal text is copied to the bottom of this email. I'm going to leave this thread open for the next week for discussion. Once it's died down, I'll call an official VOTE. Suresh, Ross G. -- heads up -- this project may be of interest to you both and would welcome you guys as additional mentors. We currently have 3 mentors committed to the project, but would love to have more. Thanks Chris for the alert. Great proposal indeed, if the podling needs help I am in. Suresh People interested in contributing should declare their interest here on the general@incubator thread and those potential contributors will be discussed by the incoming Spark community. Questions -- let's hear em'! :) Cheers, Chris (Champion, incoming Apache Spark)
[PROPOSAL] Apache Spark for the Incubator
Hi Folks,

I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project:

https://wiki.apache.org/incubator/SparkProposal

The work originates from the Berkeley AMPLab, together with a number of industry participants and other institutions. Spark is a framework for large-scale data analysis on clusters, with a particular focus on low latency operations. The source code is written in Scala, and the project provides a number of APIs and bindings in various programming languages. The proposal text is copied to the bottom of this email.

I'm going to leave this thread open for the next week for discussion. Once it's died down, I'll call an official VOTE.

Suresh, Ross G. -- heads up -- this project may be of interest to you both and we would welcome you guys as additional mentors. We currently have 3 mentors committed to the project, but would love to have more.

People interested in contributing should declare their interest here on the general@incubator thread and those potential contributors will be discussed by the incoming Spark community.

Questions -- let's hear 'em! :)

Cheers,
Chris (Champion, incoming Apache Spark)

=== Abstract ===

Spark is an open source system for large-scale data analysis on clusters.

=== Proposal ===

Spark is an open source system for fast and flexible large-scale data analysis. Spark provides a general purpose runtime that supports low-latency execution in several forms. These include interactive exploration of very large datasets, near real-time stream processing, and ad-hoc SQL analytics (through higher layer extensions). Spark interfaces with HDFS, HBase, Cassandra and several other storage layers, and exposes APIs in Scala, Java and Python.

Background

Spark started as a U.C. Berkeley research project, designed to efficiently run machine learning algorithms on large datasets. Over time, it has evolved into a general computing engine as outlined above. Spark's developer community has also grown to include additional institutions, such as universities, research labs, and corporations. Funding has been provided by various institutions including the U.S. National Science Foundation, DARPA, and a number of industry sponsors. See: https://amplab.cs.berkeley.edu/sponsors/ for full details.

=== Rationale ===

As the number of contributors to Spark has grown, we have sought a long-term home for the project, and we believe the Apache foundation would be a great fit. Spark is a natural fit for the Apache foundation: Spark already interoperates with several existing Apache projects (HDFS, HBase, Hive, Cassandra, Avro and Flume to name a few). The Spark team is familiar with the Apache process and subscribes to the Apache mission - the team includes multiple Apache committers already. Finally, joining Apache will help coordinate the development effort of the growing number of organizations which contribute to Spark.

== Initial Goals ==

The initial goals will most likely be to move the existing codebase to Apache and integrate with the Apache development process. Furthermore, we plan for incremental development, and releases along with the Apache guidelines.

=== Current Status ===

== Meritocracy ==

The Spark project already operates on meritocratic principles. Today, Spark has several developers and has accepted multiple major patches from outside of U.C. Berkeley. While this process has remained mostly informal (we do not have an official committer list), an implicit organization exists in which individuals who contribute major components act as maintainers for those modules. If accepted, the Spark project would include several of these participants as committers from the onset. We will work to identify all committers and PPMC members for the project and to operate under the ASF meritocratic principles.

=== Community ===

Acceptance into the Apache foundation would bolster the already strong user and developer community around Spark. That community includes dozens of contributors from several institutions, a meetup group with several hundred members, and an active mailing list composed of hundreds of users.

Core Developers

The core developers of our project are listed in our contributors and initial PPMC below. Though many exist at UC Berkeley, there is a representative cross sampling of other organizations including Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Tagged and Webtrends.

=== Alignment ===

Our proposed effort aligns with several ongoing BIGDATA and U.S. National priority funding interests including the NSF and its Expeditions program, and the DARPA XDATA project. Our industry partners and collaborators are well aligned with our code base. There are also a number of related Apache projects and dependencies, that will be mentioned in the Relationships with Other Apache products section.

== Known Risks ==

=== Orphaned Products ===

Given the current level of investment in Spark - the risk of the project being abandoned is minimal.
Re: [PROPOSAL] Apache Spark for the Incubator
Guys, I've added: Thomas Dudziak as a mentor to the proposal at his request. He is a member of the ASF and should be granted IPMC access soon. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Mattmann, jpluser chris.a.mattm...@jpl.nasa.gov Reply-To: general@incubator.apache.org general@incubator.apache.org Date: Friday, May 31, 2013 11:03 AM To: general@incubator.apache.org general@incubator.apache.org Subject: [PROPOSAL] Apache Spark for the Incubator Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal The work originates from the Berkeley AMPLab and through a number of industry participants, and other institutions. Spark is a framework for large-scale data analysis on clusters, with a particular focus on low latency operations. The source code is written in Scala, and provides a number of APIs and bindings in various programming languages. The proposal text is copied to the bottom of this email. I'm going to leave this thread open for the next week for discussion. Once it's died down, I'll call an official VOTE. Suresh, Ross G. -- heads up -- this project may be of interest to you both and would welcome you guys as additional mentors. We currently have 3 mentors committed to the project, but would love to have more. People interested in contributing should declare their interest here on the general@incubator thread and those potential contributors will be discussed by the incoming Spark community. Questions -- let's hear em'! 
:) Cheers, Chris (Champion, incoming Apache Spark)
RE: [PROPOSAL] Apache Spark for the Incubator
+1 (non-binding)

This seems like a really interesting project.

Q- Is Spark just a framework/API or does it also have some tools implemented for data analytics?

--Pei

-Original Message-
From: Mattmann, Chris A (398J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Friday, May 31, 2013 2:04 PM
To: general@incubator.apache.org
Subject: [PROPOSAL] Apache Spark for the Incubator

Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal
Re: [PROPOSAL] Apache Spark for the Incubator
Great news! Definitely +1 (non-binding, I guess) on adding Spark to the family of ASF projects! I would also like to contribute to the project and help move it forward to graduation!

Bigtop has been packaging and providing Spark as part of Hadoop 1.x software stacks for some time, and hopefully will be able to offer it as part of the Hadoop 2.x line in the coming days.

Dr. Konstantin Boudnik
Hadoop committer
BigTop PMC

On Fri, May 31, 2013 at 06:03PM, Mattmann, Chris A (398J) wrote:
Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal
Re: [PROPOSAL] Apache Spark for the Incubator
Wow! I have been using Shark, which runs on top of Spark, with Mesos in our prototype for API analytics for a while and would LOVE to help as a mentor and initial contributor.

- Henry

On Fri, May 31, 2013 at 11:03 AM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote:
Hi Folks, I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project: https://wiki.apache.org/incubator/SparkProposal
[PROPOSAL] Apache Spark for the Incubator
I believe it is more of a framework, but you can take a look at Shark, which uses Spark to do data warehousing and supports Hive queries (http://shark.cs.berkeley.edu).

- Henry

On Friday, May 31, 2013, Chen, Pei wrote:
+1 (non-binding) This seems like a really interesting project. Q- Is Spark just a framework/API or does it also have some tools implemented for data analytics? --Pei
Re: [PROPOSAL] Apache Spark for the Incubator
Spark is an execution framework, but it also provides some high-level APIs which make it much easier to do data analytics. For example, to do grep-like queries:

    // sparkContext is an existing SparkContext; the HDFS path is elided as in the original
    val docs = sparkContext.textFile("hdfs://...")
    docs.filter(doc => doc.contains("Berkeley")).count()

Another example to do word count (using the Scala API):

    val docs = sparkContext.textFile("hdfs://...")
    val counts = docs.flatMap(line => line.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

The high-level APIs are similar to a lot of the relational operators, including aggregations, group bys, joins, etc.

Shark uses Spark as the execution engine but provides a Hive-compatible SQL interface. This proposal, however, is only about moving Spark to the ASF Incubator, not Shark.

--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org

On Fri, May 31, 2013 at 1:03 PM, Henry Saputra henry.sapu...@gmail.com wrote:
I believe it is more of a framework but you can take a look at Shark which using Spark to do data warehousing that support hive query ( http://shark.cs.berkeley.edu) - Henry

On Friday, May 31, 2013, Chen, Pei wrote:
+1 (non-binding) This seems like a really interesting project. Q- Is Spark just a framework/API or does it also have some tools implemented for data analytics? --Pei
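
For anyone who wants to try the shape of the two queries from Reynold's reply without a cluster, here is a minimal local sketch. It assumes plain Scala collections stand in for the distributed dataset that sparkContext.textFile returns. WordCountSketch, wordCount and the sample lines are hypothetical names made up for illustration, and groupBy followed by a sum is only a local substitute for Spark's reduceByKey, not the Spark API itself.

```scala
// Local, Spark-free sketch of the grep and word-count examples.
// All names here are illustrative assumptions, not part of Spark.
object WordCountSketch {

  // Counts how often each whitespace-separated word occurs across all lines.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\s+"))  // one element per token
      .filter(_.nonEmpty)                   // split can yield an empty token for leading whitespace
      .map(word => (word, 1))               // pair each word with a count of 1
      .groupBy(_._1)                        // local stand-in for the grouping done by reduceByKey
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input in place of an HDFS file
    val docs = Seq("Spark at Berkeley", "Berkeley AMPLab", "Spark on Mesos")

    // grep-like query: number of lines mentioning "Berkeley"
    println(docs.count(doc => doc.contains("Berkeley")))

    // word count, printed in alphabetical order for stable output
    wordCount(docs).toSeq.sortBy(_._1).foreach(println)
  }
}
```

The chain of flatMap, map and a per-key sum mirrors the operators in the original snippet, so the same logic should carry over to the Spark API once a SparkContext and a real input path are available.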
Re: [PROPOSAL] Apache Spark for the Incubator
Extremely enthusiastic +1!!!

If you ever need help with mentorship -- please let me know.

Also, looking forward to seeing this in Bigtop!

Thanks,
Roman.

On Fri, May 31, 2013 at 11:03 AM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote:

Hi Folks,

I'm pleased to bring you a proposal to the Apache Incubator for the Apache Spark project:

https://wiki.apache.org/incubator/SparkProposal

The work originates from the Berkeley AMPLab, a number of industry participants, and other institutions. Spark is a framework for large-scale data analysis on clusters, with a particular focus on low-latency operations. The source code is written in Scala, and provides a number of APIs and bindings in various programming languages.

The proposal text is copied to the bottom of this email. I'm going to leave this thread open for the next week for discussion. Once it's died down, I'll call an official VOTE.

Suresh, Ross G. -- heads up -- this project may be of interest to you both, and we would welcome you guys as additional mentors. We currently have 3 mentors committed to the project, but would love to have more.

People interested in contributing should declare their interest here on the general@incubator thread, and those potential contributors will be discussed by the incoming Spark community.

Questions -- let's hear 'em! :)

Cheers,
Chris
(Champion, incoming Apache Spark)

=== Abstract ===

Spark is an open source system for large-scale data analysis on clusters.

=== Proposal ===

Spark is an open source system for fast and flexible large-scale data analysis. Spark provides a general-purpose runtime that supports low-latency execution in several forms. These include interactive exploration of very large datasets, near real-time stream processing, and ad-hoc SQL analytics (through higher-layer extensions). Spark interfaces with HDFS, HBase, Cassandra and several other storage layers, and exposes APIs in Scala, Java and Python.

Background

Spark started as a U.C. Berkeley research project, designed to efficiently run machine learning algorithms on large datasets. Over time, it has evolved into a general computing engine as outlined above. Spark's developer community has also grown to include additional institutions, such as universities, research labs, and corporations. Funding has been provided by various institutions including the U.S. National Science Foundation, DARPA, and a number of industry sponsors. See https://amplab.cs.berkeley.edu/sponsors/ for full details.

=== Rationale ===

As the number of contributors to Spark has grown, we have sought a long-term home for the project, and we believe the Apache foundation would be a great fit. Spark is a natural fit for the Apache foundation: Spark already interoperates with several existing Apache projects (HDFS, HBase, Hive, Cassandra, Avro and Flume, to name a few). The Spark team is familiar with the Apache process and subscribes to the Apache mission; the team already includes multiple Apache committers. Finally, joining Apache will help coordinate the development effort of the growing number of organizations that contribute to Spark.

== Initial Goals ==

The initial goals will most likely be to move the existing codebase to Apache and integrate with the Apache development process. Furthermore, we plan for incremental development and releases in line with the Apache guidelines.

=== Current Status ===

== Meritocracy ==

The Spark project already operates on meritocratic principles. Today, Spark has several developers and has accepted multiple major patches from outside of U.C. Berkeley. While this process has remained mostly informal (we do not have an official committer list), an implicit organization exists in which individuals who contribute major components act as maintainers for those modules. If accepted, the Spark project would include several of these participants as committers from the outset. We will work to identify all committers and PPMC members for the project and to operate under the ASF meritocratic principles.

=== Community ===

Acceptance into the Apache foundation would bolster the already strong user and developer community around Spark. That community includes dozens of contributors from several institutions, a meetup group with several hundred members, and an active mailing list composed of hundreds of users.

Core Developers

The core developers of our project are listed in our contributors and initial PPMC below. Though many are at UC Berkeley, there is a representative cross-section of other organizations including Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Tagged and Webtrends.

=== Alignment ===

Our proposed effort aligns with several ongoing BIGDATA and U.S. National priority funding interests including the NSF and its Expeditions program, and the DARPA XDATA project. Our industry partners and