Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-28 Thread Konstantin Boudnik
That makes sense. Thanks for the update - I am still catching up on my emails
backed up because of the Hadoop summit.

Cos

On Tue, Jun 04, 2013 at 01:44AM, Mattmann, Chris A (398J) wrote:
 Dear Konstantin,
 
 Thanks! The incoming Spark project is excited about the relationship
 with Bigtop that could happen here.
 
 As for new committers, after conferring with the Spark project
 members, we would like to adopt a simple policy: new committers
 should not add themselves to the wiki as of yet, but simply join
 the project mailing lists when they are created and contribute
 from there. The other mentors, the Spark community, and I are
 committed to being inclusive, so hopefully it won't take too long
 for anybody to become a PPMC member/committer on the project after
 some demonstrated contributions.
 
 Thanks for your interest and again for your kind words.
 
 Cheers!
 
 Chris
 
 
 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++
 
 
 
 
 
 
 -Original Message-
 From: Konstantin Boudnik c...@apache.org
 Reply-To: general@incubator.apache.org general@incubator.apache.org
 Date: Friday, May 31, 2013 12:29 PM
 To: general@incubator.apache.org general@incubator.apache.org
 Subject: Re: [PROPOSAL] Apache Spark for the Incubator
 
 Great news!
 
 Definitely +1 (non-binding, I guess) on adding Spark to the family
 of ASF projects!
 
 I would also like to express my interest in contributing to the project
 and moving it forward to graduation! Bigtop has been packaging and
 providing Spark as a part of Hadoop 1.x software stacks for some time,
 and hopefully will be able to offer it as a part of the Hadoop 2.x line
 in the coming days.
 
 Dr. Konstantin Boudnik
   Hadoop committer
   BigTop PMC
 
 On Fri, May 31, 2013 at 06:03PM, Mattmann, Chris A (398J) wrote:
  Hi Folks,
  
  I'm pleased to bring you a proposal to the Apache Incubator for the
  Apache Spark project: https://wiki.apache.org/incubator/SparkProposal
  
  The work originates from the Berkeley AMPLab and from a number of
  industry participants and other institutions. Spark is a framework for
  large-scale data analysis on clusters, with a particular focus on
  low-latency operations. The source code is written in Scala, and the
  project provides a number of APIs and bindings in various programming
  languages.
  
  The proposal text is copied to the bottom of this email. I'm going to
  leave this thread open for the next week for discussion. Once it's died
  down, I'll call an official VOTE.
  
  Suresh, Ross G. -- heads up -- this project may be of interest to you
  both, and we would welcome you guys as additional mentors. We currently
  have 3 mentors committed to the project, but would love to have more.
  People interested in contributing should declare their interest here on
  the general@incubator thread, and those potential contributors will be
  discussed by the incoming Spark community.
  
  Questions -- let's hear 'em! :)
  
  Cheers,
  Chris
  (Champion, incoming Apache Spark)
  
  === Abstract ===
  Spark is an open source system for large-scale data analysis on
  clusters.
  
  === Proposal ===
  Spark is an open source system for fast and flexible large-scale data
  analysis. Spark provides a general purpose runtime that supports
  low-latency execution in several forms. These include interactive
  exploration of very large datasets, near real-time stream processing,
  and ad-hoc SQL analytics (through higher layer extensions). Spark
  interfaces with HDFS, HBase, Cassandra and several other storage layers,
  and exposes APIs in Scala, Java and Python.
  
  === Background ===
  Spark started as a U.C. Berkeley research project, designed to
  efficiently run machine learning algorithms on large datasets. Over time,
  it has evolved into a general computing engine as outlined above. Spark's
  developer community has also grown to include additional institutions,
  such as universities, research labs, and corporations. Funding has been
  provided by various institutions including the U.S. National Science
  Foundation, DARPA, and a number of industry sponsors. See:
  https://amplab.cs.berkeley.edu/sponsors/ for full details.
  
  === Rationale ===
  As the number of contributors to Spark has grown, we have sought a
  long-term home for the project, and we believe the Apache foundation
  would be a great fit. Spark is a natural fit for the Apache foundation:
  Spark already interoperates with several existing Apache projects (HDFS,
  HBase, Hive

Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-26 Thread Matt Franklin
Yes. d...@spark.incubator.apache.org

On Wednesday, June 26, 2013, karthik tunga wrote:

 Hi,

 Is the mailing list set up?

 Cheers,
 Karthik


 On 20 June 2013 02:38, Matei Zaharia ma...@eecs.berkeley.edu wrote:

  Thanks Chris! We'll get started on all the required steps.
 
  Matei
 
  On Jun 20, 2013, at 4:35 AM, Mattmann, Chris A (398J) 
  chris.a.mattm...@jpl.nasa.gov wrote:
 
   Hi Folks,
  
   This VOTE has passed with the following tallies:
  
   +1
   Chris Mattmann*
   Konstantin Boudnik
   Henry Saputra*
   Reynold Xin
   Pei Chen
   Roman Shaposhnik*
   Suresh Marru*
   Scott Deboy
   Ted Dunning*
   Hitesh Shah
   Paul Ramirez*
   Ralph Goers*
   Alan Cabrera*
   Thilina Gunarathne
   Marcel Offermans*
   Alex Karasulu*
   Chris Douglas*
   Andrew Hart*
   Deepal Jayasinghe
   Ashish
   Joe Brockmeier*
   Mohammad Nour El-Din*
   Arun C Murthy*
   Tim Williams*
   Arvind Prabhakar*
   Matt Franklin*
   Matei Zaharia
   Andy Konwinski
  
   +0.9
  
  
   Marvin Humphrey
  
   * -indicates IPMC
  
  
   I'll go ahead and get the JIRA tickets filed for email/issue
   tracking/Git, and then work with the community to get them movin' on
   over. Thanks for VOTE'ing!
  
   Cheers,
   Chris
  
  
   ++
   Chris Mattmann, Ph.D.
   Senior Computer Scientist
   NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
   Office: 171-266B, Mailstop: 171-246
   Email: chris.a.mattm...@nasa.gov
   WWW:  http://sunset.usc.edu/~mattmann/
   ++
   Adjunct Assistant Professor, Computer Science Department
   University of Southern California, Los Angeles, CA 90089 USA
   ++
  
  
  
  
  
  
   -Original Message-
   From: Mattmann, jpluser chris.a.mattm...@jpl.nasa.gov
   Reply-To: general@incubator.apache.org general@incubator.apache.org
 
   Date: Friday, June 7, 2013 10:34 PM
   To: general@incubator.apache.org general@incubator.apache.org
   Subject: [VOTE] Apache Spark for the Incubator
  
   Hi Folks,
  
   OK, discussion has died down; time to VOTE to accept Spark into the
   Apache Incubator. I'll let the VOTE run for at least a week.
  
   So far I've heard +1s from the following folks, so no need for them
   to VOTE again unless they want to change their VOTE:
  
   +1
  
   Chris Mattmann*
   Konstantin Boudnik
   Henry Saputra*
   Reynold Xin
   Pei Chen
   Roman Shaposhnik*
   Suresh Marru*
  
   * -indicates IPMC
  
   [ ] +1 Accept Spark into the Apache Incubator.
   [ ] +0 Don't care.
   [ ] -1 Don't accept Spark into the Apache Incubator because..
  
   Proposal text is below.
  
   === Abstract ===
   Spark is an open source system for large-scale data analysis on
   clusters.


Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-10 Thread Mohammad Nour El-Din
Hi Marvin


On Sun, Jun 9, 2013 at 5:15 AM, Marvin Humphrey mar...@rectangular.com wrote:

 On Sat, Jun 8, 2013 at 4:55 PM, Mattmann, Chris A (398J)
 chris.a.mattm...@jpl.nasa.gov wrote:
  Note: we discussed adding Roman before the VOTE and it was
  fine with the incoming Spark community, so Roman is now on
  the wiki for the proposal.
 
  In case this changes anyone's VOTE on the VOTE thread, feel
  free to speak up or change your VOTE. Otherwise, nothing else
  to see here folks.

 +1 for the original proposal.

 +0.9 for the new proposal.

 Yes, I expect you to tally my vote that way.  :)

 Next time, please be more careful when starting a VOTE and please don't
 change the proposal text in the middle of a vote.  Personnel issues in
 proposals have caused significant problems in the past.  That's unlikely
 to happen in this case, but I want to register my protest now because it
 might save us hundreds or thousands of emails in the future.


This is *not* a [VOTE] yet; this is a [PROPOSAL], in which case the proposal
can be updated and enhanced if required. So allow me to disagree with what
you said about *not making changes to the proposal in this phase*.



 Good luck, Spark!

 Marvin Humphrey

 -
 To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
 For additional commands, e-mail: general-h...@incubator.apache.org




-- 
Thanks
- Mohammad Nour

Life is like riding a bicycle. To keep your balance you must keep moving
- Albert Einstein


Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-08 Thread Mattmann, Chris A (398J)
Note: we discussed adding Roman before the VOTE and it was
fine with the incoming Spark community, so Roman is now on
the wiki for the proposal.

In case this changes anyone's VOTE on the VOTE thread, feel
free to speak up or change your VOTE. Otherwise, nothing else
to see here folks.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Roman Shaposhnik r...@apache.org
Date: Saturday, June 8, 2013 3:03 PM
To: jpluser chris.a.mattm...@jpl.nasa.gov
Subject: Re: [PROPOSAL] Apache Spark for the Incubator

On Mon, Jun 3, 2013 at 6:40 PM, Mattmann, Chris A (398J)
chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Roman, I've conferred with the incoming Spark community and we are
 happy to have you as a mentor for the project.

 Feel free to add yourself to the wiki proposal.

Great news! Done.

Thanks,
Roman.





Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-08 Thread Marvin Humphrey
On Sat, Jun 8, 2013 at 4:55 PM, Mattmann, Chris A (398J)
chris.a.mattm...@jpl.nasa.gov wrote:
 Note: we discussed adding Roman before the VOTE and it was
 fine with the incoming Spark community, so Roman is now on
 the wiki for the proposal.

 In case this changes anyone's VOTE on the VOTE thread, feel
 free to speak up or change your VOTE. Otherwise, nothing else
 to see here folks.

+1 for the original proposal.

+0.9 for the new proposal.

Yes, I expect you to tally my vote that way.  :)

Next time, please be more careful when starting a VOTE and please don't change
the proposal text in the middle of a vote.  Personnel issues in proposals have
caused significant problems in the past.  That's unlikely to happen in this
case, but I want to register my protest now because it might save us hundreds
or thousands of emails in the future.

Good luck, Spark!

Marvin Humphrey




Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-03 Thread Mattmann, Chris A (398J)
Hi Henry,

Thanks for your support! I will leave it up to Matei and
the incoming Spark community to decide if they would like
to add you (or anyone else) to the wiki as a contributor
on the project.

Thanks!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Henry Saputra henry.sapu...@gmail.com
Reply-To: general@incubator.apache.org general@incubator.apache.org
Date: Friday, May 31, 2013 12:38 PM
To: general@incubator.apache.org general@incubator.apache.org
Subject: Re: [PROPOSAL] Apache Spark for the Incubator

Wow! I have been using Shark, which runs on top of Spark, with Mesos in our
prototype for API analytics for a while and would LOVE to help as a mentor
and initial contributor.


- Henry



On Fri, May 31, 2013 at 11:03 AM, Mattmann, Chris A (398J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Folks,

 I'm pleased to bring you a proposal to the Apache Incubator for the
 Apache Spark project: https://wiki.apache.org/incubator/SparkProposal

 The work originates from the Berkeley AMPLab and from a number of
 industry participants and other institutions. Spark is a framework for
 large-scale data analysis on clusters, with a particular focus on
 low-latency operations. The source code is written in Scala, and the
 project provides a number of APIs and bindings in various programming
 languages.

 The proposal text is copied to the bottom of this email. I'm going to
 leave this thread open for the next week for discussion. Once it's died
 down, I'll call an official VOTE.

 Suresh, Ross G. -- heads up -- this project may be of interest to you
 both, and we would welcome you guys as additional mentors. We currently
 have 3 mentors committed to the project, but would love to have more.
 People interested in contributing should declare their interest here on
 the general@incubator thread, and those potential contributors will be
 discussed by the incoming Spark community.

 Questions -- let's hear 'em! :)

 Cheers,
 Chris
 (Champion, incoming Apache Spark)

 === Abstract ===
 Spark is an open source system for large-scale data analysis on
 clusters.

 === Proposal ===
 Spark is an open source system for fast and flexible large-scale data
 analysis. Spark provides a general purpose runtime that supports
 low-latency execution in several forms. These include interactive
 exploration of very large datasets, near real-time stream processing,
 and ad-hoc SQL analytics (through higher layer extensions). Spark
 interfaces with HDFS, HBase, Cassandra and several other storage layers,
 and exposes APIs in Scala, Java and Python.

 === Background ===
 Spark started as a U.C. Berkeley research project, designed to
 efficiently run machine learning algorithms on large datasets. Over time,
 it has evolved into a general computing engine as outlined above. Spark's
 developer community has also grown to include additional institutions,
 such as universities, research labs, and corporations. Funding has been
 provided by various institutions including the U.S. National Science
 Foundation, DARPA, and a number of industry sponsors. See:
 https://amplab.cs.berkeley.edu/sponsors/ for full details.

 === Rationale ===
 As the number of contributors to Spark has grown, we have sought a
 long-term home for the project, and we believe the Apache foundation
 would be a great fit. Spark is a natural fit for the Apache foundation:
 Spark already interoperates with several existing Apache projects (HDFS,
 HBase, Hive, Cassandra, Avro and Flume to name a few). The Spark team is
 familiar with the Apache process and subscribes to the Apache mission -
 the team includes multiple Apache committers already. Finally, joining
 Apache will help coordinate the development effort of the growing number
 of organizations which contribute to Spark.

 == Initial Goals ==
 The initial goals will most likely be to move the existing codebase to
 Apache and integrate with the Apache development process. Furthermore,
 we plan for incremental development and releases in line with the Apache
 guidelines.

 === Current Status ===
 == Meritocracy ==
 The Spark project already operates on meritocratic principles. Today,
 Spark has several developers and has accepted multiple major patches
 from outside of U.C. Berkeley. While this process has remained mostly
 informal (we do not have an official committer list), an implicit
 organization exists in which individuals who contribute major components
 act as maintainers for those modules

Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-03 Thread Mattmann, Chris A (398J)
Thanks for the support, Pei. I hope the questions you had
about frameworks, etc. were answered.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Chen, Pei pei.c...@childrens.harvard.edu
Reply-To: general@incubator.apache.org general@incubator.apache.org
Date: Friday, May 31, 2013 11:45 AM
To: general@incubator.apache.org general@incubator.apache.org
Subject: RE: [PROPOSAL] Apache Spark for the Incubator

+1 (non-binding)
This seems like a really interesting project.
Q- Is Spark just a framework/API or does it also have some tools
implemented for data analytics?
--Pei

 -Original Message-
 From: Mattmann, Chris A (398J) [mailto:chris.a.mattm...@jpl.nasa.gov]
 Sent: Friday, May 31, 2013 2:04 PM
 To: general@incubator.apache.org
 Subject: [PROPOSAL] Apache Spark for the Incubator
 
 Hi Folks,
 
 I'm pleased to bring you a proposal to the Apache Incubator for the
 Apache Spark project: https://wiki.apache.org/incubator/SparkProposal
 
 The work originates from the Berkeley AMPLab and from a number of
 industry participants and other institutions. Spark is a framework for
 large-scale data analysis on clusters, with a particular focus on
 low-latency operations. The source code is written in Scala, and the
 project provides a number of APIs and bindings in various programming
 languages.
 
 The proposal text is copied to the bottom of this email. I'm going to
 leave this thread open for the next week for discussion. Once it's died
 down, I'll call an official VOTE.
 
 Suresh, Ross G. -- heads up -- this project may be of interest to you
 both, and we would welcome you guys as additional mentors. We currently
 have 3 mentors committed to the project, but would love to have more.
 People interested in contributing should declare their interest here on
 the general@incubator thread, and those potential contributors will be
 discussed by the incoming Spark community.
 
 Questions -- let's hear 'em! :)
 
 Cheers,
 Chris
 (Champion, incoming Apache Spark)
 
 === Abstract ===
 Spark is an open source system for large-scale data analysis on
 clusters.
 
 === Proposal ===
 Spark is an open source system for fast and flexible large-scale data
 analysis. Spark provides a general purpose runtime that supports
 low-latency execution in several forms. These include interactive
 exploration of very large datasets, near real-time stream processing,
 and ad-hoc SQL analytics (through higher layer extensions). Spark
 interfaces with HDFS, HBase, Cassandra and several other storage layers,
 and exposes APIs in Scala, Java and Python.
 
 === Background ===
 Spark started as a U.C. Berkeley research project, designed to
 efficiently run machine learning algorithms on large datasets. Over time,
 it has evolved into a general computing engine as outlined above. Spark's
 developer community has also grown to include additional institutions,
 such as universities, research labs, and corporations. Funding has been
 provided by various institutions including the U.S. National Science
 Foundation, DARPA, and a number of industry sponsors. See:
 https://amplab.cs.berkeley.edu/sponsors/ for full details.
 
 === Rationale ===
 As the number of contributors to Spark has grown, we have sought a
 long-term home for the project, and we believe the Apache foundation
 would be a great fit. Spark is a natural fit for the Apache foundation:
 Spark already interoperates with several existing Apache projects (HDFS,
 HBase, Hive, Cassandra, Avro and Flume to name a few). The Spark team is
 familiar with the Apache process and subscribes to the Apache mission -
 the team includes multiple Apache committers already. Finally, joining
 Apache will help coordinate the development effort of the growing number
 of organizations which contribute to Spark.
 
 == Initial Goals ==
 The initial goals will most likely be to move the existing codebase to
 Apache and integrate with the Apache development process. Furthermore,
 we plan for incremental development and releases in line with the Apache
 guidelines.
 
 === Current Status ===
 == Meritocracy ==
 The Spark project already operates on meritocratic principles. Today,
 Spark has several developers and has accepted multiple major patches
 from outside of U.C. Berkeley. While this process has remained mostly
 informal (we do not have an official committer list), an implicit
 organization exists in which individuals who contribute major components
 act as maintainers for those

Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-03 Thread Mattmann, Chris A (398J)
Thanks for the support, Roman!

I will leave it up to the incoming Spark community members to
decide if they need more mentors and we'll be in touch.

Thank you again.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Roman Shaposhnik r...@apache.org
Reply-To: general@incubator.apache.org general@incubator.apache.org
Date: Friday, May 31, 2013 3:25 PM
To: general@incubator.apache.org general@incubator.apache.org
Subject: Re: [PROPOSAL] Apache Spark for the Incubator

Extremely enthusiastic +1!!!

If you ever need help with mentorship -- please let me know.

Also, looking forward to seeing this in Bigtop!

Thanks,
Roman.

On Fri, May 31, 2013 at 11:03 AM, Mattmann, Chris A (398J)
chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Folks,

 I'm pleased to bring you a proposal to the Apache Incubator for the
 Apache Spark project: https://wiki.apache.org/incubator/SparkProposal

 The work originates from the Berkeley AMPLab and from a number of
 industry participants and other institutions. Spark is a framework for
 large-scale data analysis on clusters, with a particular focus on
 low-latency operations. The source code is written in Scala, and the
 project provides a number of APIs and bindings in various programming
 languages.

 The proposal text is copied to the bottom of this email. I'm going to
 leave this thread open for the next week for discussion. Once it's died
 down, I'll call an official VOTE.

 Suresh, Ross G. -- heads up -- this project may be of interest to you
 both, and we would welcome you guys as additional mentors. We currently
 have 3 mentors committed to the project, but would love to have more.
 People interested in contributing should declare their interest here on
 the general@incubator thread, and those potential contributors will be
 discussed by the incoming Spark community.

 Questions -- let's hear 'em! :)

 Cheers,
 Chris
 (Champion, incoming Apache Spark)

 === Abstract ===
 Spark is an open source system for large-scale data analysis on
 clusters.

 === Proposal ===
 Spark is an open source system for fast and flexible large-scale data
 analysis. Spark provides a general purpose runtime that supports
 low-latency execution in several forms. These include interactive
 exploration of very large datasets, near real-time stream processing,
 and ad-hoc SQL analytics (through higher layer extensions). Spark
 interfaces with HDFS, HBase, Cassandra and several other storage layers,
 and exposes APIs in Scala, Java and Python.

 === Background ===
 Spark started as a U.C. Berkeley research project, designed to
 efficiently run machine learning algorithms on large datasets. Over time,
 it has evolved into a general computing engine as outlined above. Spark's
 developer community has also grown to include additional institutions,
 such as universities, research labs, and corporations. Funding has been
 provided by various institutions including the U.S. National Science
 Foundation, DARPA, and a number of industry sponsors. See:
 https://amplab.cs.berkeley.edu/sponsors/ for full details.

 === Rationale ===
 As the number of contributors to Spark has grown, we have sought a
 long-term home for the project, and we believe the Apache foundation
 would be a great fit. Spark is a natural fit for the Apache foundation:
 Spark already interoperates with several existing Apache projects (HDFS,
 HBase, Hive, Cassandra, Avro and Flume to name a few). The Spark team is
 familiar with the Apache process and subscribes to the Apache mission -
 the team includes multiple Apache committers already. Finally, joining
 Apache will help coordinate the development effort of the growing number
 of organizations which contribute to Spark.

 == Initial Goals ==
 The initial goals will most likely be to move the existing codebase to
 Apache and integrate with the Apache development process. Furthermore,
 we plan for incremental development and releases in line with the Apache
 guidelines.

 === Current Status ===
 == Meritocracy ==
 The Spark project already operates on meritocratic principles. Today,
 Spark has several developers and has accepted multiple major patches
 from outside of U.C. Berkeley. While this process has remained mostly
 informal (we do not have an official committer list), an implicit
 organization exists in which individuals who contribute major components
 act as maintainers for those modules. If accepted, the Spark project
 would include several of these participants

Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-03 Thread Mattmann, Chris A (398J)
Thanks, Suresh. After conferring with the incoming Spark community
members, I will add you as a mentor on the wiki.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Suresh Marru sma...@apache.org
Reply-To: general@incubator.apache.org general@incubator.apache.org
Date: Saturday, June 1, 2013 9:12 PM
To: general@incubator.apache.org general@incubator.apache.org
Subject: Re: [PROPOSAL] Apache Spark for the Incubator

On May 31, 2013, at 2:03 PM, Mattmann, Chris A (398J)
chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Folks,
 
 I'm pleased to bring you a proposal to the Apache Incubator for the
 Apache Spark project: https://wiki.apache.org/incubator/SparkProposal
 
 The work originates from the Berkeley AMPLab and from a number of
 industry participants and other institutions. Spark is a framework for
 large-scale data analysis on clusters, with a particular focus on
 low-latency operations. The source code is written in Scala, and the
 project provides a number of APIs and bindings in various programming
 languages.
 
 The proposal text is copied to the bottom of this email. I'm going to
 leave this thread open for the next week for discussion. Once it's died
 down, I'll call an official VOTE.
 
 Suresh, Ross G. -- heads up -- this project may be of interest to you
 both, and we would welcome you guys as additional mentors. We currently
 have 3 mentors committed to the project, but would love to have more.

Thanks, Chris, for the alert. Great proposal indeed; if the podling needs
help, I am in.

Suresh


 People interested in contributing should declare their interest here on
 the general@incubator thread, and those potential contributors will be
 discussed by the incoming Spark community.
 
 Questions -- let's hear 'em! :)
 
 Cheers,
 Chris
 (Champion, incoming Apache Spark)
 
 === Abstract ===
 Spark is an open source system for large-scale data analysis on
 clusters.
 
 === Proposal ===
 Spark is an open source system for fast and flexible large-scale data
 analysis. Spark provides a general purpose runtime that supports
 low-latency execution in several forms. These include interactive
 exploration of very large datasets, near real-time stream processing,
 and ad-hoc SQL analytics (through higher layer extensions). Spark
 interfaces with HDFS, HBase, Cassandra and several other storage layers,
 and exposes APIs in Scala, Java and Python.
 
 === Background ===
 Spark started as a U.C. Berkeley research project, designed to
 efficiently run machine learning algorithms on large datasets. Over time,
 it has evolved into a general computing engine as outlined above. Spark's
 developer community has also grown to include additional institutions,
 such as universities, research labs, and corporations. Funding has been
 provided by various institutions including the U.S. National Science
 Foundation, DARPA, and a number of industry sponsors. See:
 https://amplab.cs.berkeley.edu/sponsors/ for full details.
 
 === Rationale ===
 As the number of contributors to Spark has grown, we have sought a
 long-term home for the project, and we believe the Apache foundation
 would be a great fit. Spark is a natural fit for the Apache foundation:
 Spark already interoperates with several existing Apache projects (HDFS,
 HBase, Hive, Cassandra, Avro and Flume to name a few). The Spark team is
 familiar with the Apache process and subscribes to the Apache mission -
 the team includes multiple Apache committers already. Finally, joining
 Apache will help coordinate the development effort of the growing number
 of organizations which contribute to Spark.
 
 == Initial Goals ==
 The initial goals will most likely be to move the existing codebase to
 Apache and integrate with the Apache development process. Furthermore,
 we plan for incremental development and releases in line with the Apache
 guidelines.
 
 === Current Status ===
 == Meritocracy ==
 The Spark project already operates on meritocratic principles. Today,
 Spark has several developers and has accepted multiple major patches
 from outside of U.C. Berkeley. While this process has remained mostly
 informal (we do not have an official committer list), an implicit
 organization exists in which individuals who contribute major components
 act as maintainers for those modules. If accepted, the Spark project
 would include several of these participants as committers from the
 onset. We will work to identify all committers and PPMC members for the
 project

Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-03 Thread Mattmann, Chris A (398J)
Hi Konstantin,

Thanks for your kind words and expressed interest. I will leave it
to Matei and the incoming Spark community members to comment on adding
you (or anyone else) as a contributor to the wiki. If they are OK with
it, then I very much am too.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Konstantin Boudnik c...@apache.org
Reply-To: general@incubator.apache.org general@incubator.apache.org
Date: Friday, May 31, 2013 12:29 PM
To: general@incubator.apache.org general@incubator.apache.org
Subject: Re: [PROPOSAL] Apache Spark for the Incubator

Great news!

Definitely +1 (non-binding, I guess) on adding Spark to the family
of ASF projects!

I would also like to express my interest in contributing to the project and
helping move it forward to graduation! Bigtop has been packaging and providing
Spark as part of Hadoop 1.x software stacks for some time, and will
hopefully be able to offer it as part of the Hadoop 2.x line in the coming days.

Dr. Konstantin Boudnik
  Hadoop committer
  BigTop PMC

On Fri, May 31, 2013 at 06:03PM, Mattmann, Chris A (398J) wrote:
 Hi Folks,
 
 I'm pleased to bring you a proposal to the Apache Incubator for the
Apache
 Spark project: https://wiki.apache.org/incubator/SparkProposal
 
 The work originates from the Berkeley AMPLab and through a number of
 industry
 participants, and other institutions. Spark is a framework for
large-scale
 data 
 analysis on clusters, with a particular focus on low latency operations.
 The
 source code is written in Scala, and provides a number of APIs and
bindings
 in various programming languages.
 
 The proposal text is copied to the bottom of this email. I'm going to
leave
 this thread open for the next week for discussion. Once it's died down,
 I'll
 call an official VOTE.
 
 Suresh, Ross G. -- heads up -- this project may be of interest to you
both
 and would welcome you guys as additional mentors. We currently have 3
 mentors
 committed to the project, but would love to have more. People
interested in
 contributing should declare their interest here on the general@incubator
 thread
 and those potential contributors will be discussed by the incoming Spark
 community.
 
 Questions -- let's hear 'em! :)
 
 Cheers,
 Chris
 (Champion, incoming Apache Spark)
 
 === Abstract ===
 Spark is an open source system for large-scale data analysis on
clusters.
 
 === Proposal ===
 Spark is an open source system for fast and flexible large-scale data
 analysis. Spark provides a general purpose runtime that supports
 low-latency execution in several forms. These include interactive
 exploration of very large datasets, near real-time stream processing,
and
 ad-hoc SQL analytics (through higher layer extensions). Spark interfaces
 with HDFS, HBase, Cassandra and several other storage layers,
and
 exposes APIs in Scala, Java and Python.
 Background
 Spark started as a U.C. Berkeley research project, designed to efficiently
 run machine learning algorithms on large datasets. Over time, it has
 evolved into a general computing engine as outlined above. Spark's
 developer community has also grown to include additional institutions,
 such as universities, research labs, and corporations. Funding has been
 provided by various institutions including the U.S. National Science
 Foundation, DARPA, and a number of industry sponsors. See:
 https://amplab.cs.berkeley.edu/sponsors/ for full details.
 
 === Rationale ===
 As the number of contributors to Spark has grown, we have sought a
 long-term home for the project, and we believe the Apache foundation
would
 be a great fit. Spark is a natural fit for the Apache foundation: Spark
 already interoperates with several existing Apache projects (HDFS,
HBase,
 Hive, Cassandra, Avro and Flume to name a few). The Spark team is
familiar
 with the Apache process and subscribes to the Apache mission - the
 team includes multiple Apache committers already. Finally, joining
Apache
 will help coordinate the development effort of the growing number of
 organizations which contribute to Spark.
 
 == Initial Goals ==
 The initial goals will most likely be to move the existing codebase to
 Apache and integrate with the Apache development process. Furthermore,
we
 plan for incremental development, and releases along with the Apache
 guidelines.
 
 === Current Status ===
 == Meritocracy ==
 The Spark project already operates on meritocratic principles. Today,
 Spark has several developers and has accepted

Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-03 Thread Mattmann, Chris A (398J)
Hi Henry,

I've conferred with the incoming Spark community and we are
very happy to have you as a mentor on the project.
Please feel free to add yourself to the wiki.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Henry Saputra henry.sapu...@gmail.com
Reply-To: general@incubator.apache.org general@incubator.apache.org
Date: Friday, May 31, 2013 12:38 PM
To: general@incubator.apache.org general@incubator.apache.org
Subject: Re: [PROPOSAL] Apache Spark for the Incubator

Wow! I have been using Shark, which runs on top of Spark, with Mesos in
our prototype for API analytics for a while and would LOVE to help as a
mentor and initial contributor.


- Henry



On Fri, May 31, 2013 at 11:03 AM, Mattmann, Chris A (398J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Folks,

 I'm pleased to bring you a proposal to the Apache Incubator for the
Apache
 Spark project: https://wiki.apache.org/incubator/SparkProposal

 The work originates from the Berkeley AMPLab and through a number of
 industry
 participants, and other institutions. Spark is a framework for
large-scale
 data
 analysis on clusters, with a particular focus on low latency operations.
 The
 source code is written in Scala, and provides a number of APIs and
bindings
 in various programming languages.

 The proposal text is copied to the bottom of this email. I'm going to
leave
 this thread open for the next week for discussion. Once it's died down,
 I'll
 call an official VOTE.

 Suresh, Ross G. -- heads up -- this project may be of interest to you
both
 and would welcome you guys as additional mentors. We currently have 3
 mentors
 committed to the project, but would love to have more. People
interested in
 contributing should declare their interest here on the general@incubator
 thread
 and those potential contributors will be discussed by the incoming Spark
 community.

 Questions -- let's hear 'em! :)

 Cheers,
 Chris
 (Champion, incoming Apache Spark)

 === Abstract ===
 Spark is an open source system for large-scale data analysis on
clusters.

 === Proposal ===
 Spark is an open source system for fast and flexible large-scale data
 analysis. Spark provides a general purpose runtime that supports
 low-latency execution in several forms. These include interactive
 exploration of very large datasets, near real-time stream processing,
and
 ad-hoc SQL analytics (through higher layer extensions). Spark interfaces
 with HDFS, HBase, Cassandra and several other storage layers,
and
 exposes APIs in Scala, Java and Python.
 Background
 Spark started as a U.C. Berkeley research project, designed to efficiently
 run machine learning algorithms on large datasets. Over time, it has
 evolved into a general computing engine as outlined above. Spark's
 developer community has also grown to include additional institutions,
 such as universities, research labs, and corporations. Funding has been
 provided by various institutions including the U.S. National Science
 Foundation, DARPA, and a number of industry sponsors. See:
 https://amplab.cs.berkeley.edu/sponsors/ for full details.

 === Rationale ===
 As the number of contributors to Spark has grown, we have sought a
 long-term home for the project, and we believe the Apache foundation
would
 be a great fit. Spark is a natural fit for the Apache foundation: Spark
 already interoperates with several existing Apache projects (HDFS,
HBase,
 Hive, Cassandra, Avro and Flume to name a few). The Spark team is
familiar
 with the Apache process and subscribes to the Apache mission - the
 team includes multiple Apache committers already. Finally, joining
Apache
 will help coordinate the development effort of the growing number of
 organizations which contribute to Spark.

 == Initial Goals ==
 The initial goals will most likely be to move the existing codebase to
 Apache and integrate with the Apache development process. Furthermore,
we
 plan for incremental development, and releases along with the Apache
 guidelines.

 === Current Status ===
 == Meritocracy ==
 The Spark project already operates on meritocratic principles. Today,
 Spark has several developers and has accepted multiple major patches
from
 outside of U.C. Berkeley. While this process has remained mostly
informal
 (we do not have an official committer list), an implicit organization
 exists in which individuals who contribute major components act as
 maintainers for those modules. If accepted, the Spark project would
 include

Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-03 Thread Mattmann, Chris A (398J)
Hi Roman, I've conferred with the incoming Spark community and we are
happy to have you
as a mentor for the project.

Feel free to add yourself to the wiki proposal.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Roman Shaposhnik r...@apache.org
Reply-To: general@incubator.apache.org general@incubator.apache.org
Date: Friday, May 31, 2013 3:25 PM
To: general@incubator.apache.org general@incubator.apache.org
Subject: Re: [PROPOSAL] Apache Spark for the Incubator

Extremely enthusiastic +1!!!

If you ever need help with mentorship -- please let me know.

Also, looking forward to seeing this in Bigtop!

Thanks,
Roman.

On Fri, May 31, 2013 at 11:03 AM, Mattmann, Chris A (398J)
chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Folks,

 I'm pleased to bring you a proposal to the Apache Incubator for the
Apache
 Spark project: https://wiki.apache.org/incubator/SparkProposal

 The work originates from the Berkeley AMPLab and through a number of
 industry
 participants, and other institutions. Spark is a framework for
large-scale
 data
 analysis on clusters, with a particular focus on low latency operations.
 The
 source code is written in Scala, and provides a number of APIs and
bindings
 in various programming languages.

 The proposal text is copied to the bottom of this email. I'm going to
leave
 this thread open for the next week for discussion. Once it's died down,
 I'll
 call an official VOTE.

 Suresh, Ross G. -- heads up -- this project may be of interest to you
both
 and would welcome you guys as additional mentors. We currently have 3
 mentors
 committed to the project, but would love to have more. People
interested in
 contributing should declare their interest here on the general@incubator
 thread
 and those potential contributors will be discussed by the incoming Spark
 community.

 Questions -- let's hear 'em! :)

 Cheers,
 Chris
 (Champion, incoming Apache Spark)

 === Abstract ===
 Spark is an open source system for large-scale data analysis on
clusters.

 === Proposal ===
 Spark is an open source system for fast and flexible large-scale data
 analysis. Spark provides a general purpose runtime that supports
 low-latency execution in several forms. These include interactive
 exploration of very large datasets, near real-time stream processing,
and
 ad-hoc SQL analytics (through higher layer extensions). Spark interfaces
 with HDFS, HBase, Cassandra and several other storage layers,
and
 exposes APIs in Scala, Java and Python.
 Background
 Spark started as a U.C. Berkeley research project, designed to efficiently
 run machine learning algorithms on large datasets. Over time, it has
 evolved into a general computing engine as outlined above. Spark's
 developer community has also grown to include additional institutions,
 such as universities, research labs, and corporations. Funding has been
 provided by various institutions including the U.S. National Science
 Foundation, DARPA, and a number of industry sponsors. See:
 https://amplab.cs.berkeley.edu/sponsors/ for full details.

 === Rationale ===
 As the number of contributors to Spark has grown, we have sought a
 long-term home for the project, and we believe the Apache foundation
would
 be a great fit. Spark is a natural fit for the Apache foundation: Spark
 already interoperates with several existing Apache projects (HDFS,
HBase,
 Hive, Cassandra, Avro and Flume to name a few). The Spark team is
familiar
 with the Apache process and subscribes to the Apache mission - the
 team includes multiple Apache committers already. Finally, joining
Apache
 will help coordinate the development effort of the growing number of
 organizations which contribute to Spark.

 == Initial Goals ==
 The initial goals will most likely be to move the existing codebase to
 Apache and integrate with the Apache development process. Furthermore,
we
 plan for incremental development, and releases along with the Apache
 guidelines.

 === Current Status ===
 == Meritocracy ==
 The Spark project already operates on meritocratic principles. Today,
 Spark has several developers and has accepted multiple major patches
from
 outside of U.C. Berkeley. While this process has remained mostly
informal
 (we do not have an official committer list), an implicit organization
 exists in which individuals who contribute major components act as
 maintainers for those modules. If accepted, the Spark project would
 include several of these participants

Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-03 Thread Mattmann, Chris A (398J)
Dear Konstantin,

Thanks! The incoming Spark project is excited about the relationship
with Bigtop that could happen here.

As for new committers, after conferring with the Spark project
members, we would like to adopt a simple policy of having all new
committers not add themselves to the wiki as of yet, but simply
join the project mailing lists when they are created, and then from
there, contribute. I and other mentors, and the Spark community are
committed to being inclusive, so hopefully it won't take too long for
anybody to become a PPMC member/committer on the project after some
demonstrated contributions.

Thanks for your interest and again for your kind words.

Cheers!

Chris


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Konstantin Boudnik c...@apache.org
Reply-To: general@incubator.apache.org general@incubator.apache.org
Date: Friday, May 31, 2013 12:29 PM
To: general@incubator.apache.org general@incubator.apache.org
Subject: Re: [PROPOSAL] Apache Spark for the Incubator

Great news!

Definitely +1 (non-binding, I guess) on adding Spark to the family
of ASF projects!

I would also like to express my interest in contributing to the project and
helping move it forward to graduation! Bigtop has been packaging and providing
Spark as part of Hadoop 1.x software stacks for some time, and will
hopefully be able to offer it as part of the Hadoop 2.x line in the coming days.

Dr. Konstantin Boudnik
  Hadoop committer
  BigTop PMC

On Fri, May 31, 2013 at 06:03PM, Mattmann, Chris A (398J) wrote:
 Hi Folks,
 
 I'm pleased to bring you a proposal to the Apache Incubator for the
Apache
 Spark project: https://wiki.apache.org/incubator/SparkProposal
 
 The work originates from the Berkeley AMPLab and through a number of
 industry
 participants, and other institutions. Spark is a framework for
large-scale
 data 
 analysis on clusters, with a particular focus on low latency operations.
 The
 source code is written in Scala, and provides a number of APIs and
bindings
 in various programming languages.
 
 The proposal text is copied to the bottom of this email. I'm going to
leave
 this thread open for the next week for discussion. Once it's died down,
 I'll
 call an official VOTE.
 
 Suresh, Ross G. -- heads up -- this project may be of interest to you
both
 and would welcome you guys as additional mentors. We currently have 3
 mentors
 committed to the project, but would love to have more. People
interested in
 contributing should declare their interest here on the general@incubator
 thread
 and those potential contributors will be discussed by the incoming Spark
 community.
 
 Questions -- let's hear 'em! :)
 
 Cheers,
 Chris
 (Champion, incoming Apache Spark)
 
 === Abstract ===
 Spark is an open source system for large-scale data analysis on
clusters.
 
 === Proposal ===
 Spark is an open source system for fast and flexible large-scale data
 analysis. Spark provides a general purpose runtime that supports
 low-latency execution in several forms. These include interactive
 exploration of very large datasets, near real-time stream processing,
and
 ad-hoc SQL analytics (through higher layer extensions). Spark interfaces
 with HDFS, HBase, Cassandra and several other storage layers,
and
 exposes APIs in Scala, Java and Python.
 Background
 Spark started as a U.C. Berkeley research project, designed to efficiently
 run machine learning algorithms on large datasets. Over time, it has
 evolved into a general computing engine as outlined above. Spark's
 developer community has also grown to include additional institutions,
 such as universities, research labs, and corporations. Funding has been
 provided by various institutions including the U.S. National Science
 Foundation, DARPA, and a number of industry sponsors. See:
 https://amplab.cs.berkeley.edu/sponsors/ for full details.
 
 === Rationale ===
 As the number of contributors to Spark has grown, we have sought a
 long-term home for the project, and we believe the Apache foundation
would
 be a great fit. Spark is a natural fit for the Apache foundation: Spark
 already interoperates with several existing Apache projects (HDFS,
HBase,
 Hive, Cassandra, Avro and Flume to name a few). The Spark team is
familiar
 with the Apache process and subscribes to the Apache mission - the
 team includes multiple Apache committers already. Finally, joining
Apache
 will help coordinate the development effort of the growing number of
 organizations which contribute to Spark

Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-03 Thread Henry Saputra
Thanks Chris, looking forward to this project being part of the ASF family.

I have added my name as mentor in the proposal.

- Henry


On Mon, Jun 3, 2013 at 6:41 PM, Mattmann, Chris A (398J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Henry,

 I've conferred with the incoming Spark community and we are
 very happy to have you as a mentor on the project.
 Please feel free to add yourself to the wiki.

 Cheers,
 Chris

 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Henry Saputra henry.sapu...@gmail.com
 Reply-To: general@incubator.apache.org general@incubator.apache.org
 Date: Friday, May 31, 2013 12:38 PM
 To: general@incubator.apache.org general@incubator.apache.org
 Subject: Re: [PROPOSAL] Apache Spark for the Incubator

 Wow! I have been using Shark, which runs on top of Spark, with Mesos in
 our prototype for API analytics for a while and would LOVE to help as a
 mentor and initial contributor.
 
 
 - Henry
 
 
 
 On Fri, May 31, 2013 at 11:03 AM, Mattmann, Chris A (398J) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 
  Hi Folks,
 
  I'm pleased to bring you a proposal to the Apache Incubator for the
 Apache
  Spark project: https://wiki.apache.org/incubator/SparkProposal
 
  The work originates from the Berkeley AMPLab and through a number of
  industry
  participants, and other institutions. Spark is a framework for
 large-scale
  data
  analysis on clusters, with a particular focus on low latency operations.
  The
  source code is written in Scala, and provides a number of APIs and
 bindings
  in various programming languages.
 
  The proposal text is copied to the bottom of this email. I'm going to
 leave
  this thread open for the next week for discussion. Once it's died down,
  I'll
  call an official VOTE.
 
  Suresh, Ross G. -- heads up -- this project may be of interest to you
 both
  and would welcome you guys as additional mentors. We currently have 3
  mentors
  committed to the project, but would love to have more. People
 interested in
  contributing should declare their interest here on the general@incubator
  thread
  and those potential contributors will be discussed by the incoming Spark
  community.
 
  Questions -- let's hear 'em! :)
 
  Cheers,
  Chris
  (Champion, incoming Apache Spark)
 
  === Abstract ===
  Spark is an open source system for large-scale data analysis on
 clusters.
 
  === Proposal ===
  Spark is an open source system for fast and flexible large-scale data
  analysis. Spark provides a general purpose runtime that supports
  low-latency execution in several forms. These include interactive
  exploration of very large datasets, near real-time stream processing,
 and
  ad-hoc SQL analytics (through higher layer extensions). Spark interfaces
  with HDFS, HBase, Cassandra and several other storage layers,
 and
  exposes APIs in Scala, Java and Python.
  Background
  Spark started as a U.C. Berkeley research project, designed to efficiently
  run machine learning algorithms on large datasets. Over time, it has
  evolved into a general computing engine as outlined above. Spark's
  developer community has also grown to include additional institutions,
  such as universities, research labs, and corporations. Funding has been
  provided by various institutions including the U.S. National Science
  Foundation, DARPA, and a number of industry sponsors. See:
  https://amplab.cs.berkeley.edu/sponsors/ for full details.
 
  === Rationale ===
  As the number of contributors to Spark has grown, we have sought a
  long-term home for the project, and we believe the Apache foundation
 would
  be a great fit. Spark is a natural fit for the Apache foundation: Spark
  already interoperates with several existing Apache projects (HDFS,
 HBase,
  Hive, Cassandra, Avro and Flume to name a few). The Spark team is
 familiar
  with the Apache process and subscribes to the Apache mission - the
  team includes multiple Apache committers already. Finally, joining
 Apache
  will help coordinate the development effort of the growing number of
  organizations which contribute to Spark.
 
  == Initial Goals ==
  The initial goals will most likely be to move the existing codebase to
  Apache and integrate with the Apache development process. Furthermore,
 we
  plan for incremental development, and releases along with the Apache
  guidelines.
 
  === Current Status ===
  == Meritocracy ==
  The Spark project already operates on meritocratic principles. Today,
  Spark has

Re: [PROPOSAL] Apache Spark for the Incubator

2013-06-01 Thread Suresh Marru
On May 31, 2013, at 2:03 PM, Mattmann, Chris A (398J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Folks,
 
 I'm pleased to bring you a proposal to the Apache Incubator for the Apache
 Spark project: https://wiki.apache.org/incubator/SparkProposal
 
 The work originates from the Berkeley AMPLab and through a number of
 industry
 participants, and other institutions. Spark is a framework for large-scale
 data 
 analysis on clusters, with a particular focus on low latency operations.
 The
 source code is written in Scala, and provides a number of APIs and bindings
 in various programming languages.
 
 The proposal text is copied to the bottom of this email. I'm going to leave
 this thread open for the next week for discussion. Once it's died down,
 I'll
 call an official VOTE.
 
 Suresh, Ross G. -- heads up -- this project may be of interest to you both
 and would welcome you guys as additional mentors. We currently have 3
 mentors
 committed to the project, but would love to have more.

Thanks Chris for the alert. Great proposal indeed; if the podling needs help,
I am in.

Suresh


 People interested in
 contributing should declare their interest here on the general@incubator
 thread
 and those potential contributors will be discussed by the incoming Spark
 community.
 
 Questions -- let's hear 'em! :)
 
 Cheers,
 Chris
 (Champion, incoming Apache Spark)
 
 === Abstract ===
 Spark is an open source system for large-scale data analysis on clusters.
 
 === Proposal ===
 Spark is an open source system for fast and flexible large-scale data
 analysis. Spark provides a general purpose runtime that supports
 low-latency execution in several forms. These include interactive
 exploration of very large datasets, near real-time stream processing, and
 ad-hoc SQL analytics (through higher layer extensions). Spark interfaces
 with HDFS, HBase, Cassandra and several other storage layers, and
 exposes APIs in Scala, Java and Python.
 Background
 Spark started as a U.C. Berkeley research project, designed to efficiently
 run machine learning algorithms on large datasets. Over time, it has
 evolved into a general computing engine as outlined above. Spark's
 developer community has also grown to include additional institutions,
 such as universities, research labs, and corporations. Funding has been
 provided by various institutions including the U.S. National Science
 Foundation, DARPA, and a number of industry sponsors. See:
 https://amplab.cs.berkeley.edu/sponsors/ for full details.
 
 === Rationale ===
 As the number of contributors to Spark has grown, we have sought a
 long-term home for the project, and we believe the Apache foundation would
 be a great fit. Spark is a natural fit for the Apache foundation: Spark
 already interoperates with several existing Apache projects (HDFS, HBase,
 Hive, Cassandra, Avro and Flume to name a few). The Spark team is familiar
 with the Apache process and subscribes to the Apache mission - the
 team includes multiple Apache committers already. Finally, joining Apache
 will help coordinate the development effort of the growing number of
 organizations which contribute to Spark.
 
 == Initial Goals ==
 The initial goals will most likely be to move the existing codebase to
 Apache and integrate with the Apache development process. Furthermore, we
 plan for incremental development, and releases along with the Apache
 guidelines.
 
 === Current Status ===
 == Meritocracy ==
 The Spark project already operates on meritocratic principles. Today,
 Spark has several developers and has accepted multiple major patches from
 outside of U.C. Berkeley. While this process has remained mostly informal
 (we do not have an official committer list), an implicit organization
 exists in which individuals who contribute major components act as
 maintainers for those modules. If accepted, the Spark project would
 include several of these participants as committers from the onset. We
 will work to identify all committers and PPMC members for the project and
 to operate under the ASF meritocratic principles.
 
 === Community ===
 Acceptance into the Apache foundation would bolster the already strong
 user and developer community around Spark. That community includes dozens
 of contributors from several institutions, a meetup group with several
 hundred members, and an active mailing list composed of hundreds of users.
 Core Developers
 The core developers of our project are listed in our contributors and
 initial PPMC below. Though many exist at UC Berkeley, there is a
 representative cross sampling of other organizations including Quantifind,
 Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Tagged and Webtrends.
 
 
 === Alignment ===
 Our proposed effort aligns with several ongoing BIGDATA and U.S. National
 priority funding interests including the NSF and its Expeditions program,
 and the DARPA XDATA project. Our industry partners and collaborators are
 well aligned with our code base.
 
 

[PROPOSAL] Apache Spark for the Incubator

2013-05-31 Thread Mattmann, Chris A (398J)
Hi Folks,

I'm pleased to bring you a proposal to the Apache Incubator for the Apache
Spark project: https://wiki.apache.org/incubator/SparkProposal

The work originates from the Berkeley AMPLab and through a number of
industry
participants, and other institutions. Spark is a framework for large-scale
data 
analysis on clusters, with a particular focus on low latency operations.
The
source code is written in Scala, and provides a number of APIs and bindings
in various programming languages.

The proposal text is copied to the bottom of this email. I'm going to leave
this thread open for the next week for discussion. Once it's died down,
I'll
call an official VOTE.

Suresh, Ross G. -- heads up -- this project may be of interest to you both
and would welcome you guys as additional mentors. We currently have 3
mentors
committed to the project, but would love to have more. People interested in
contributing should declare their interest here on the general@incubator
thread
and those potential contributors will be discussed by the incoming Spark
community.

Questions -- let's hear 'em! :)

Cheers,
Chris
(Champion, incoming Apache Spark)

=== Abstract ===
Spark is an open source system for large-scale data analysis on clusters.

=== Proposal ===
Spark is an open source system for fast and flexible large-scale data
analysis. Spark provides a general purpose runtime that supports
low-latency execution in several forms. These include interactive
exploration of very large datasets, near real-time stream processing, and
ad-hoc SQL analytics (through higher layer extensions). Spark interfaces
with HDFS, HBase, Cassandra and several other storage layers, and
exposes APIs in Scala, Java and Python.
Background
Spark started as a U.C. Berkeley research project, designed to efficiently
run machine learning algorithms on large datasets. Over time, it has
evolved into a general computing engine as outlined above. Spark's
developer community has also grown to include additional institutions,
such as universities, research labs, and corporations. Funding has been
provided by various institutions including the U.S. National Science
Foundation, DARPA, and a number of industry sponsors. See:
https://amplab.cs.berkeley.edu/sponsors/ for full details.

=== Rationale ===
As the number of contributors to Spark has grown, we have sought a
long-term home for the project, and we believe the Apache foundation would
be a great fit. Spark is a natural fit for the Apache foundation: Spark
already interoperates with several existing Apache projects (HDFS, HBase,
Hive, Cassandra, Avro and Flume to name a few). The Spark team is familiar
with the Apache process and subscribes to the Apache mission - the
team includes multiple Apache committers already. Finally, joining Apache
will help coordinate the development effort of the growing number of
organizations which contribute to Spark.

== Initial Goals ==
The initial goals will most likely be to move the existing codebase to
Apache and integrate with the Apache development process. Furthermore, we
plan for incremental development, and releases along with the Apache
guidelines.

=== Current Status ===
== Meritocracy ==
The Spark project already operates on meritocratic principles. Today,
Spark has several developers and has accepted multiple major patches from
outside of U.C. Berkeley. While this process has remained mostly informal
(we do not have an official committer list), an implicit organization
exists in which individuals who contribute major components act as
maintainers for those modules. If accepted, the Spark project would
include several of these participants as committers from the outset. We
will work to identify all committers and PPMC members for the project and
to operate under the ASF meritocratic principles.

=== Community ===
Acceptance into the Apache foundation would bolster the already strong
user and developer community around Spark. That community includes dozens
of contributors from several institutions, a meetup group with several
hundred members, and an active mailing list composed of hundreds of users.

=== Core Developers ===
The core developers of our project are listed in our contributors and
initial PPMC below. Though many are at U.C. Berkeley, there is a
representative cross-section of other organizations including Quantifind,
Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Tagged and Webtrends.


=== Alignment ===
Our proposed effort aligns with several ongoing BIGDATA and U.S. National
priority funding interests including the NSF and its Expeditions program,
and the DARPA XDATA project. Our industry partners and collaborators are
well aligned with our code base.

There are also a number of related Apache projects and dependencies that
will be mentioned in the Relationships with Other Apache Products section.

== Known Risks ==

=== Orphaned Products ===
Given the current level of investment in Spark, the risk of the project
being abandoned is minimal.

Re: [PROPOSAL] Apache Spark for the Incubator

2013-05-31 Thread Mattmann, Chris A (398J)
Guys, I've added Thomas Dudziak as a mentor to the proposal
at his request. He is a member of the ASF and should be granted
IPMC access soon.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Mattmann, jpluser chris.a.mattm...@jpl.nasa.gov
Reply-To: general@incubator.apache.org general@incubator.apache.org
Date: Friday, May 31, 2013 11:03 AM
To: general@incubator.apache.org general@incubator.apache.org
Subject: [PROPOSAL] Apache Spark for the Incubator

Hi Folks,

I'm pleased to bring you a proposal to the Apache Incubator for the Apache
Spark project: https://wiki.apache.org/incubator/SparkProposal

The work originates from the Berkeley AMPLab and a number of industry
participants and other institutions. Spark is a framework for large-scale
data analysis on clusters, with a particular focus on low-latency
operations. The source code is written in Scala, and provides a number of
APIs and bindings in various programming languages.

The proposal text is copied to the bottom of this email. I'm going to leave
this thread open for the next week for discussion. Once it's died down,
I'll call an official VOTE.

Suresh, Ross G. -- heads up -- this project may be of interest to you both,
and we would welcome you guys as additional mentors. We currently have 3
mentors committed to the project, but would love to have more. People
interested in contributing should declare their interest here on the
general@incubator thread, and those potential contributors will be
discussed by the incoming Spark community.

Questions -- let's hear 'em! :)

Cheers,
Chris
(Champion, incoming Apache Spark)

=== Abstract ===
Spark is an open source system for large-scale data analysis on clusters.

=== Proposal ===
Spark is an open source system for fast and flexible large-scale data
analysis. Spark provides a general purpose runtime that supports
low-latency execution in several forms. These include interactive
exploration of very large datasets, near real-time stream processing, and
ad-hoc SQL analytics (through higher layer extensions). Spark interfaces
with HDFS, HBase, Cassandra and several other storage layers, and
exposes APIs in Scala, Java and Python.

=== Background ===
Spark started as a U.C. Berkeley research project, designed to efficiently
run machine learning algorithms on large datasets. Over time, it has
evolved into a general computing engine as outlined above. Spark's
developer community has also grown to include additional institutions,
such as universities, research labs, and corporations. Funding has been
provided by various institutions including the U.S. National Science
Foundation, DARPA, and a number of industry sponsors. See:
https://amplab.cs.berkeley.edu/sponsors/ for full details.

=== Rationale ===
As the number of contributors to Spark has grown, we have sought a
long-term home for the project, and we believe the Apache Foundation is a
natural fit. Spark already interoperates with several existing Apache
projects (HDFS, HBase, Hive, Cassandra, Avro and Flume, to name a few). The
Spark team is familiar with the Apache process and subscribes to the Apache
mission; the team already includes multiple Apache committers. Finally,
joining Apache will help coordinate the development effort of the growing
number of organizations that contribute to Spark.

== Initial Goals ==
The initial goals will most likely be to move the existing codebase to
Apache and integrate with the Apache development process. Furthermore, we
plan for incremental development and releases in line with the Apache
guidelines.

=== Current Status ===
== Meritocracy ==
The Spark project already operates on meritocratic principles. Today,
Spark has several developers and has accepted multiple major patches from
outside of U.C. Berkeley. While this process has remained mostly informal
(we do not have an official committer list), an implicit organization
exists in which individuals who contribute major components act as
maintainers for those modules. If accepted, the Spark project would
include several of these participants as committers from the outset. We
will work to identify all committers and PPMC members for the project and
to operate under the ASF meritocratic principles.

=== Community ===
Acceptance into the Apache foundation would bolster the already strong
user and developer community around Spark. That community includes dozens
of contributors from

RE: [PROPOSAL] Apache Spark for the Incubator

2013-05-31 Thread Chen, Pei
+1 (non-binding)
This seems like a really interesting project.  
Q- Is Spark just a framework/API or does it also have some tools implemented 
for data analytics?
--Pei


Re: [PROPOSAL] Apache Spark for the Incubator

2013-05-31 Thread Konstantin Boudnik
Great news!

Definitely +1 (non-binding, I guess) on adding Spark to the family
of ASF projects!

I'd also like to express my interest in contributing to the project and
moving it forward to graduation! Bigtop has been packaging and providing
Spark as part of Hadoop 1.x software stacks for some time, and will
hopefully be able to offer it as part of the Hadoop 2.x line in the coming
days.

Dr. Konstantin Boudnik
  Hadoop committer
  BigTop PMC


Re: [PROPOSAL] Apache Spark for the Incubator

2013-05-31 Thread Henry Saputra
Wow! I have been using Shark, which runs on top of Spark, with Mesos in our
prototype for API analytics for a while and would LOVE to help as a mentor
and initial contributor.


- Henry




[PROPOSAL] Apache Spark for the Incubator

2013-05-31 Thread Henry Saputra
I believe it is more of a framework, but you can take a look at Shark, which
uses Spark to do data warehousing and supports Hive queries
(http://shark.cs.berkeley.edu).

- Henry

On Friday, May 31, 2013, Chen, Pei wrote:

 +1 (non-binding)
 This seems like a really interesting project.
 Q- Is Spark just a framework/API or does it also have some tools
 implemented for data analytics?
 --Pei


Re: [PROPOSAL] Apache Spark for the Incubator

2013-05-31 Thread Reynold Xin
Spark is an execution framework, but it also provides some high-level
APIs that make it much easier to do data analytics.

For example, to do grep-like queries:

val docs = sparkContext.textFile("hdfs://...")
docs.filter(doc => doc.contains("Berkeley")).count

Another example, a word count (using the Scala API):

val docs = sparkContext.textFile("hdfs://...")
val counts = docs.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

The high-level APIs resemble many of the relational operators,
including aggregations, group-bys, and joins.
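
To make the comparison with relational operators concrete, here is a minimal
sketch of a group-by aggregation followed by an inner join. It deliberately
uses plain Scala collections as a local stand-in so it runs without a
cluster; it is not the Spark API itself. The object name, the sample records
and the field meanings are all hypothetical. On a Spark key-value dataset the
analogous steps would be expressed with operations such as reduceByKey (shown
above) and join.

```scala
// Local stand-in for relational-style operators (aggregation, group by, join).
// Plain Scala collections are used so the sketch runs without a cluster;
// all data and names here are hypothetical.
object RelationalSketch {
  // (user, pagesViewed) and (user, country): made-up sample records.
  val views: Seq[(String, Int)] = Seq(("ann", 3), ("bob", 5), ("ann", 4))
  val countries: Seq[(String, String)] = Seq(("ann", "US"), ("bob", "DE"))

  // Aggregation / group by: total views per user.
  val totals: Map[String, Int] =
    views.groupBy(_._1).map { case (user, rows) => (user, rows.map(_._2).sum) }

  // Inner join on the user key, giving (user, (totalViews, country)).
  // Users with no country record are dropped, as in an inner join.
  val countryOf: Map[String, String] = countries.toMap
  val joined: Seq[(String, (Int, String))] =
    totals.toSeq.flatMap { case (user, total) =>
      countryOf.get(user).map(country => (user, (total, country)))
    }.sortBy(_._1)

  def main(args: Array[String]): Unit =
    joined.foreach(println) // prints (ann,(7,US)) then (bob,(5,DE))
}
```

The point of the sketch is only the shape of the computation: key the records,
aggregate per key, then combine two keyed datasets on the shared key.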

Shark uses Spark as the execution engine but provides a Hive-compatible SQL
interface. This proposal, however, is only about moving Spark to the ASF
Incubator, not Shark.

--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org


On Fri, May 31, 2013 at 1:03 PM, Henry Saputra henry.sapu...@gmail.com wrote:

 I believe it is more of a framework, but you can take a look at Shark, which
 uses Spark to do data warehousing and supports Hive queries
 (http://shark.cs.berkeley.edu).

 - Henry

 On Friday, May 31, 2013, Chen, Pei wrote:

  +1 (non-binding)
  This seems like a really interesting project.
  Q- Is Spark just a framework/API or does it also have some tools
  implemented for data analytics?
  --Pei
 

Re: [PROPOSAL] Apache Spark for the Incubator

2013-05-31 Thread Roman Shaposhnik
Extremely enthusiastic +1!!!

If you ever need help with mentorship -- please let me know.

Also, looking forward to seeing this in Bigtop!

Thanks,
Roman.

On Fri, May 31, 2013 at 11:03 AM, Mattmann, Chris A (398J)
chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Folks,

 I'm pleased to bring you a proposal to the Apache Incubator for the Apache
 Spark project: https://wiki.apache.org/incubator/SparkProposal

 The work originates from the Berkeley AMPLab, with contributions from a
 number of industry participants and other institutions. Spark is a
 framework for large-scale data analysis on clusters, with a particular
 focus on low latency operations. The source code is written in Scala, and
 the project provides a number of APIs and bindings in various programming
 languages.

 The proposal text is copied to the bottom of this email. I'm going to leave
 this thread open for the next week for discussion. Once it's died down,
 I'll call an official VOTE.

 Suresh, Ross G. -- heads up -- this project may be of interest to you both,
 and we would welcome you guys as additional mentors. We currently have 3
 mentors committed to the project, but would love to have more. People
 interested in contributing should declare their interest here on the
 general@incubator thread, and those potential contributors will be
 discussed by the incoming Spark community.

 Questions -- let's hear 'em! :)

 Cheers,
 Chris
 (Champion, incoming Apache Spark)

 === Abstract ===
 Spark is an open source system for large-scale data analysis on clusters.

 === Proposal ===
 Spark is an open source system for fast and flexible large-scale data
 analysis. Spark provides a general purpose runtime that supports
 low-latency execution in several forms. These include interactive
 exploration of very large datasets, near real-time stream processing, and
 ad-hoc SQL analytics (through higher layer extensions). Spark interfaces
 with HDFS, HBase, Cassandra and several other storage layers, and
 exposes APIs in Scala, Java and Python.
 === Background ===
 Spark started as a U.C. Berkeley research project, designed to efficiently
 run machine learning algorithms on large datasets. Over time, it has
 evolved into a general computing engine as outlined above. Spark's
 developer community has also grown to include additional institutions,
 such as universities, research labs, and corporations. Funding has been
 provided by various institutions including the U.S. National Science
 Foundation, DARPA, and a number of industry sponsors. See:
 https://amplab.cs.berkeley.edu/sponsors/ for full details.

 === Rationale ===
 As the number of contributors to Spark has grown, we have sought a
 long-term home for the project, and we believe the Apache foundation would
 be a great fit. Spark is a natural fit for the Apache foundation: Spark
 already interoperates with several existing Apache projects (HDFS, HBase,
 Hive, Cassandra, Avro and Flume to name a few). The Spark team is familiar
 with the Apache process and subscribes to the Apache mission - the
 team includes multiple Apache committers already. Finally, joining Apache
 will help coordinate the development effort of the growing number of
 organizations which contribute to Spark.

 == Initial Goals ==
 The initial goals will most likely be to move the existing codebase to
 Apache and integrate with the Apache development process. Furthermore, we
 plan for incremental development, and releases in line with the Apache
 guidelines.

 === Current Status ===
 == Meritocracy ==
 The Spark project already operates on meritocratic principles. Today,
 Spark has several developers and has accepted multiple major patches from
 outside of U.C. Berkeley. While this process has remained mostly informal
 (we do not have an official committer list), an implicit organization
 exists in which individuals who contribute major components act as
 maintainers for those modules. If accepted, the Spark project would
 include several of these participants as committers from the outset. We
 will work to identify all committers and PPMC members for the project and
 to operate under the ASF meritocratic principles.

 === Community ===
 Acceptance into the Apache foundation would bolster the already strong
 user and developer community around Spark. That community includes dozens
 of contributors from several institutions, a meetup group with several
 hundred members, and an active mailing list composed of hundreds of users.
 == Core Developers ==
 The core developers of our project are listed in our contributors and
 initial PPMC below. Though many are at U.C. Berkeley, there is a
 representative cross-section of other organizations including Quantifind,
 Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Tagged and Webtrends.


 === Alignment ===
 Our proposed effort aligns with several ongoing BIGDATA and U.S. National
 priority funding interests including the NSF and its Expeditions program,
 and the DARPA XDATA project. Our industry partners and