Re: [DISCUSS] MADlib Incubation Proposal

2015-09-08 Thread Roman Shaposhnik
On Mon, Sep 7, 2015 at 5:06 AM, Atri Sharma  wrote:
> Now that the HAWQ vote is closed, can we also have a consensus on this proposal
> please?

I think that would make sense. I'll start a voting thread soon.

Thanks,
Roman.




Re: [DISCUSS] MADlib Incubation Proposal

2015-09-07 Thread Atri Sharma
Now that the HAWQ vote is closed, can we also have a consensus on this proposal
please?
On 3 Sep 2015 02:07, "Roman Shaposhnik"  wrote:

> Hi!
>
> on the heels of the HAWQ proposal, I'd like
> to follow with a discussion of accepting MADlib's
> community into the ASF Incubator:
>  https://wiki.apache.org/incubator/MADlibProposal
>
> There was an extensive discussion within the existing
> open source community and the overall consensus
> is extremely supportive of this proposal:
> http://madlib.net/pipermail/user/2015-August/
> http://madlib.net/pipermail/devel/2015-August/
>
> We've done quite a bit of outreach in order to identify
> all the folks who may be interested in joining the initial
> list of committers. The current proposal reflects that.
> Additionally, we hope that the ASF DISCUSS thread
> will help us in reaching out even further.
>
> Finally, while the 3 experienced mentors currently listed
> on the proposal seem like a reasonable number, we would
> love it if other folks from the IPMC could volunteer to help us on
> this journey.
>
> Thanks,
> Roman.
>
> == Abstract ==
> MADlib is an open-source library (licensed under the 2-clause BSD license)
> for scalable in-database analytics. It provides data-parallel
> implementations of mathematical, statistical and machine learning
> methods for structured and unstructured data. The MADlib mission is to
> foster widespread development of scalable analytic skills, by
> harnessing efforts from commercial practice, academic research, and
> open source development.
>
> MADlib occupies a unique niche in the realm of data science and
> machine learning libraries since its SQL APIs can allow it to work on
> a wide range of data stores and SQL engines.
>
> == Proposal ==
> The current open source community behind MADlib feels that aligning
> itself with HAWQ's community, governance model, infrastructure and
> roadmap will allow the project to accelerate adoption and community
> growth. Given HAWQ's trajectory of entering the Apache Software Foundation
> family as an Incubating project, we feel that the best course of
> action for MADlib is to follow a similar route.
>
> MADlib and HAWQ are complementary technologies in that MADlib
> in-database analytical functions can run within the HAWQ execution
> engine. (MADlib also runs on Greenplum Database and PostgreSQL today.)
> It is expected that contributors to MADlib will be cognizant of the
> HAWQ ASF project and may contribute to it as well.  In short,
> collaboration between the two communities will make both projects more
> vibrant and advance the respective technologies in potentially novel
> directions.
>
> Contributors may also look at the HAWQ project as a starting point for
> ports to other parallel database engines. This proposal highly
> encourages this type of work as it would help to further realize the
> original cross-platform goal of MADlib as envisioned by its
> originators.
>
> Thus, the goal of this proposal is to bring the existing MADlib open
> source community into the ASF, change the project's governance model to
> the "Apache Way" and transition the project's codebase and
> infrastructure into ASF INFRA. The community has agreed to transfer
> the brand name "MADlib" to the Apache Software Foundation as well.
>
> Pivotal Inc., on behalf of the MADlib open source community, is
> submitting this proposal to transition source code and associated
> artifacts (documentation, web site content, wiki, etc.) to the Apache
> Software Foundation Incubator under the Apache License, Version 2.0
> and is asking the Incubator PMC to establish a MADlib incubating
> project.
>
> Currently MADlib uses a few category X licensed software tools during
> its build (mostly for generating documentation):
>    * doxypy 0.4.2 (GPL)
>    * doxygen 1.8.4 (GPL)
>    * TikZ-UML
>    * bison 2.4 (GPL, with an exception for generated output)
> We feel that this usage is compatible with an overall project licensed
> under the ALv2 and don't anticipate any changes.
> Our usage of the LGPL library cern_root-5.34 is expected to go away since
> the 2 cern modules used are being entirely rewritten in MADlib.
>
> Finally, MADlib's inclusion of an MPL-licensed library (eigen 3.2.2) in
> its binary artifact seems to be consistent with the
> ASF recommendation for managing "weak copyleft" dependencies.
>
>
> == Background ==
> MADlib grew out of discussions between database engine developers,
> data scientists, IT architects and academics interested in new
> approaches to scalable, sophisticated in-database analytics. These
> discussions were written up in a paper in VLDB 2009 that coined the
> term “MAD Skills” for data analysis
> (http://dl.acm.org/citation.cfm?id=1687576). The MADlib software
> project began the following year as a collaboration between
> researchers at UC Berkeley and engineers and data scientists at
> Pivotal (formerly EMC/Greenplum).
>
> The initial MADlib codebase came from EMC/Greenplum, UC Berkeley, the
> University 

[DISCUSS] MADlib Incubation Proposal

2015-09-02 Thread Roman Shaposhnik
Hi!

on the heels of the HAWQ proposal, I'd like
to follow with a discussion of accepting MADlib's
community into the ASF Incubator:
 https://wiki.apache.org/incubator/MADlibProposal

There was an extensive discussion within the existing
open source community and the overall consensus
is extremely supportive of this proposal:
http://madlib.net/pipermail/user/2015-August/
http://madlib.net/pipermail/devel/2015-August/

We've done quite a bit of outreach in order to identify
all the folks who may be interested in joining the initial
list of committers. The current proposal reflects that.
Additionally, we hope that the ASF DISCUSS thread
will help us in reaching out even further.

Finally, while the 3 experienced mentors currently listed
on the proposal seem like a reasonable number, we would
love it if other folks from the IPMC could volunteer to help us on
this journey.

Thanks,
Roman.

== Abstract ==
MADlib is an open-source library (licensed under the 2-clause BSD license)
for scalable in-database analytics. It provides data-parallel
implementations of mathematical, statistical and machine learning
methods for structured and unstructured data. The MADlib mission is to
foster widespread development of scalable analytic skills, by
harnessing efforts from commercial practice, academic research, and
open source development.

MADlib occupies a unique niche in the realm of data science and
machine learning libraries since its SQL APIs can allow it to work on
a wide range of data stores and SQL engines.

== Proposal ==
The current open source community behind MADlib feels that aligning
itself with HAWQ's community, governance model, infrastructure and
roadmap will allow the project to accelerate adoption and community
growth. Given HAWQ's trajectory of entering the Apache Software Foundation
family as an Incubating project, we feel that the best course of
action for MADlib is to follow a similar route.

MADlib and HAWQ are complementary technologies in that MADlib
in-database analytical functions can run within the HAWQ execution
engine. (MADlib also runs on Greenplum Database and PostgreSQL today.)
It is expected that contributors to MADlib will be cognizant of the
HAWQ ASF project and may contribute to it as well.  In short,
collaboration between the two communities will make both projects more
vibrant and advance the respective technologies in potentially novel
directions.

Contributors may also look at the HAWQ project as a starting point for
ports to other parallel database engines. This proposal highly
encourages this type of work as it would help to further realize the
original cross-platform goal of MADlib as envisioned by its
originators.

Thus, the goal of this proposal is to bring the existing MADlib open
source community into the ASF, change the project's governance model to
the "Apache Way" and transition the project's codebase and
infrastructure into ASF INFRA. The community has agreed to transfer
the brand name "MADlib" to the Apache Software Foundation as well.

Pivotal Inc., on behalf of the MADlib open source community, is
submitting this proposal to transition source code and associated
artifacts (documentation, web site content, wiki, etc.) to the Apache
Software Foundation Incubator under the Apache License, Version 2.0
and is asking the Incubator PMC to establish a MADlib incubating
project.

Currently MADlib uses a few category X licensed software tools during
its build (mostly for generating documentation):
   * doxypy 0.4.2 (GPL)
   * doxygen 1.8.4 (GPL)
   * TikZ-UML
   * bison 2.4 (GPL, with an exception for generated output)
We feel that this usage is compatible with an overall project licensed
under the ALv2 and don't anticipate any changes.
Our usage of the LGPL library cern_root-5.34 is expected to go away since
the 2 cern modules used are being entirely rewritten in MADlib.

Finally, MADlib's inclusion of an MPL-licensed library (eigen 3.2.2) in
its binary artifact seems to be consistent with the
ASF recommendation for managing "weak copyleft" dependencies.


== Background ==
MADlib grew out of discussions between database engine developers,
data scientists, IT architects and academics interested in new
approaches to scalable, sophisticated in-database analytics. These
discussions were written up in a paper in VLDB 2009 that coined the
term “MAD Skills” for data analysis
(http://dl.acm.org/citation.cfm?id=1687576). The MADlib software
project began the following year as a collaboration between
researchers at UC Berkeley and engineers and data scientists at
Pivotal (formerly EMC/Greenplum).

The initial MADlib codebase came from EMC/Greenplum, UC Berkeley, the
University of Wisconsin, and the University of Florida.  The project
was publicly documented in a paper at VLDB 2012
(http://vldb.org/pvldb/vol5/p1700_joehellerstein_vldb2012.pdf).  Today
MADlib has contributors from around the world including both
individuals and institutions.  For example, recent contributions have
come from Pivotal, Stanford 

Re: [DISCUSS] MADlib Incubation Proposal

2015-09-02 Thread Atri Sharma
I am very happy to see this proposal.

I think the combination of HAWQ and MADlib makes it possible to have heavy
production-level analytics on top of Hadoop (which is fantastic!).

That said, given MADlib's flexibility, I feel it would be a great addition
to the Apache big data stack in general and I am eagerly looking forward to
integration efforts with various existing Apache big data members.

Regards,

Atri
On 3 Sep 2015 02:07, "Roman Shaposhnik"  wrote:

> Hi!
>
> on the heels of the HAWQ proposal, I'd like
> to follow with a discussion of accepting MADlib's
> community into the ASF Incubator:
>  https://wiki.apache.org/incubator/MADlibProposal
>
> There was an extensive discussion within the existing
> open source community and the overall consensus
> is extremely supportive of this proposal:
> http://madlib.net/pipermail/user/2015-August/
> http://madlib.net/pipermail/devel/2015-August/
>
> We've done quite a bit of outreach in order to identify
> all the folks who may be interested in joining the initial
> list of committers. The current proposal reflects that.
> Additionally, we hope that the ASF DISCUSS thread
> will help us in reaching out even further.
>
> Finally, while the 3 experienced mentors currently listed
> on the proposal seem like a reasonable number, we would
> love it if other folks from the IPMC could volunteer to help us on
> this journey.
>
> Thanks,
> Roman.
>
> == Abstract ==
> MADlib is an open-source library (licensed under the 2-clause BSD license)
> for scalable in-database analytics. It provides data-parallel
> implementations of mathematical, statistical and machine learning
> methods for structured and unstructured data. The MADlib mission is to
> foster widespread development of scalable analytic skills, by
> harnessing efforts from commercial practice, academic research, and
> open source development.
>
> MADlib occupies a unique niche in the realm of data science and
> machine learning libraries since its SQL APIs can allow it to work on
> a wide range of data stores and SQL engines.
>
> == Proposal ==
> The current open source community behind MADlib feels that aligning
> itself with HAWQ's community, governance model, infrastructure and
> roadmap will allow the project to accelerate adoption and community
> growth. Given HAWQ's trajectory of entering the Apache Software Foundation
> family as an Incubating project, we feel that the best course of
> action for MADlib is to follow a similar route.
>
> MADlib and HAWQ are complementary technologies in that MADlib
> in-database analytical functions can run within the HAWQ execution
> engine. (MADlib also runs on Greenplum Database and PostgreSQL today.)
> It is expected that contributors to MADlib will be cognizant of the
> HAWQ ASF project and may contribute to it as well.  In short,
> collaboration between the two communities will make both projects more
> vibrant and advance the respective technologies in potentially novel
> directions.
>
> Contributors may also look at the HAWQ project as a starting point for
> ports to other parallel database engines. This proposal highly
> encourages this type of work as it would help to further realize the
> original cross-platform goal of MADlib as envisioned by its
> originators.
>
> Thus, the goal of this proposal is to bring the existing MADlib open
> source community into the ASF, change the project's governance model to
> the "Apache Way" and transition the project's codebase and
> infrastructure into ASF INFRA. The community has agreed to transfer
> the brand name "MADlib" to the Apache Software Foundation as well.
>
> Pivotal Inc., on behalf of the MADlib open source community, is
> submitting this proposal to transition source code and associated
> artifacts (documentation, web site content, wiki, etc.) to the Apache
> Software Foundation Incubator under the Apache License, Version 2.0
> and is asking the Incubator PMC to establish a MADlib incubating
> project.
>
> Currently MADlib uses a few category X licensed software tools during
> its build (mostly for generating documentation):
>    * doxypy 0.4.2 (GPL)
>    * doxygen 1.8.4 (GPL)
>    * TikZ-UML
>    * bison 2.4 (GPL, with an exception for generated output)
> We feel that this usage is compatible with an overall project licensed
> under the ALv2 and don't anticipate any changes.
> Our usage of the LGPL library cern_root-5.34 is expected to go away since
> the 2 cern modules used are being entirely rewritten in MADlib.
>
> Finally, MADlib's inclusion of an MPL-licensed library (eigen 3.2.2) in
> its binary artifact seems to be consistent with the
> ASF recommendation for managing "weak copyleft" dependencies.
>
>
> == Background ==
> MADlib grew out of discussions between database engine developers,
> data scientists, IT architects and academics interested in new
> approaches to scalable, sophisticated in-database analytics. These
> discussions were written up in a paper in VLDB 2009 that coined the
> term “MAD Skills” for data