Re: [DISCUSS] Druid incubation proposal

Pramod Immaneni Thu, 22 Feb 2018 09:37:33 -0800

+1

On Fri, Feb 16, 2018 at 12:15 PM, Gian Merlino <g...@apache.org> wrote:


> Hi all,
>
> I would like to open up a discussion about incubating Druid at Apache. I've
> included a proposal in this mail and have also posted a draft at
> https://wiki.apache.org/incubator/DruidProposal. More information about
> Druid is also available on our project web site at: http://druid.io/
>
> Thanks for your consideration!
>
> Gian
>
> = Druid Proposal =
>
> == Abstract ==
>
> Druid is a high-performance, column-oriented, distributed data store.
>
> == Proposal ==
>
> Druid is an open source data store designed for real-time exploratory
> analytics on large data sets. Druid's key features are a column-oriented
> storage layout, a distributed shared-nothing architecture, and the ability to
> generate and leverage indexing and caching structures. Druid is typically
> deployed in clusters of tens to hundreds of nodes, and has the ability to
> load data from Apache Kafka and Apache Hadoop, among other data sources.
> Druid offers two query languages: a SQL dialect (powered by Apache Calcite)
> and a JSON-over-HTTP API.
>
> Druid was originally developed to power a slice-and-dice analytical UI
> built on top of large event streams. The original use case for Druid
> targeted ingest rates of millions of records/sec, retention of over a year
> of data, and query latencies of sub-second to a few seconds. Many people
> can benefit from such capability, and many already have (see
> http://druid.io/druid-powered.html). In addition, new use cases have
> emerged since Druid's original development, such as OLAP acceleration of
> data warehouse tables and more highly concurrent applications operating
> with relatively narrow queries.
>
> == Background ==
>
> Druid is a data store designed for fast analytics. It would typically be
> used in lieu of more general purpose query systems like Hadoop !MapReduce
> or Spark when query latency is of the utmost importance. Druid is often
> used as a data store for powering GUI analytical applications.
>
> The buzzwordy description of Druid is a high-performance, column-oriented,
> distributed data store. What we mean by this is:
>
>  * "high performance": Druid aims to provide the lowest query latency and
> highest ingest rates possible.
>  * "column-oriented": Druid stores data in a column-oriented format, like
> most other systems designed for analytics. It can also store indexes along
> with the columns.
>  * "distributed": Druid is deployed in clusters, typically of tens to
> hundreds of nodes.
>  * "data store": Druid loads your data and stores a copy of it on the
> cluster's local disks (and may cache it in memory). It doesn't query your
> data from some other storage system.
>
> == Rationale ==
>
> Druid is a mature, active project with a large number of production
> installations, dozens of contributors to each release, and multiple vendors
> offering professional support. Given Druid's strong community, its close
> integration with many other Apache projects (such as Kafka, Hadoop, and
> Calcite), and its pre-existing Apache-inspired governance structure, we
> feel that Apache is the best home for the project on a long-term basis.
>
> == Current Status ==
>
> === Meritocracy ===
> Since Druid was first open sourced, the original developers have solicited
> contributions from others, including through our blog, the project mailing
> lists, and !GitHub pull requests. We have an
> Apache-inspired governance structure with a PMC and committers, and our
> committer ranks include a good number of people from outside the original
> development team.
>
> === Community ===
>
> The Druid core developers have sought to nurture a community throughout the
> life of the project. We use !GitHub as the focal point for bug reports and
> code contributions, and the mailing lists for most other discussion. To try
> to make people feel welcome, we've also spelled this out on a "CONTRIBUTE"
> link from the project page: http://druid.io/community/. Today we have an
> active contributor base (a typical release has ~40 contributors) and an
> active mailing list.
>
> === Core Developers ===
>
> Druid enjoys good diversity of committer affiliation. The most active
> developers over the past year are affiliated with four different companies:
> Imply, Metamarkets, Yahoo, and Hortonworks. Many Druid committers are also
> committers on other ASF projects, including Apache Airflow, Apache
> Curator, and Apache Calcite. The original developers of Druid remain
> involved in the project.
>
> === Alignment ===
>
> Druid's current governance structure is Apache-inspired with a PMC and
> committers chosen by a meritocratic process. Additionally, Druid integrates
> with a number of other Apache projects, including Kafka, Hadoop, Hive,
> Calcite, Superset (incubating), Spark, Curator, and !ZooKeeper.
>
> == Known Risks ==
>
> === Orphaned products ===
>
> The risk of Druid becoming orphaned is low, due to a diverse committer base
> that is invested in the future of the project.
>
> === Inexperience with Open Source ===
>
> Druid's core developers have been running it as a community-oriented open
> source project for some time now, and many of them are committers on other
> open source projects as well, including Apache Airflow, Apache Curator, and
> Apache Calcite.
>
> === Homogeneous Developers ===
>
> Druid's current diversity of committer affiliation means that we have
> become accustomed to working collaboratively and in the open. We hope that
> a transition to the ASF helps Druid's contributor base become even more
> diverse.
>
> === Reliance on Salaried Developers ===
>
> Druid's user base and contributor base skew heavily towards salaried
> developers. We believe this is natural since Druid is a technology designed
> to be deployed on large clusters, and therefore tends to be deployed by
> organizations rather than by individuals. Nevertheless, many current Druid
> developers have continued working on the project even through job changes,
> which we take to be a good sign of developer commitment and personal
> interest.
>
> === Relationships with Other Apache Products ===
>
> Druid integrates with a number of other Apache projects. Druid internally
> uses Calcite for SQL planning, and Curator and !ZooKeeper for coordination.
> Druid can read data in Avro or Parquet format. Druid can load data from
> streams in Kafka or from files in Hadoop. Druid integrates with Hive as an
> option for SQL query acceleration. Druid data can be visualized by Superset
> (incubating).
>
> === An Excessive Fascination with the Apache Brand ===
>
> Druid is a successful project with a diverse community. The main reason for
> pursuing incubation is to find a stable, long-term home for the project
> with a well-known governance philosophy.
>
> == Required Resources ==
>
> === Mailing lists ===
>
> We would like to migrate the existing Druid mailing lists from Google
> Groups to Apache.
>
>  * druid-user@googlegroups -> us...@druid.incubator.apache.org
>  * druid-development@googlegroups -> d...@druid.incubator.apache.org
>
> === Source control ===
>
> Druid development currently takes place on !GitHub. We would like to
> continue using !GitHub, if possible, in order to preserve the workflows the
> community has developed around !GitHub pull requests.
>
> === Issue tracking ===
> Druid currently uses !GitHub issues for issue tracking. We would like to
> migrate to Apache JIRA at http://issues.apache.org/jira/browse/DRUID.
>
> == Documentation ==
>
> Druid's documentation can be found at http://druid.io/docs/latest/.
>
> == Initial Source ==
>
> Druid was initially open-sourced by Metamarkets in 2012 and has been run in
> a community-governed fashion since then. The code is currently hosted at
> https://github.com/druid-io/ and includes the following repositories:
>
>  * druid (primary repository)
>  * druid-console (web console for Druid)
>  * druid-io.github.io (source for Druid's website at http://druid.io/)
>  * tranquility (realtime stream push client for Druid)
>  * docker-druid (Docker image for Druid)
>  * pydruid (Python library)
>  * RDruid (R library)
>  * oss-parent (Maven POM files)
>
> == Source and Intellectual Property Submission Plan ==
>
> A complete set of the open source code needs to be licensed from the owning
> organization to the Foundation. Commercial legal counsel for the owning
> organization will review the standard Foundation licensing paperwork and
> propose any updates as needed. This license will enable Apache to incubate
> and manage the Druid project moving forward.
>
> Other Druid paraphernalia to be transferred to Apache consists of:
>
>  * !GitHub organization at https://github.com/druid-io/
>  * Twitter account at https://twitter.com/druidio
>  * "druid.io" domain name
>  * "Druid" trademark assignment per Foundation standard paperwork. The
> trademark assignment paperwork shall be reviewed by the owning
> organization's commercial and IP counsel
>  * CLAs - all rights in the code licensed above should encompass the CLAs
> that existed between developers and the owning organization
>
> A copyright license to the code, trademark assignment of Druid, and
> transfer of other paraphernalia to Apache should be sufficient to cover all
> rights required by Apache to operate the project.
>
> == External Dependencies ==
> External dependencies distributed with Druid currently all have one of the
> following Category A or B licenses: ASL, BSD, CDDL, EPL, MIT, MPL; with one
> exception: the optional Druid MySQL metadata store extension depends on
> MySQL Connector/J, which is GPL-licensed. Druid currently packages this as
> a separate download; see how it is currently presented at
> http://druid.io/downloads.html. As part of incubation we intend to
> determine the best strategy for handling the MySQL extension.
>
> == Cryptography ==
> Not applicable.
>
> == Initial Committers ==
>
> The initial committers for incubation are the current set of committers on
> Druid who have expressed interest in being involved in Apache incubation.
> Affiliations are listed where relevant. We may seek to add other committers
> during incubation; for example, we would want to add any current Druid
> committers who express an interest after incubation begins.
>
>  * Charles Allen (char...@allen-net.com) (Snap)
>  * David Lim (david.clarence....@gmail.com) (Imply)
>  * Eric Tschetter (ched...@apache.org) (Splunk)
>  * Fangjin Yang (f...@imply.io) (Imply)
>  * Gian Merlino (g...@apache.org) (Imply)
>  * Himanshu Gupta (g.himan...@gmail.com) (Oath)
>  * Jihoon Son (jihoon...@apache.org) (Imply)
>  * Jonathan Wei (jon....@imply.io) (Imply)
>  * Maxime Beauchemin (maximebeauche...@gmail.com) (Lyft)
>  * Mohamed Slim Bouguerra (slim.bougue...@gmail.com) (Hortonworks)
>  * Nishant Bangarwa (nish...@apache.org) (Hortonworks)
>  * Parag Jain (paragjai...@gmail.com) (Oath)
>  * Roman Leventov (leventov...@gmail.com) (Metamarkets)
>  * Xavier Léauté (xav...@leaute.com) (Confluent)
>
> == Sponsors ==
>
>  * Champion: Julian Hyde
>  * Nominated mentors: Julian Hyde, P. Taylor Goetz, Jun Rao
>  * Sponsoring entity: Apache Incubator
>
