Sounds good to me. I'm not into Hadoop, but it sounds useful. The code seems to have been under ALv2 for quite some time, so I don't see any major legal issues in this respect.
LieGrue,
strub

On Tuesday, 13 May 2014, 6:09, Chris Aniszczyk <[email protected]> wrote:
>We would like to propose Parquet as an Apache Incubator project.
>https://wiki.apache.org/incubator/ParquetProposal
>
>Feel free to comment; we'll go for a vote in a week or two or whenever
>consensus has been reached on the proposal.
>
>I've posted the text of the proposal below:
>
>== Abstract ==
>Parquet is a columnar storage format for Hadoop.
>
>== Proposal ==
>
>We created Parquet to make the advantages of compressed, efficient columnar
>data representation available to any project in the Hadoop ecosystem,
>regardless of the choice of data processing framework, data model, or
>programming language.
>
>== Background ==
>
>Parquet is built from the ground up with complex nested data structures in
>mind, and uses the repetition/definition level approach to encoding such
>data structures, as popularized by Google Dremel
>(https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We believe
>this approach is superior to simple flattening of nested name spaces.
>
>Parquet is built to support very efficient compression and encoding
>schemes. Parquet allows compression schemes to be specified on a per-column
>level, and is future-proofed to allow adding more encodings as they are
>invented and implemented. We separate the concepts of encoding and
>compression, allowing Parquet consumers to implement operators that work
>directly on encoded data without paying decompression and decoding penalty
>when possible.
>
>== Rationale ==
>
>Parquet is built to be used by anyone. We believe that an efficient,
>well-implemented columnar storage substrate should be useful to all
>frameworks without the cost of extensive and difficult to set up
>dependencies.
>
>Furthermore, the rapid growth of the Parquet community is empowered by open
>source.
>We believe the Apache foundation is a great fit as the long-term
>home for Parquet, as it provides an established process for
>community-driven development and decision making by consensus. This is
>exactly the model we want for future Parquet development.
>
>== Initial Goals ==
>
>* Move the existing codebase to Apache
>* Integrate with the Apache development process
>* Ensure all dependencies are compliant with Apache License version 2.0
>* Incremental development and releases per Apache guidelines
>
>== Current Status ==
>
>Parquet has undergone 2 major releases:
>https://github.com/Parquet/parquet-format/releases of the core format and
>22 releases: https://github.com/Parquet/parquet-mr/releases of the
>supporting set of Java libraries.
>
>The Parquet source is currently hosted at GitHub, which will seed the
>Apache git repository.
>
>=== Meritocracy ===
>
>We plan to invest in supporting a meritocracy. We will discuss the
>requirements in an open forum. Several companies have already expressed
>interest in this project, and we intend to invite additional developers to
>participate. We will encourage and monitor community participation so that
>privileges can be extended to those that contribute.
>
>=== Community ===
>
>There is a large need for an advanced columnar storage format for Hadoop.
>Parquet is being used in production by many organizations (see
>https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md)
>
>* Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
>* Criteo: https://twitter.com/julsimon/statuses/312114074911666177
>* Salesforce: https://twitter.com/TwitterOSS/statuses/392734610116726784
>* Stripe: https://twitter.com/avibryant/statuses/391339949250715648
>* Twitter: https://twitter.com/J_/statuses/315844725611581441
>
>By bringing Parquet into Apache, we believe that the community will grow
>even bigger.
>
>=== Core Developers ===
>
>Parquet was initially developed as a collaboration between Twitter,
>Cloudera and Criteo.
>
>See
>https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
>
>=== Alignment ===
>
>We believe that having Parquet at Apache will help further the growth of
>the big-data community, as it will encourage cooperation within the greater
>ecosystem of projects spawned by Apache Hadoop. The alignment is also
>beneficial to other Apache communities (such as Hadoop, Hive, Avro).
>
>== Known Risks ==
>
>=== Orphaned Products ===
>
>The risk of the Parquet project being abandoned is minimal. There are many
>organizations using Parquet in production, including Twitter, Cloudera,
>Stripe, and Salesforce
>(http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
>
>=== Inexperience with Open Source ===
>
>Parquet has existed as a healthy open source project for one year. During
>that time, we have curated an open-source community successfully, attracting
>over 40 contributors (see
>https://github.com/Parquet/parquet-mr/graphs/contributors) from a diverse
>group of companies.
>Several of the core contributors to the project are deeply familiar with
>OSS and Apache specifically: Julien Le Dem is the current PMC Chair for
>Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan Coveney are
>also Apache Pig committers with contributions to several other Apache
>projects. Todd Lipcon and Tom White are committers to Apache Hadoop and
>multiple other related projects. Brock Noland is a Hive committer.
>
>=== Homogenous Developers ===
>
>The initial committers come from a number of companies and countries.
>Parquet has an active community of developers, and we are committed to
>recruiting additional committers based on their contributions to the
>project. The Java library component alone has contributions from 31
>individual GitHub accounts, 14 of which contributed over 1000 lines of code.
>
>=== Reliance on Salaried Developers ===
>
>It is expected that Parquet development will occur on both salaried time
>and on volunteer time, after hours. The majority of initial committers are
>paid by their employers to contribute to this project. However, they are
>all passionate about the project, and we are confident that the project
>will continue even if no salaried developers contribute to the project. As
>evidence of this statement, we present the GitHub punchcard (see
>https://github.com/Parquet/parquet-mr/graphs/punch-card) showing that a lot
>of activity happens on weekends. We are committed to recruiting additional
>committers including non-salaried developers.
>
>=== Relationships with Other Apache Products ===
>
>As mentioned in the Alignment section, Parquet is closely related to
>Hadoop, Pig, Avro, Thrift, YARN and Mesos in numerous ways. We look
>forward to collaborating with those communities, as well as other Apache
>communities (including Apache S4 which focuses on stateful low-latency
>processing).
>
>=== An Excessive Fascination with the Apache Brand ===
>
>Parquet is an already healthy and well known open source project. This
>proposal is not for the purpose of generating publicity. Rather, the
>primary benefits to joining Apache are those outlined in the Rationale
>section.
>
>== Documentation ==
>
>Documentation is currently located as README markdown files:
>
>* https://github.com/Parquet/parquet-format
>* https://github.com/Parquet/parquet-mr
>
>== Source and Intellectual Property Submission Plan ==
>
>The Parquet codebase is currently hosted on GitHub:
>https://github.com/Parquet.
>
>This is the exact codebase that we would migrate to the Apache foundation.
>
>== External Dependencies ==
>
>* JUnit: EPL
>* Apache Commons: ALv2
>* Apache Thrift: ALv2
>* Apache Maven: ALv2
>* Apache Avro: ALv2
>* Apache Hadoop: ALv2
>* Google Guava: ALv2
>
>== Cryptography ==
>
>We do not expect Parquet to be a controlled export item due to the use of
>encryption.
>
>== Required Resources ==
>
>=== Mailing lists ===
>
>* parquet-dev
>* parquet-user
>
>== Subversion Directory ==
>
>Git is the preferred source control system: git://git.apache.org/parquet
>
>== Issue Tracking ==
>
>JIRA: Parquet (PARQUET)
>
>== Initial Committers ==
>
>* Aniket Mokashi
>* Brock Noland
>* Chris Aniszczyk <[email protected]>
>* Dmitriy Ryaboy <[email protected]>
>* Jake Farrell
>* Julien Le Dem <[email protected]>
>* Lukas Nalezenec
>* Marcel Kornacker
>* Mickael Lacour
>* Nong Li
>* Remy Pecqueur
>* Tianshuo Deng
>* Tom White
>
>== Affiliations ==
>
>* Aniket Mokashi - Twitter
>* Brock Noland - Cloudera
>* Chris Aniszczyk - Twitter
>* Dmitriy Ryaboy - Twitter
>* Jake Farrell
>* Julien Le Dem - Twitter
>* Lukas Nalezenec
>* Marcel Kornacker - Cloudera
>* Mickael Lacour - Criteo
>* Nong Li - Cloudera
>* Remy Pecqueur - Criteo
>* Tianshuo Deng - Twitter
>* Tom White - Cloudera
>
>== Sponsors ==
>
>=== Champion ===
>
>* Todd Lipcon
>
>=== Nominated Mentors ===
>
>* Tom White
>* Chris Mattmann
>* Jake Farrell
>
>=== Sponsoring Entity ===
>
>The Apache Incubator
>
>--
>Cheers,
>
>Chris Aniszczyk
>http://aniszczyk.org
>+1 512 961 6719
