Re: [PROPOSAL] Parquet

Mark Struberg Wed, 14 May 2014 02:51:27 -0700

Sounds good to me.
I'm not into Hadoop, but it sounds useful.
The code seems to have been under ALv2 for quite some time, so I don't see
many legal issues in this respect.



LieGrue,
strub

On Tuesday, 13 May 2014, 6:09, Chris Aniszczyk wrote:
 
>We would like to propose Parquet as an Apache Incubator project.
>https://wiki.apache.org/incubator/ParquetProposal
>
>Feel free to comment, we'll go for a vote in a week or two or whenever
>consensus has been reached on the proposal.
>
>I've posted the text of the proposal below:
>
>== Abstract ==
>Parquet is a columnar storage format for Hadoop.
>
>== Proposal ==
>
>We created Parquet to make the advantages of compressed, efficient columnar
>data representation available to any project in the Hadoop ecosystem,
>regardless of the choice of data processing framework, data model, or
>programming language.
>
>== Background ==
>
>Parquet is built from the ground up with complex nested data structures in
>mind, and uses the repetition/definition level approach to encoding such
>data structures, as popularized by Google Dremel (
>https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We believe
>this approach is superior to simple flattening of nested namespaces.
>
>Parquet is built to support very efficient compression and encoding
>schemes. Parquet allows compression schemes to be specified on a per-column
>level, and is future-proofed to allow adding more encodings as they are
>invented and implemented. We separate the concepts of encoding and
>compression, allowing Parquet consumers to implement operators that work
>directly on encoded data without paying a decompression and decoding penalty
>when possible.
>
>== Rationale ==
>
>Parquet is built to be used by anyone. We believe that an efficient,
>well-implemented columnar storage substrate should be useful to all
>frameworks without the cost of extensive, difficult-to-set-up
>dependencies.
>
>Furthermore, the rapid growth of the Parquet community is empowered by open
>source. We believe the Apache Software Foundation is a great fit as the long-term
>home for Parquet, as it provides an established process for
>community-driven development and decision making by consensus. This is
>exactly the model we want for future Parquet development.
>
>== Initial Goals ==
>
>* Move the existing codebase to Apache
>* Integrate with the Apache development process
>* Ensure all dependencies are compliant with Apache License version 2.0
>* Incremental development and releases per Apache guidelines
>
>== Current Status ==
>
>Parquet has undergone 2 major releases of the core format
>(https://github.com/Parquet/parquet-format/releases) and 22 releases of the
>supporting set of Java libraries
>(https://github.com/Parquet/parquet-mr/releases).
>
>The Parquet source is currently hosted at GitHub, which will seed the
>Apache git repository.
>
>=== Meritocracy ===
>
>We plan to invest in supporting a meritocracy. We will discuss the
>requirements in an open forum. Several companies have already expressed
>interest in this project, and we intend to invite additional developers to
>participate. We will encourage and monitor community participation so that
>privileges can be extended to those who contribute.
>
>=== Community ===
>
>There is a strong need for an advanced columnar storage format for Hadoop.
>Parquet is being used in production by many organizations (see
>https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md):
>
>* Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
>* Criteo: https://twitter.com/julsimon/statuses/312114074911666177
>* Salesforce: https://twitter.com/TwitterOSS/statuses/392734610116726784
>* Stripe: https://twitter.com/avibryant/statuses/391339949250715648
>* Twitter: https://twitter.com/J_/statuses/315844725611581441
>
>By bringing Parquet into Apache, we believe that the community will grow
>even bigger.
>
>=== Core Developers ===
>
>Parquet was initially developed as a collaboration between Twitter,
>Cloudera and Criteo.
>
>See
>https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
>
>=== Alignment ===
>
>We believe that having Parquet at Apache will help further the growth of
>the big-data community, as it will encourage cooperation within the greater
>ecosystem of projects spawned by Apache Hadoop. The alignment is also
>beneficial to other Apache communities (such as Hadoop, Hive, Avro).
>
>== Known Risks ==
>
>=== Orphaned Products ===
>
>The risk of the Parquet project being abandoned is minimal. There are many
>organizations using Parquet in production, including Twitter, Cloudera,
>Stripe, and Salesforce (
>http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
>
>=== Inexperience with Open Source ===
>
>Parquet has existed as a healthy open source project for one year. During that
>time, we have curated an open-source community successfully, attracting
>over 40 contributors (see
>https://github.com/Parquet/parquet-mr/graphs/contributors) from a diverse
>group of companies.
>Several of the core contributors to the project are deeply familiar with
>OSS and Apache specifically: Julien Le Dem is the current PMC Chair for
>Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan Coveney are
>also Apache Pig committers with contributions to several other Apache
>projects. Todd Lipcon and Tom White are committers to Apache Hadoop and
>multiple other related projects. Brock Noland is a Hive committer.
>
>=== Homogenous Developers ===
>
>The initial committers come from a number of companies and countries.
>Parquet has an active community of developers, and we are committed to
>recruiting additional committers based on their contributions to the
>project. The Java library component alone has contributions from 31
>individual GitHub accounts, 14 of which contributed over 1000 lines of code.
>
>=== Reliance on Salaried Developers ===
>
>It is expected that Parquet development will occur on both salaried time
>and on volunteer time, after hours. The majority of initial committers are
>paid by their employers to contribute to this project. However, they are
>all passionate about the project, and we are confident that the project
>will continue even if no salaried developers contribute to the project. As
>evidence of this statement, we present the GitHub punchcard (see
>https://github.com/Parquet/parquet-mr/graphs/punch-card) showing that a lot
>of activity happens on weekends. We are committed to recruiting additional
>committers including non-salaried developers.
>
>=== Relationships with Other Apache Products ===
>
>As mentioned in the Alignment section, Parquet is closely related to
>Hadoop, Pig, Avro, Thrift, YARN and Mesos in numerous ways. We look
>forward to collaborating with those communities, as well as other Apache
>communities (including Apache S4 which focuses on stateful low-latency
>processing).
>
>=== An Excessive Fascination with the Apache Brand ===
>
>Parquet is already a healthy and well-known open source project. This
>proposal is not for the purpose of generating publicity. Rather, the
>primary benefits to joining Apache are those outlined in the Rationale
>section.
>
>== Documentation ==
>
>Documentation currently lives in README Markdown files:
>
>* https://github.com/Parquet/parquet-format
>* https://github.com/Parquet/parquet-mr
>
>== Source and Intellectual Property Submission Plan ==
>
>The Parquet codebase is currently hosted on GitHub:
>https://github.com/Parquet.
>
>This is the exact codebase that we would migrate to the Apache Software Foundation.
>
>== External Dependencies ==
>
>* JUnit: EPL
>* Apache Commons: ALv2
>* Apache Thrift: ALv2
>* Apache Maven: ALv2
>* Apache Avro: ALv2
>* Apache Hadoop: ALv2
>* Google Guava: ALv2
>
>== Cryptography ==
>
>We do not expect Parquet to be a controlled export item due to the use of
>encryption.
>
>== Required Resources ==
>
>=== Mailing lists ===
>
>* parquet-dev
>* parquet-user
>
>== Subversion Directory ==
>
>Git is the preferred source control system: git://git.apache.org/parquet
>
>== Issue Tracking ==
>
>JIRA: Parquet (PARQUET)
>
>== Initial Committers ==
>
>* Aniket Mokashi
>* Brock Noland
>* Chris Aniszczyk
>* Dmitriy Ryaboy
>* Jake Farrell
>* Julien Le Dem
>* Lukas Nalezenec
>* Marcel Kornacker
>* Mickael Lacour
>* Nong Li
>* Remy Pecqueur
>* Tianshuo Deng
>* Tom White
>
>== Affiliations ==
>
>* Aniket Mokashi - Twitter
>* Brock Noland - Cloudera
>* Chris Aniszczyk - Twitter
>* Dmitriy Ryaboy - Twitter
>* Jake Farrell
>* Julien Le Dem - Twitter
>* Lukas Nalezenec
>* Marcel Kornacker - Cloudera
>* Mickael Lacour - Criteo
>* Nong Li - Cloudera
>* Remy Pecqueur - Criteo
>* Tianshuo Deng - Twitter
>* Tom White - Cloudera
>
>== Sponsors ==
>
>=== Champion ===
>
>* Todd Lipcon
>
>=== Nominated Mentors ===
>
>* Tom White
>* Chris Mattmann
>* Jake Farrell
>
>=== Sponsoring Entity ===
>
>The Apache Incubator
>
>-- 
>Cheers,
>
>Chris Aniszczyk
>http://aniszczyk.org
>+1 512 961 6719
>
>
>
