Thanks Dmitriy. That's what I thought we would do, but I just wanted to make sure. The proposal has been updated to reflect the main imports that we will need.
-Jake

On Fri, May 16, 2014 at 6:55 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:

> Roman -- you've been added, thanks for volunteering :)
>
> On Jake's repo question, I think it would be somewhat problematic, and
> we'd prefer multiple git repos. Format and the Java code are currently
> versioned separately and have different builds (format is much lighter).
> Plus, if we add, for example, a general-use C library for reading Parquet
> files, we wouldn't want to conflate its build with the mvn-based builds
> for parquet-mr.
>
> On Fri, May 16, 2014 at 11:56 AM, Roman Shaposhnik <r...@apache.org> wrote:
>
> > Hi!
> >
> > The proposal looks good to me and I am very much looking
> > forward to a voting thread.
> >
> > One small request: since I plan to spend a fair amount
> > of time on Parquet anyway, would you guys be ok
> > with adding me as an extra mentor so I can help
> > with that aspect of the project as well?
> >
> > Thanks,
> > Roman.
> >
> > P.S. Plus it has an added benefit of increasing diversity
> > of affiliations from the get go.
> >
> > On Mon, May 12, 2014 at 10:02 AM, Chris Aniszczyk <caniszc...@gmail.com>
> > wrote:
> > > We would like to propose Parquet as an Apache Incubator project.
> > > https://wiki.apache.org/incubator/ParquetProposal
> > >
> > > Feel free to comment; we'll go for a vote in a week or two or whenever
> > > consensus has been reached on the proposal.
> > >
> > > I've posted the text of the proposal below:
> > >
> > > == Abstract ==
> > > Parquet is a columnar storage format for Hadoop.
> > >
> > > == Proposal ==
> > >
> > > We created Parquet to make the advantages of compressed, efficient
> > > columnar data representation available to any project in the Hadoop
> > > ecosystem, regardless of the choice of data processing framework, data
> > > model, or programming language.
> > >
> > > == Background ==
> > >
> > > Parquet is built from the ground up with complex nested data structures
> > > in mind, and uses the repetition/definition level approach to encoding
> > > such data structures, as popularized by Google Dremel
> > > (https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We
> > > believe this approach is superior to simple flattening of nested name
> > > spaces.
> > >
> > > Parquet is built to support very efficient compression and encoding
> > > schemes. Parquet allows compression schemes to be specified on a
> > > per-column level, and is future-proofed to allow adding more encodings
> > > as they are invented and implemented. We separate the concepts of
> > > encoding and compression, allowing Parquet consumers to implement
> > > operators that work directly on encoded data without paying a
> > > decompression and decoding penalty when possible.
> > >
> > > == Rationale ==
> > >
> > > Parquet is built to be used by anyone. We believe that an efficient,
> > > well-implemented columnar storage substrate should be useful to all
> > > frameworks without the cost of extensive and difficult-to-set-up
> > > dependencies.
> > >
> > > Furthermore, the rapid growth of the Parquet community is empowered by
> > > open source. We believe the Apache foundation is a great fit as the
> > > long-term home for Parquet, as it provides an established process for
> > > community-driven development and decision making by consensus. This is
> > > exactly the model we want for future Parquet development.
> > >
> > > == Initial Goals ==
> > >
> > > * Move the existing codebase to Apache
> > > * Integrate with the Apache development process
> > > * Ensure all dependencies are compliant with Apache License version 2.0
> > > * Incremental development and releases per Apache guidelines
> > >
> > > == Current Status ==
> > >
> > > Parquet has undergone 2 major releases of the core format
> > > (https://github.com/Parquet/parquet-format/releases) and 22 releases
> > > of the supporting set of Java libraries
> > > (https://github.com/Parquet/parquet-mr/releases).
> > >
> > > The Parquet source is currently hosted at GitHub, which will seed the
> > > Apache git repository.
> > >
> > > === Meritocracy ===
> > >
> > > We plan to invest in supporting a meritocracy. We will discuss the
> > > requirements in an open forum. Several companies have already expressed
> > > interest in this project, and we intend to invite additional developers
> > > to participate. We will encourage and monitor community participation
> > > so that privileges can be extended to those that contribute.
> > >
> > > === Community ===
> > >
> > > There is a strong need for an advanced columnar storage format for
> > > Hadoop. Parquet is being used in production by many organizations (see
> > > https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md):
> > >
> > > * Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
> > > * Criteo: https://twitter.com/julsimon/statuses/312114074911666177
> > > * Salesforce: https://twitter.com/TwitterOSS/statuses/392734610116726784
> > > * Stripe: https://twitter.com/avibryant/statuses/391339949250715648
> > > * Twitter: https://twitter.com/J_/statuses/315844725611581441
> > >
> > > By bringing Parquet into Apache, we believe that the community will
> > > grow even bigger.
> > >
> > > === Core Developers ===
> > >
> > > Parquet was initially developed as a collaboration between Twitter,
> > > Cloudera and Criteo.
> > >
> > > See
> > > https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
> > >
> > > === Alignment ===
> > >
> > > We believe that having Parquet at Apache will help further the growth
> > > of the big-data community, as it will encourage cooperation within the
> > > greater ecosystem of projects spawned by Apache Hadoop. The alignment
> > > is also beneficial to other Apache communities (such as Hadoop, Hive,
> > > Avro).
> > >
> > > == Known Risks ==
> > >
> > > === Orphaned Products ===
> > >
> > > The risk of the Parquet project being abandoned is minimal. There are
> > > many organizations using Parquet in production, including Twitter,
> > > Cloudera, Stripe, and Salesforce
> > > (http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
> > >
> > > === Inexperience with Open Source ===
> > >
> > > Parquet has existed as a healthy open source project for one year.
> > > During that time, we have curated an open-source community
> > > successfully, attracting over 40 contributors (see
> > > https://github.com/Parquet/parquet-mr/graphs/contributors) from a
> > > diverse group of companies.
> > > Several of the core contributors to the project are deeply familiar
> > > with OSS and Apache specifically: Julien Le Dem is the current PMC
> > > Chair for Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan
> > > Coveney are also Apache Pig committers with contributions to several
> > > other Apache projects. Todd Lipcon and Tom White are committers to
> > > Apache Hadoop and multiple other related projects. Brock Noland is a
> > > Hive committer.
> > >
> > > === Homogenous Developers ===
> > >
> > > The initial committers come from a number of companies and countries.
> > > Parquet has an active community of developers, and we are committed to
> > > recruiting additional committers based on their contributions to the
> > > project.
> > > The Java library component alone has contributions from 31
> > > individual GitHub accounts, 14 of which contributed over 1000 lines of
> > > code.
> > >
> > > === Reliance on Salaried Developers ===
> > >
> > > It is expected that Parquet development will occur on both salaried
> > > time and on volunteer time, after hours. The majority of initial
> > > committers are paid by their employers to contribute to this project.
> > > However, they are all passionate about the project, and we are
> > > confident that the project will continue even if no salaried
> > > developers contribute to the project. As evidence of this statement,
> > > we present the GitHub punchcard (see
> > > https://github.com/Parquet/parquet-mr/graphs/punch-card) showing that
> > > a lot of activity happens on weekends. We are committed to recruiting
> > > additional committers including non-salaried developers.
> > >
> > > === Relationships with Other Apache Products ===
> > >
> > > As mentioned in the Alignment section, Parquet is closely related to
> > > Hadoop, Pig, Avro, Thrift, YARN and Mesos in numerous ways. We look
> > > forward to collaborating with those communities, as well as other
> > > Apache communities (including Apache S4 which focuses on stateful
> > > low-latency processing).
> > >
> > > === An Excessive Fascination with the Apache Brand ===
> > >
> > > Parquet is an already healthy and well-known open source project. This
> > > proposal is not for the purpose of generating publicity. Rather, the
> > > primary benefits to joining Apache are those outlined in the Rationale
> > > section.
> > >
> > > == Documentation ==
> > >
> > > Documentation is currently located as README markdown files:
> > >
> > > * https://github.com/Parquet/parquet-format
> > > * https://github.com/Parquet/parquet-mr
> > >
> > > == Source and Intellectual Property Submission Plan ==
> > >
> > > The Parquet codebase is currently hosted on GitHub:
> > > https://github.com/Parquet.
> > >
> > > This is the exact codebase that we would migrate to the Apache
> > > foundation.
> > >
> > > == External Dependencies ==
> > >
> > > * JUnit: EPL
> > > * Apache Commons: ALv2
> > > * Apache Thrift: ALv2
> > > * Apache Maven: ALv2
> > > * Apache Avro: ALv2
> > > * Apache Hadoop: ALv2
> > > * Google Guava: ALv2
> > >
> > > == Cryptography ==
> > >
> > > We do not expect Parquet to be a controlled export item due to the use
> > > of encryption.
> > >
> > > == Required Resources ==
> > >
> > > === Mailing lists ===
> > >
> > > * parquet-dev
> > > * parquet-user
> > >
> > > == Subversion Directory ==
> > >
> > > Git is the preferred source control system:
> > > git://git.apache.org/parquet
> > >
> > > == Issue Tracking ==
> > >
> > > JIRA: Parquet (PARQUET)
> > >
> > > == Initial Committers ==
> > >
> > > * Aniket Mokashi
> > > * Brock Noland
> > > * Chris Aniszczyk <z...@twitter.com>
> > > * Dmitriy Ryaboy <dmit...@twitter.com>
> > > * Jake Farrell
> > > * Julien Le Dem <jul...@apache.org>
> > > * Lukas Nalezenec
> > > * Marcel Kornacker
> > > * Mickael Lacour
> > > * Nong Li
> > > * Remy Pecqueur
> > > * Tianshuo Deng
> > > * Tom White
> > >
> > > == Affiliations ==
> > >
> > > * Aniket Mokashi - Twitter
> > > * Brock Noland - Cloudera
> > > * Chris Aniszczyk - Twitter
> > > * Dmitriy Ryaboy - Twitter
> > > * Jake Farrell
> > > * Julien Le Dem - Twitter
> > > * Lukas Nalezenec
> > > * Marcel Kornacker - Cloudera
> > > * Mickael Lacour - Criteo
> > > * Nong Li - Cloudera
> > > * Remy Pecqueur - Criteo
> > > * Tianshuo Deng - Twitter
> > > * Tom White - Cloudera
> > >
> > > == Sponsors ==
> > >
> > > === Champion ===
> > >
> > > * Todd Lipcon
> > >
> > > === Nominated Mentors ===
> > >
> > > * Tom White
> > > * Chris Mattmann
> > > * Jake Farrell
> > >
> > > === Sponsoring Entity ===
> > >
> > > The Apache Incubator
> > >
> > > --
> > > Cheers,
> > >
> > > Chris Aniszczyk
> > > http://aniszczyk.org
> > > +1 512 961 6719