Re: [PROPOSAL] Parquet

Jake Farrell Fri, 16 May 2014 21:59:21 -0700

Thanks Dmitriy
That's what I thought we would do, but I just wanted to make sure. The proposal
has been updated to reflect the main imports that we will need.


-Jake


On Fri, May 16, 2014 at 6:55 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:

> Roman -- you've been added, thanks for volunteering :)
>
> On Jake's repo question, I think it would be somewhat problematic, and
> we'd prefer multiple git repos. Format and the java code are currently
> versioned separately and have different builds (format is much lighter).
> Plus, if we add, for example, a general-use C library for reading Parquet
> files, we wouldn't want to conflate its build with the mvn-based builds for
> parquet-mr.
>
>
>
>
> On Fri, May 16, 2014 at 11:56 AM, Roman Shaposhnik <r...@apache.org> wrote:
>
> > Hi!
> >
> > The proposal looks good to me and I am very much looking
> > forward to a voting thread.
> >
> > One small request, since I plan to spend a fair amount
> > of time on Parquet anyway, would you guys be ok
> > with adding me as an extra mentor so I can help
> > with that aspect of the project as well?
> >
> > Thanks,
> > Roman.
> >
> > P.S. Plus it has an added benefit of increasing diversity
> > of affiliations from the get go.
> >
> > On Mon, May 12, 2014 at 10:02 AM, Chris Aniszczyk <caniszc...@gmail.com>
> > wrote:
> > > We would like to propose Parquet as an Apache Incubator project.
> > > https://wiki.apache.org/incubator/ParquetProposal
> > >
> > > Feel free to comment, we'll go for a vote in a week or two or whenever
> > > consensus has been reached on the proposal.
> > >
> > > I've posted the text of the proposal below:
> > >
> > > == Abstract ==
> > > Parquet is a columnar storage format for Hadoop.
> > >
> > > == Proposal ==
> > >
> > > We created Parquet to make the advantages of compressed, efficient
> > > columnar data representation available to any project in the Hadoop
> > > ecosystem, regardless of the choice of data processing framework, data
> > > model, or programming language.
> > >
> > > == Background ==
> > >
> > > Parquet is built from the ground up with complex nested data structures
> > > in mind, and uses the repetition/definition level approach to encoding
> > > such data structures, as popularized by Google Dremel (
> > > https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We
> > > believe this approach is superior to simple flattening of nested name
> > > spaces.
> > >
> > > Parquet is built to support very efficient compression and encoding
> > > schemes. Parquet allows compression schemes to be specified on a
> > > per-column level, and is future-proofed to allow adding more encodings as
> > > they are invented and implemented. We separate the concepts of encoding
> > > and compression, allowing Parquet consumers to implement operators that
> > > work directly on encoded data without paying a decompression and decoding
> > > penalty when possible.
> > >
> > > == Rationale ==
> > >
> > > Parquet is built to be used by anyone. We believe that an efficient,
> > > well-implemented columnar storage substrate should be useful to all
> > > frameworks without the cost of extensive and difficult to set up
> > > dependencies.
> > >
> > > Furthermore, the rapid growth of the Parquet community is empowered by
> > > open source. We believe the Apache foundation is a great fit as the
> > > long-term home for Parquet, as it provides an established process for
> > > community-driven development and decision making by consensus. This is
> > > exactly the model we want for future Parquet development.
> > >
> > > == Initial Goals ==
> > >
> > > * Move the existing codebase to Apache
> > > * Integrate with the Apache development process
> > > * Ensure all dependencies are compliant with Apache License version 2.0
> > > * Incremental development and releases per Apache guidelines
> > >
> > > == Current Status ==
> > >
> > > Parquet has undergone 2 major releases of the core format (
> > > https://github.com/Parquet/parquet-format/releases) and 22 releases of the
> > > supporting set of Java libraries (
> > > https://github.com/Parquet/parquet-mr/releases).
> > >
> > > The Parquet source is currently hosted at GitHub, which will seed the
> > > Apache git repository.
> > >
> > > === Meritocracy ===
> > >
> > > We plan to invest in supporting a meritocracy. We will discuss the
> > > requirements in an open forum. Several companies have already expressed
> > > interest in this project, and we intend to invite additional developers
> > > to participate. We will encourage and monitor community participation so
> > > that privileges can be extended to those that contribute.
> > >
> > > === Community ===
> > >
> > > There is a large need for an advanced columnar storage format for
> > > Hadoop. Parquet is being used in production by many organizations (see
> > > https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md):
> > >
> > >  * Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
> > >  * Criteo: https://twitter.com/julsimon/statuses/312114074911666177
> > >  * Salesforce: https://twitter.com/TwitterOSS/statuses/392734610116726784
> > >  * Stripe: https://twitter.com/avibryant/statuses/391339949250715648
> > >  * Twitter: https://twitter.com/J_/statuses/315844725611581441
> > >
> > > By bringing Parquet into Apache, we believe that the community will
> > > grow even bigger.
> > >
> > > === Core Developers ===
> > >
> > > Parquet was initially developed as a collaboration between Twitter,
> > > Cloudera and Criteo.
> > >
> > > See
> > > https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
> > >
> > > === Alignment ===
> > >
> > > We believe that having Parquet at Apache will help further the growth
> > > of the big-data community, as it will encourage cooperation within the
> > > greater ecosystem of projects spawned by Apache Hadoop. The alignment is
> > > also beneficial to other Apache communities (such as Hadoop, Hive, Avro).
> > >
> > > == Known Risks ==
> > >
> > > === Orphaned Products ===
> > >
> > > The risk of the Parquet project being abandoned is minimal. There are
> > > many organizations using Parquet in production, including Twitter,
> > > Cloudera, Stripe, and Salesforce (
> > > http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
> > >
> > > === Inexperience with Open Source ===
> > >
> > > Parquet has existed as a healthy open source project for one year. During
> > > that time, we have curated an open-source community successfully,
> > > attracting over 40 contributors (see
> > > https://github.com/Parquet/parquet-mr/graphs/contributors) from a
> > > diverse group of companies.
> > > Several of the core contributors to the project are deeply familiar
> > > with OSS and Apache specifically: Julien Le Dem is the current PMC Chair
> > > for Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan Coveney
> > > are also Apache Pig committers with contributions to several other Apache
> > > projects. Todd Lipcon and Tom White are committers to Apache Hadoop and
> > > multiple other related projects. Brock Noland is a Hive committer.
> > >
> > > === Homogenous Developers ===
> > >
> > > The initial committers come from a number of companies and countries.
> > > Parquet has an active community of developers, and we are committed to
> > > recruiting additional committers based on their contributions to the
> > > project. The Java library component alone has contributions from 31
> > > individual GitHub accounts, 14 of which contributed over 1000 lines of
> > > code.
> > >
> > > === Reliance on Salaried Developers ===
> > >
> > > It is expected that Parquet development will occur on both salaried
> > > time and on volunteer time, after hours. The majority of initial
> > > committers are paid by their employers to contribute to this project.
> > > However, they are all passionate about the project, and we are confident
> > > that the project will continue even if no salaried developers contribute
> > > to the project. As evidence of this statement, we present the GitHub
> > > punchcard (see https://github.com/Parquet/parquet-mr/graphs/punch-card)
> > > showing that a lot of activity happens on weekends. We are committed to
> > > recruiting additional committers including non-salaried developers.
> > >
> > > === Relationships with Other Apache Products ===
> > >
> > > As mentioned in the Alignment section, Parquet is closely related to
> > > Hadoop, Pig, Avro, Thrift, YARN and Mesos in numerous ways. We look
> > > forward to collaborating with those communities, as well as other
> > > Apache communities (including Apache S4 which focuses on stateful
> > > low-latency processing).
> > >
> > > === An Excessive Fascination with the Apache Brand ===
> > >
> > > Parquet is an already healthy and well known open source project. This
> > > proposal is not for the purpose of generating publicity. Rather, the
> > > primary benefits to joining Apache are those outlined in the Rationale
> > > section.
> > >
> > > == Documentation ==
> > >
> > > Documentation is currently located as README markdown files:
> > >
> > > * https://github.com/Parquet/parquet-format
> > > * https://github.com/Parquet/parquet-mr
> > >
> > > == Source and Intellectual Property Submission Plan ==
> > >
> > > The Parquet codebase is currently hosted on GitHub:
> > > https://github.com/Parquet.
> > >
> > > This is the exact codebase that we would migrate to the Apache
> > > foundation.
> > >
> > > == External Dependencies ==
> > >
> > >  * JUnit: EPL
> > >  * Apache Commons: ALv2
> > >  * Apache Thrift: ALv2
> > >  * Apache Maven: ALv2
> > >  * Apache Avro: ALv2
> > >  * Apache Hadoop: ALv2
> > >  * Google Guava: ALv2
> > >
> > > == Cryptography ==
> > >
> > > We do not expect Parquet to be a controlled export item due to the use
> > > of encryption.
> > >
> > > == Required Resources ==
> > >
> > > === Mailing lists ===
> > >
> > >  * parquet-dev
> > >  * parquet-user
> > >
> > > == Subversion Directory ==
> > >
> > > Git is the preferred source control system: git://git.apache.org/parquet
> > >
> > > == Issue Tracking ==
> > >
> > > JIRA: Parquet (PARQUET)
> > >
> > > == Initial Committers ==
> > >
> > >  * Aniket Mokashi
> > >  * Brock Noland
> > >  * Chris Aniszczyk <z...@twitter.com>
> > >  * Dmitriy Ryaboy <dmit...@twitter.com>
> > >  * Jake Farrell
> > >  * Julien Le Dem <jul...@apache.org>
> > >  * Lukas Nalezenec
> > >  * Marcel Kornacker
> > >  * Mickael Lacour
> > >  * Nong Li
> > >  * Remy Pecqueur
> > >  * Tianshuo Deng
> > >  * Tom White
> > >
> > > == Affiliations ==
> > >
> > >  * Aniket Mokashi - Twitter
> > >  * Brock Noland - Cloudera
> > >  * Chris Aniszczyk - Twitter
> > >  * Dmitriy Ryaboy - Twitter
> > >  * Jake Farrell
> > >  * Julien Le Dem - Twitter
> > >  * Lukas Nalezenec
> > >  * Marcel Kornacker - Cloudera
> > >  * Mickael Lacour - Criteo
> > >  * Nong Li - Cloudera
> > >  * Remy Pecqueur - Criteo
> > >  * Tianshuo Deng - Twitter
> > >  * Tom White - Cloudera
> > >
> > > == Sponsors ==
> > >
> > > === Champion ===
> > >
> > >  * Todd Lipcon
> > >
> > > === Nominated Mentors ===
> > >
> > >  * Tom White
> > >  * Chris Mattmann
> > >  * Jake Farrell
> > >
> > > === Sponsoring Entity ===
> > >
> > > The Apache Incubator
> > >
> > > --
> > > Cheers,
> > >
> > > Chris Aniszczyk
> > > http://aniszczyk.org
> > > +1 512 961 6719
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > For additional commands, e-mail: general-h...@incubator.apache.org
> >
> >
>
