[RESULT] [VOTE] Accept the Iceberg project for incubation

2018-11-16 Thread Ryan Blue
The vote passes with 13 binding +1 and 5 non-binding +1 votes.

Thank you for voting, everyone! I'll get started with the next steps.

+1 votes:
Ryan Blue*
Matt Sicker*
Felix Cheung
Dave Fisher*
Owen O'Malley*
Hugo Louro
Arthur Wiedmer
Julian Hyde*
Kevin A. McGrail*
Willem Jiang*
James Taylor*
Uwe Korn
Lars Francke*
Jean-Baptiste Onofré*
Olivier Lamy*
Michael Wall*
Kenneth Knowles
Julien Le Dem*

* = binding

On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue  wrote:

> The discuss thread seems to have reached consensus, so I propose accepting
> the Iceberg project for incubation.
>
> The proposal is copied below and in the wiki:
> https://wiki.apache.org/incubator/IcebergProposal
>
> Please vote on whether to accept Iceberg in the next 72 hours:
>
> [ ] +1, accept Iceberg for incubation
> [ ] -1, reject the Iceberg proposal because . . .
>
> Thank you for reviewing the proposal and voting,
>
> rb
> --
> Iceberg Proposal
>
> Abstract
>
> Iceberg is a table format for large, slow-moving tabular data.
>
> It is designed to improve on the de-facto standard table layout built into
> Apache Hive, Presto, and Apache Spark.
>
> Proposal
>
> The purpose of Iceberg is to provide SQL-like tables that are backed by
> large sets of data files. Iceberg is similar to the Hive table layout, the
> de-facto standard structure used to track files in a table, but provides
> additional guarantees and performance optimizations:
>
> - Atomicity - Each change to the table will either complete or fail.
>   “Do or do not. There is no try.”
> - Snapshot isolation - Reads use one and only one snapshot of a table
>   at some time without holding a lock.
> - Safe schema evolution - A table’s schema can change in well-defined
>   ways, without breaking older data files.
> - Column projection - An engine may request a subset of the available
>   columns, including nested fields.
> - Predicate pushdown - An engine can push filters into read planning
>   to improve performance using partition data and file-level statistics.
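
The atomicity and snapshot isolation guarantees listed above can be illustrated with a small sketch. This is not the Iceberg API: the `Snapshot` and `Table` classes, the `commit` method, and the use of a lock as a stand-in for an atomic metadata swap are all assumptions made for illustration only.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of a table: an id plus the data files it contains."""
    snapshot_id: int
    files: tuple


class Table:
    """Hypothetical in-memory table; NOT the Iceberg API."""

    def __init__(self):
        self._current = Snapshot(0, ())
        # Assumption: the lock stands in for an atomic swap of table metadata.
        self._lock = threading.Lock()

    def current_snapshot(self):
        # Readers grab one snapshot reference and hold no lock afterwards.
        return self._current

    def commit(self, base, added_files):
        """Install base + added_files only if base is still the current snapshot."""
        new = Snapshot(base.snapshot_id + 1, base.files + tuple(added_files))
        with self._lock:
            if self._current is not base:
                return False  # another writer committed first; nothing changed
            self._current = new
            return True


table = Table()
reader_view = table.current_snapshot()
committed = table.commit(reader_view, ["a.parquet"])
stale_commit = table.commit(reader_view, ["b.parquet"])
```

In this sketch a commit either replaces the current snapshot entirely or leaves the table untouched, and a reader that already holds `reader_view` keeps seeing the same files regardless of later commits.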
>
> Iceberg does NOT define a new file format. All data is stored in Apache
> Avro, Apache ORC, or Apache Parquet files.
>
> Additionally, Iceberg is designed to work well when data files are stored
> in cloud blob stores, even when those systems provide weaker guarantees
> than a file system, including:
>
> - Eventual consistency in the namespace
> - High latency for directory listings
> - No renames of objects
> - No folder hierarchy
>
> Rationale
>
> Initial benchmarks show dramatic improvements in query planning. For
> example, Netflix’s Atlas use case stores time-series metrics from Netflix
> runtime systems, with 1 month of data stored across 2.7 million files in
> 2,688 partitions:
>
> - Hive table using Parquet:
>   - 400k+ splits, not combined
>   - Explain query: 9.6 minutes wall time (planning only)
> - Iceberg table with partition filtering:
>   - 15,218 splits, combined
>   - Planning: 10 seconds
>   - Query wall time: 13 minutes
> - Iceberg table with partition and min/max filtering:
>   - 412 splits
>   - Planning: 25 seconds
>   - Query wall time: 42 seconds
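
The drop in split counts above comes from skipping files during planning. A minimal sketch of partition filtering followed by min/max filtering is shown here, assuming each data file records its partition value and the lower and upper bounds of one column. The `DataFile` fields and the `plan_files` function are hypothetical names for illustration, not part of Iceberg.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str   # hypothetical partition value, e.g. a date string
    min_value: int   # lower bound of the filtered column in this file
    max_value: int   # upper bound of the filtered column in this file


def plan_files(files, partition, lo, hi):
    """Return files that may hold rows with lo <= value <= hi in the partition."""
    selected = []
    for f in files:
        if f.partition != partition:
            continue  # partition filtering: wrong partition, skip the file
        if f.max_value < lo or f.min_value > hi:
            continue  # min/max filtering: value ranges cannot overlap
        selected.append(f)
    return selected


files = [
    DataFile("a.parquet", "2018-11-13", 0, 9),
    DataFile("b.parquet", "2018-11-13", 10, 19),
    DataFile("c.parquet", "2018-11-14", 0, 9),
]
narrow = plan_files(files, "2018-11-13", 12, 15)
wide = plan_files(files, "2018-11-13", 9, 10)
```

Only files whose statistics overlap the requested range are kept, so the planner never opens the others. The bounds are inclusive, which is why the second call keeps both files in the partition.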
>
> These performance gains combined with the cross-engine compatibility are a
> very compelling story.
>
> Initial Goals
>
> The initial goal will be to move the existing codebase to Apache and
> integrate with the Apache development process and infrastructure. A primary
> goal of incubation will be to grow and diversify the Iceberg community. We
> are well aware that the project community consists largely of
> individuals from a single company. We aim to change that during incubation.
>
> Current Status
>
> As previously mentioned, Iceberg is under active development at Netflix,
> and is being used in processing large volumes of data in Amazon EC2.
>
> Iceberg license documentation is already based on Apache guidelines for
> LICENSE and NOTICE content.
>
> Meritocracy
>
> We value meritocracy and we understand that it is the basis for an open
> community that encourages multiple companies and individuals to contribute
> and be invested in the project’s future. We will encourage and monitor
> participation and make sure to extend privileges and responsibilities to
> all contributors.
>
> Community
>
> Iceberg is currently being used by developers at Netflix and a growing
> number of users are actively using it in production environments. Iceberg
> has received contributions from developers working at Hortonworks, WeWork,
> and Palantir. By bringing Iceberg to Apache we aim to assure current and
> future contributors that the Iceberg community is meritocratic and open, in
> order to broaden and diversify the user and developer community.
>
> Core Developers
>
> Iceberg was initially developed at Netflix and is under active
> development. We believe Iceberg will be of interest to a broad range of
> users and developers and that incubating the project at the ASF 

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-16 Thread Julien Le Dem
>
> +1

>
> From: Kenneth Knowles 
> Date: Thu, Nov 15, 2018 at 10:01 AM
> Subject: Re: [VOTE] Accept the Iceberg project for incubation
> To: 
>
>
> +1 (non-binding)
>
> On Thu, Nov 15, 2018 at 9:57 AM Michael Wall  wrote:
>
> > +1 (binding)
> >
> > On Thu, Nov 15, 2018 at 3:03 AM Olivier Lamy  wrote:
> >
> > > +1

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-15 Thread Kenneth Knowles
+1 (non-binding)

On Thu, Nov 15, 2018 at 9:57 AM Michael Wall  wrote:

> +1 (binding)
>
> On Thu, Nov 15, 2018 at 3:03 AM Olivier Lamy  wrote:
>
> > +1

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-15 Thread Michael Wall
+1 (binding)

On Thu, Nov 15, 2018 at 3:03 AM Olivier Lamy  wrote:

> +1

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-15 Thread Olivier Lamy
+1

On Wed, 14 Nov 2018 at 03:07, Ryan Blue  wrote:

> The discuss thread seems to have reached consensus, so I propose accepting
> the Iceberg project for incubation.
>
> The proposal is copied below and in the wiki:
> https://wiki.apache.org/incubator/IcebergProposal
>
> Please vote on whether to accept Iceberg in the next 72 hours:
>
> [ ] +1, accept Iceberg for incubation
> [ ] -1, reject the Iceberg proposal because . . .
>
> Thank you for reviewing the proposal and voting,
>
> rb
> --
> Iceberg Proposal
>
> Abstract
>
> Iceberg is a table format for large, slow-moving tabular data.
>
> It is designed to improve on the de-facto standard table layout built into
> Apache Hive, Presto, and Apache Spark.
>
> Proposal
>
> The purpose of Iceberg is to provide SQL-like tables that are backed by
> large sets of data files. Iceberg is similar to the Hive table layout, the
> de-facto standard structure used to track files in a table, but provides
> additional guarantees and performance optimizations:
>
> - Atomicity - Each change to the table will either complete or fail.
>   “Do or do not. There is no try.”
> - Snapshot isolation - Reads use one and only one snapshot of a table at
>   some time without holding a lock.
> - Safe schema evolution - A table’s schema can change in well-defined
>   ways, without breaking older data files.
> - Column projection - An engine may request a subset of the available
>   columns, including nested fields.
> - Predicate pushdown - An engine can push filters into read planning to
>   improve performance using partition data and file-level statistics.
>
> Iceberg does NOT define a new file format. All data is stored in Apache
> Avro, Apache ORC, or Apache Parquet files.
>
> Additionally, Iceberg is designed to work well when data files are stored
> in cloud blob stores, even when those systems provide weaker guarantees
> than a file system, including:
>
> - Eventual consistency in the namespace
> - High latency for directory listings
> - No renames of objects
> - No folder hierarchy
>
> Rationale
>
> Initial benchmarks show dramatic improvements in query planning. For
> example, Netflix’s Atlas use case stores time-series metrics from Netflix
> runtime systems, with 1 month of data stored across 2.7 million files in
> 2,688 partitions:
>
> - Hive table using Parquet:
>   - 400k+ splits, not combined
>   - Explain query: 9.6 minutes wall time (planning only)
> - Iceberg table with partition filtering:
>   - 15,218 splits, combined
>   - Planning: 10 seconds
>   - Query wall time: 13 minutes
> - Iceberg table with partition and min/max filtering:
>   - 412 splits
>   - Planning: 25 seconds
>   - Query wall time: 42 seconds
>
> These performance gains combined with the cross-engine compatibility are a
> very compelling story.
>
> Initial Goals
>
> The initial goal will be to move the existing codebase to Apache and
> integrate with the Apache development process and infrastructure. A primary
> goal of incubation will be to grow and diversify the Iceberg community. We
> are well aware that the project community consists largely of
> individuals from a single company. We aim to change that during incubation.
>
> Current Status
>
> As previously mentioned, Iceberg is under active development at Netflix,
> and is being used in processing large volumes of data in Amazon EC2.
>
> Iceberg license documentation is already based on Apache guidelines for
> LICENSE and NOTICE content.
>
> Meritocracy
>
> We value meritocracy and we understand that it is the basis for an open
> community that encourages multiple companies and individuals to contribute
> and be invested in the project’s future. We will encourage and monitor
> participation and make sure to extend privileges and responsibilities to
> all contributors.
>
> Community
>
> Iceberg is currently being used by developers at Netflix and a growing
> number of users are actively using it in production environments. Iceberg
> has received contributions from developers working at Hortonworks, WeWork,
> and Palantir. By bringing Iceberg to Apache we aim to assure current and
> future contributors that the Iceberg community is meritocratic and open, in
> order to broaden and diversify the user and developer community.
>
> Core Developers
>
> Iceberg was initially developed at Netflix and is under active development.
> We believe Iceberg will be of interest to a broad range of users and
> developers and that incubating the project at the ASF will help us build a
> diverse, sustainable community.
>
> Alignment
>
> Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive, ORC,
> Parquet, Pig, and Spark. We anticipate integration with additional Apache
> projects as the Iceberg community and interest in the project grows.
>
> Known Risks
>
> Orphaned Products
>
> Netflix is committed to the future development of Iceberg and understands
> that 

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-14 Thread Ryan Blue
Quick update: James Taylor has offered to mentor the project as well, so
I've added him to the list. Thanks, James!


Re: [VOTE] Accept the Iceberg project for incubation

2018-11-14 Thread Jean-Baptiste Onofré
+1 (binding)

Regards
JB

On 13/11/2018 18:06, Ryan Blue wrote:
> The discuss thread seems to have reached consensus, so I propose accepting
> the Iceberg project for incubation.
> 
> The proposal is copied below and in the wiki:
> https://wiki.apache.org/incubator/IcebergProposal
> 
> Please vote on whether to accept Iceberg in the next 72 hours:
> 
> [ ] +1, accept Iceberg for incubation
> [ ] -1, reject the Iceberg proposal because . . .
> 
> Thank you for reviewing the proposal and voting,
> 
> rb
> --
> Iceberg Proposal Abstract
> 
> Iceberg is a table format for large, slow-moving tabular data.
> 
> It is designed to improve on the de-facto standard table layout built into
> Apache Hive, Presto, and Apache Spark.
> Proposal
> 
> The purpose of Iceberg is to provide SQL-like tables that are backed by
> large sets of data files. Iceberg is similar to the Hive table layout, the
> de-facto standard structure used to track files in a table, but provides
> additional guarantees and performance optimizations:
> 
>- Atomicity - Each change to the table is will be complete or will fail.
>“Do or do not. There is no try.”
>- Snapshot isolation - Reads use one and only one snapshot of a table at
>some time without holding a lock.
>- Safe schema evolution - A table’s schema can change in well-defined
>ways, without breaking older data files.
>- Column projection - An engine may request a subset of the available
>columns, including nested fields.
>- Predicate pushdown - An engine can push filters into read planning to
>improve performance using partition data and file-level statistics.
> 
> Iceberg does NOT define a new file format. All data is stored in Apache
> Avro, Apache ORC, or Apache Parquet files.
> 
> Additionally, Iceberg is designed to work well when data files are stored
> in cloud blob stores, even when those systems provide weaker guarantees
> than a file system, including:
> 
>- Eventual consistency in the namespace
>- High latency for directory listings
>- No renames of objects
>- No folder hierarchy
> 
> Rationale
> 
> Initial benchmarks show dramatic improvements in query planning. For
> example, in Netflix’s Atlas use case, which stores time-series metrics from
> Netflix runtime systems and 1 month is stored across 2.7 million files in
> 2,688 partitions:
> 
>- Hive table using Parquet:
>   - 400k+ splits, not combined
>   - Explain query: 9.6 minutes wall time (planning only)
>- Iceberg table with partition filtering:
>   - 15,218 splits, combined
>   - Planning: 10 seconds
>   - Query wall time: 13 minutes
>- Iceberg table with partition and min/max filtering:
>   - 412 splits
>   - Planning: 25 seconds
>   - Query wall time: 42 seconds
> 
> These performance gains combined with the cross-engine compatibility are a
> very compelling story.
> Initial Goals
> 
> The initial goal will be to move the existing codebase to Apache and
> integrate with the Apache development process and infrastructure. A primary
> goal of incubation will be to grow and diversify the Iceberg community. We
> are well aware that the project community is largely comprised of
> individuals from a single company. We aim to change that during incubation.
> Current Status
> 
> As previously mentioned, Iceberg is under active development at Netflix,
> and is being used in processing large volumes of data in Amazon EC2.
> 
> Iceberg license documentation is already based on Apache guidelines for
> LICENSE and NOTICE content.
> Meritocracy
> 
> We value meritocracy and we understand that it is the basis for an open
> community that encourages multiple companies and individuals to contribute
> and be invested in the project’s future. We will encourage and monitor
> participation and make sure to extend privileges and responsibilities to
> all contributors.
> Community
> 
> Iceberg is currently being used by developers at Netflix and a growing
> number of users are actively using it in production environments. Iceberg
> has received contributions from developers working at Hortonworks, WeWork,
> and Palantir. By bringing Iceberg to Apache we aim to assure current and
> future contributors that the Iceberg community is meritocratic and open, in
> order to broaden and diversify the user and developer community.
> Core Developers
> 
> Iceberg was initially developed at Netflix and is under active development.
> We believe Iceberg will be of interest to a broad range of users and
> developers and that incubating the project at the ASF will help us build a
> diverse, sustainable community.
> Alignment
> 
> Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive, ORC,
> Parquet, Pig, and Spark. We anticipate integration with additional Apache
> projects as the Iceberg community and interest in the project grow.
> Known Risks
>
> Orphaned Products
> 
> Netflix is committed to the future development of 

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-14 Thread Lars Francke
+1 (binding)

On Wed, Nov 14, 2018 at 7:48 AM Uwe L. Korn  wrote:

> +1 (non-binding)
>
> Great to see this here!
>
> > On 14.11.2018 at 04:07, James Taylor wrote:
> >
> > +1 (binding)
> >
> >> On Tue, Nov 13, 2018 at 4:15 PM Willem Jiang wrote:
> >>
> >> +1 (binding)
> >>
> >> Willem Jiang
> >>
> >> Twitter: willemjiang
> >> Weibo: 姜宁willem

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-13 Thread Uwe L. Korn
+1 (non-binding)

Great to see this here!

> On 14.11.2018 at 04:07, James Taylor wrote:
> 
> +1 (binding)
> 
>> On Tue, Nov 13, 2018 at 4:15 PM Willem Jiang  wrote:
>> 
>> +1 (binding)
>> 
>> Willem Jiang
>> 
>> Twitter: willemjiang
>> Weibo: 姜宁willem

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-13 Thread James Taylor
+1 (binding)

On Tue, Nov 13, 2018 at 4:15 PM Willem Jiang  wrote:

> +1 (binding)
>
> Willem Jiang
>
> Twitter: willemjiang
> Weibo: 姜宁willem

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-13 Thread Willem Jiang
+1 (binding)

Willem Jiang

Twitter: willemjiang
Weibo: 姜宁willem


Re: [VOTE] Accept the Iceberg project for incubation

2018-11-13 Thread Kevin A. McGrail
+1 (binding)

On 11/13/2018 12:40 PM, Julian Hyde wrote:
> +1 (binding)
>
> Julian
>
>
>> On Nov 13, 2018, at 9:28 AM, Arthur Wiedmer  wrote:
>>
>> +1
>>
>> (Non-binding)
>>
>> Best,
>> Arthur
>>
>> On Tue, Nov 13, 2018, 09:24 Hugo Louro wrote:
>>
>>> +1 (non-binding)
>>>
>>>> On Nov 13, 2018, at 9:19 AM, Owen O'Malley wrote:
>>>>
>>>> +1 (binding)
>>>>
>>>>> On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher wrote:
>>>>>
>>>>> +1 (binding)
>>>>>
>>>>>> On Nov 13, 2018, at 9:10 AM, Matt Sicker wrote:
>>>>>>
>>>>>> +1 binding
>>>>>>
>>>>>>> On Tue, 13 Nov 2018 at 11:09, Ryan Blue wrote:
>>>>>>>
>>>>>>> +1 (binding)

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-13 Thread Julian Hyde
+1 (binding)

Julian


> On Nov 13, 2018, at 9:28 AM, Arthur Wiedmer  wrote:
> 
> +1
> 
> (Non-binding)
> 
> Best,
> Arthur
> 
> On Tue, Nov 13, 2018, 09:24 Hugo Louro wrote:
>
>> +1 (non-binding)
>>
>>> On Nov 13, 2018, at 9:19 AM, Owen O'Malley wrote:
>>>
>>> +1 (binding)
>>>
>>>> On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher wrote:
>>>>
>>>> +1 (binding)
>>>>
>>>>> On Nov 13, 2018, at 9:10 AM, Matt Sicker wrote:
>>>>>
>>>>> +1 binding
>>>>>
>>>>>> On Tue, 13 Nov 2018 at 11:09, Ryan Blue wrote:
>>>>>>
>>>>>> +1 (binding)
Re: [VOTE] Accept the Iceberg project for incubation

2018-11-13 Thread Arthur Wiedmer
+1

(Non-binding)

Best,
Arthur

On Tue, Nov 13, 2018, 09:24 Hugo Louro wrote:

> +1 (non-binding)
>
> > On Nov 13, 2018, at 9:19 AM, Owen O'Malley wrote:
> >
> > +1 (binding)
> >
> >> On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher wrote:
> >>
> >> +1 (binding)
> >>
> >>> On Nov 13, 2018, at 9:10 AM, Matt Sicker wrote:
> >>>
> >>> +1 binding
> >>>
> >>>> On Tue, 13 Nov 2018 at 11:09, Ryan Blue wrote:
> >>>>
> >>>> +1 (binding)
> 

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-13 Thread Hugo Louro
+1 (non-binding)

> On Nov 13, 2018, at 9:19 AM, Owen O'Malley wrote:
>
> +1 (binding)
>
>> On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher wrote:
>>
>> +1 (binding)
>>
>>> On Nov 13, 2018, at 9:10 AM, Matt Sicker wrote:
>>>
>>> +1 binding
>>>
>>>> On Tue, 13 Nov 2018 at 11:09, Ryan Blue wrote:
>>>>
>>>> +1 (binding)
 

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-13 Thread Owen O'Malley
+1 (binding)

On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher wrote:

> +1 (binding)
>
> > On Nov 13, 2018, at 9:10 AM, Matt Sicker wrote:
> >
> > +1 binding
> >
> > On Tue, 13 Nov 2018 at 11:09, Ryan Blue wrote:
> >
> >> +1 (binding)

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-13 Thread Dave Fisher
+1 (binding)

> On Nov 13, 2018, at 9:10 AM, Matt Sicker wrote:
>
> +1 binding
>
> On Tue, 13 Nov 2018 at 11:09, Ryan Blue wrote:
>
>> +1 (binding)

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-13 Thread Matt Sicker
+1 binding

On Tue, 13 Nov 2018 at 11:09, Ryan Blue wrote:

> +1 (binding)

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-13 Thread Felix Cheung
+1 (non binding)

Awesome to see this taken forward to the incubator. Looking forward to
collaborating with the community!

On Tue, Nov 13, 2018 at 9:09 AM Ryan Blue wrote:

> +1 (binding)

Re: [VOTE] Accept the Iceberg project for incubation

2018-11-13 Thread Ryan Blue
+1 (binding)

On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue  wrote:

> The discuss thread seems to have reached consensus, so I propose accepting
> the Iceberg project for incubation.
>
> The proposal is copied below and in the wiki:
> https://wiki.apache.org/incubator/IcebergProposal
>
> Please vote on whether to accept Iceberg in the next 72 hours:
>
> [ ] +1, accept Iceberg for incubation
> [ ] -1, reject the Iceberg proposal because . . .
>
> Thank you for reviewing the proposal and voting,
>
> rb
> --
> Iceberg Proposal Abstract
>
> Iceberg is a table format for large, slow-moving tabular data.
>
> It is designed to improve on the de-facto standard table layout built into
> Apache Hive, Presto, and Apache Spark.
> Proposal
>
> The purpose of Iceberg is to provide SQL-like tables that are backed by
> large sets of data files. Iceberg is similar to the Hive table layout, the
> de-facto standard structure used to track files in a table, but provides
> additional guarantees and performance optimizations:
>
>- Atomicity - Each change to the table is will be complete or will
>fail. “Do or do not. There is no try.”
>- Snapshot isolation - Reads use one and only one snapshot of a table
>at some time without holding a lock.
>- Safe schema evolution - A table’s schema can change in well-defined
>ways, without breaking older data files.
>- Column projection - An engine may request a subset of the available
>columns, including nested fields.
>- Predicate pushdown - An engine can push filters into read planning
>to improve performance using partition data and file-level statistics.
>
> Iceberg does NOT define a new file format. All data is stored in Apache
> Avro, Apache ORC, or Apache Parquet files.
>
> Additionally, Iceberg is designed to work well when data files are stored
> in cloud blob stores, even when those systems provide weaker guarantees
> than a file system, including:
>
>- Eventual consistency in the namespace
>- High latency for directory listings
>- No renames of objects
>- No folder hierarchy
>
> Rationale
>
> Initial benchmarks show dramatic improvements in query planning. For
> example, in Netflix’s Atlas use case, which stores time-series metrics from
> Netflix runtime systems and 1 month is stored across 2.7 million files in
> 2,688 partitions:
>
>- Hive table using Parquet:
>   - 400k+ splits, not combined
>   - Explain query: 9.6 minutes wall time (planning only)
>- Iceberg table with partition filtering:
>   - 15,218 splits, combined
>   - Planning: 10 seconds
>   - Query wall time: 13 minutes
>- Iceberg table with partition and min/max filtering:
>   - 412 splits
>   - Planning: 25 seconds
>   - Query wall time: 42 seconds
>
> These performance gains combined with the cross-engine compatibility are a
> very compelling story.
> Initial Goals
>
> The initial goal will be to move the existing codebase to Apache and
> integrate with the Apache development process and infrastructure. A primary
> goal of incubation will be to grow and diversify the Iceberg community. We
> are well aware that the project community is largely comprised of
> individuals from a single company. We aim to change that during incubation.
> Current Status
>
> As previously mentioned, Iceberg is under active development at Netflix,
> and is being used in processing large volumes of data in Amazon EC2.
>
> Iceberg license documentation is already based on Apache guidelines for
> LICENSE and NOTICE content.
> Meritocracy
>
> We value meritocracy and we understand that it is the basis for an open
> community that encourages multiple companies and individuals to contribute
> and be invested in the project’s future. We will encourage and monitor
> participation and make sure to extend privileges and responsibilities to
> all contributors.
> Community
>
> Iceberg is currently being used by developers at Netflix and a growing
> number of users are actively using it in production environments. Iceberg
> has received contributions from developers working at Hortonworks, WeWork,
> and Palantir. By bringing Iceberg to Apache we aim to assure current and
> future contributors that the Iceberg community is meritocratic and open, in
> order to broaden and diversity the user and developer community.
> Core Developers
>
> Iceberg was initially developed at Netflix and is under active
> development. We believe Netflix will be of interest to a broad range of
> users and developers and that incubating the project at the ASF will help
> us build a diverse, sustainable community.
> Alignment
>
> Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive, ORC,
> Parquet, Pig, and Spark. We anticipate integration with additional Apache
> projects as the Iceberg community and interest in the project grow.
> Known Risks
>
> Orphaned Products
>
> Netflix is committed to the future development of Iceberg and understands
> 

[VOTE] Accept the Iceberg project for incubation

2018-11-13 Thread Ryan Blue
The discuss thread seems to have reached consensus, so I propose accepting
the Iceberg project for incubation.

The proposal is copied below and in the wiki:
https://wiki.apache.org/incubator/IcebergProposal

Please vote on whether to accept Iceberg in the next 72 hours:

[ ] +1, accept Iceberg for incubation
[ ] -1, reject the Iceberg proposal because . . .

Thank you for reviewing the proposal and voting,

rb
--
Iceberg Proposal

Abstract

Iceberg is a table format for large, slow-moving tabular data.

It is designed to improve on the de-facto standard table layout built into
Apache Hive, Presto, and Apache Spark.
Proposal

The purpose of Iceberg is to provide SQL-like tables that are backed by
large sets of data files. Iceberg is similar to the Hive table layout, the
de-facto standard structure used to track files in a table, but provides
additional guarantees and performance optimizations:

   - Atomicity - Each change to the table will either complete or fail.
   “Do or do not. There is no try.”
   - Snapshot isolation - Reads use one and only one snapshot of a table at
   some time without holding a lock.
   - Safe schema evolution - A table’s schema can change in well-defined
   ways, without breaking older data files.
   - Column projection - An engine may request a subset of the available
   columns, including nested fields.
   - Predicate pushdown - An engine can push filters into read planning to
   improve performance using partition data and file-level statistics.
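
As a rough illustration of the predicate pushdown item above, the following
Python sketch prunes data files during planning using a partition value and
per-file min/max column statistics, before any file is opened. The DataFile
class, its fields, and plan_files are hypothetical names invented for this
example; they are not Iceberg's API and only assume that each file records a
partition value and a min/max range for one column.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    """Hypothetical per-file metadata tracked by the table (not Iceberg's API)."""
    path: str
    partition_day: str  # partition value, e.g. "2018-11-01"
    min_ts: int         # smallest value of column `ts` in the file
    max_ts: int         # largest value of column `ts` in the file


def plan_files(files: List[DataFile], day: str, ts_lo: int, ts_hi: int) -> List[DataFile]:
    """Return files that may contain rows with the given day and ts_lo <= ts <= ts_hi."""
    selected = []
    for f in files:
        # Partition filtering: skip files that belong to other partitions.
        if f.partition_day != day:
            continue
        # Min/max filtering: skip files whose value range cannot overlap the filter.
        if f.max_ts < ts_lo or f.min_ts > ts_hi:
            continue
        selected.append(f)
    return selected


# Made-up file metadata, for illustration only.
FILES = [
    DataFile("a.parquet", "2018-11-01", 0, 99),
    DataFile("b.parquet", "2018-11-01", 100, 199),
    DataFile("c.parquet", "2018-11-02", 0, 99),
]
```

With this metadata, a filter on day "2018-11-01" and ts between 150 and 180
keeps only b.parquet. The other two files are never read.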

Iceberg does NOT define a new file format. All data is stored in Apache
Avro, Apache ORC, or Apache Parquet files.

Additionally, Iceberg is designed to work well when data files are stored
in cloud blob stores, even when those systems provide weaker guarantees
than a file system, including:

   - Eventual consistency in the namespace
   - High latency for directory listings
   - No renames of objects
   - No folder hierarchy

Rationale

Initial benchmarks show dramatic improvements in query planning. For
example, Netflix’s Atlas use case stores time-series metrics from Netflix
runtime systems, and 1 month of data is stored across 2.7 million files in
2,688 partitions:

   - Hive table using Parquet:
      - 400k+ splits, not combined
      - Explain query: 9.6 minutes wall time (planning only)
   - Iceberg table with partition filtering:
      - 15,218 splits, combined
      - Planning: 10 seconds
      - Query wall time: 13 minutes
   - Iceberg table with partition and min/max filtering:
      - 412 splits
      - Planning: 25 seconds
      - Query wall time: 42 seconds
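
To put the figures above side by side, the short Python sketch below converts
them to seconds and computes the ratios. The variable names are chosen for
this example only, and the inputs are the numbers quoted in the list above. No
new measurements are introduced.

```python
# Figures quoted in the benchmark list, converted to seconds where needed.
hive_planning_s = 9.6 * 60          # Hive explain query, planning only
iceberg_planning_s = 10             # Iceberg with partition filtering
partition_query_s = 13 * 60         # Iceberg query, partition filtering
minmax_query_s = 42                 # Iceberg query, partition + min/max filtering
partition_splits = 15_218
minmax_splits = 412

# Ratios between the configurations.
planning_speedup = hive_planning_s / iceberg_planning_s
query_speedup = partition_query_s / minmax_query_s
split_reduction = partition_splits / minmax_splits

print(f"planning: {planning_speedup:.1f}x faster than Hive")
print(f"query wall time: {query_speedup:.1f}x faster with min/max filtering")
print(f"splits: {split_reduction:.1f}x fewer with min/max filtering")
```

Planning is about 58 times faster than the Hive explain query. Adding min/max
filtering cuts query wall time by about 19 times and the split count by about
37 times relative to partition filtering alone.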

These performance gains combined with the cross-engine compatibility are a
very compelling story.
Initial Goals

The initial goal will be to move the existing codebase to Apache and
integrate with the Apache development process and infrastructure. A primary
goal of incubation will be to grow and diversify the Iceberg community. We
are well aware that the project community is largely comprised of
individuals from a single company. We aim to change that during incubation.
Current Status

As previously mentioned, Iceberg is under active development at Netflix,
and is being used in processing large volumes of data in Amazon EC2.

Iceberg license documentation is already based on Apache guidelines for
LICENSE and NOTICE content.
Meritocracy

We value meritocracy and we understand that it is the basis for an open
community that encourages multiple companies and individuals to contribute
and be invested in the project’s future. We will encourage and monitor
participation and make sure to extend privileges and responsibilities to
all contributors.
Community

Iceberg is currently being used by developers at Netflix and a growing
number of users are actively using it in production environments. Iceberg
has received contributions from developers working at Hortonworks, WeWork,
and Palantir. By bringing Iceberg to Apache we aim to assure current and
future contributors that the Iceberg community is meritocratic and open, in
order to broaden and diversify the user and developer community.
Core Developers

Iceberg was initially developed at Netflix and is under active development.
We believe Iceberg will be of interest to a broad range of users and
developers and that incubating the project at the ASF will help us build a
diverse, sustainable community.
Alignment

Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive, ORC,
Parquet, Pig, and Spark. We anticipate integration with additional Apache
projects as the Iceberg community and interest in the project grow.
Known Risks

Orphaned Products

Netflix is committed to the future development of Iceberg and understands
that graduation to a TLP, while preferable, is not the only positive
outcome of incubation.

Should the Iceberg project be accepted by the Incubator, the prospective
PPMC would be willing to agree to a target incubation period of 2 years or
less, knowing that every