[RESULT] [VOTE] Accept the Iceberg project for incubation
The vote passes with 13 binding +1 and 5 non-binding +1 votes. Thank you for
voting, everyone! I'll get started with the next steps.

+1 votes:

Ryan Blue*
Matt Sicker*
Felix Cheung
Dave Fisher*
Owen O'Malley*
Hugo Louro
Arthur Wiedmer
Julian Hyde*
Kevin A. McGrail*
Willem Jiang*
James Taylor*
Uwe Korn
Lars Francke*
Jean-Baptiste Onofré*
Olivier Lamy*
Michael Wall*
Kenneth Knowles
Julien Le Dem*

* = binding

On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue wrote:

> The discuss thread seems to have reached consensus, so I propose accepting
> the Iceberg project for incubation.
>
> The proposal is copied below and in the wiki:
> https://wiki.apache.org/incubator/IcebergProposal
>
> Please vote on whether to accept Iceberg in the next 72 hours:
>
> [ ] +1, accept Iceberg for incubation
> [ ] -1, reject the Iceberg proposal because . . .
>
> Thank you for reviewing the proposal and voting,
>
> rb
> --
> Iceberg Proposal
> Abstract
>
> Iceberg is a table format for large, slow-moving tabular data.
>
> It is designed to improve on the de-facto standard table layout built into
> Apache Hive, Presto, and Apache Spark.
> Proposal
>
> The purpose of Iceberg is to provide SQL-like tables that are backed by
> large sets of data files. Iceberg is similar to the Hive table layout, the
> de-facto standard structure used to track files in a table, but provides
> additional guarantees and performance optimizations:
>
> - Atomicity - Each change to the table will be complete or will
>   fail. “Do or do not. There is no try.”
> - Snapshot isolation - Reads use one and only one snapshot of a table
>   at some time without holding a lock.
> - Safe schema evolution - A table’s schema can change in well-defined
>   ways, without breaking older data files.
> - Column projection - An engine may request a subset of the available
>   columns, including nested fields.
> - Predicate pushdown - An engine can push filters into read planning
>   to improve performance using partition data and file-level statistics.
>
> Iceberg does NOT define a new file format. All data is stored in Apache
> Avro, Apache ORC, or Apache Parquet files.
>
> Additionally, Iceberg is designed to work well when data files are stored
> in cloud blob stores, even when those systems provide weaker guarantees
> than a file system, including:
>
> - Eventual consistency in the namespace
> - High latency for directory listings
> - No renames of objects
> - No folder hierarchy
>
> Rationale
>
> Initial benchmarks show dramatic improvements in query planning. For
> example, in Netflix’s Atlas use case, which stores time-series metrics from
> Netflix runtime systems, 1 month of data is stored across 2.7 million files
> in 2,688 partitions:
>
> - Hive table using Parquet:
>   - 400k+ splits, not combined
>   - Explain query: 9.6 minutes wall time (planning only)
> - Iceberg table with partition filtering:
>   - 15,218 splits, combined
>   - Planning: 10 seconds
>   - Query wall time: 13 minutes
> - Iceberg table with partition and min/max filtering:
>   - 412 splits
>   - Planning: 25 seconds
>   - Query wall time: 42 seconds
>
> These performance gains combined with the cross-engine compatibility are a
> very compelling story.
> Initial Goals
>
> The initial goal will be to move the existing codebase to Apache and
> integrate with the Apache development process and infrastructure. A primary
> goal of incubation will be to grow and diversify the Iceberg community. We
> are well aware that the project community is largely comprised of
> individuals from a single company. We aim to change that during incubation.
> Current Status
>
> As previously mentioned, Iceberg is under active development at Netflix,
> and is being used in processing large volumes of data in Amazon EC2.
>
> Iceberg license documentation is already based on Apache guidelines for
> LICENSE and NOTICE content.
> Meritocracy
>
> We value meritocracy and we understand that it is the basis for an open
> community that encourages multiple companies and individuals to contribute
> and be invested in the project’s future. We will encourage and monitor
> participation and make sure to extend privileges and responsibilities to
> all contributors.
> Community
>
> Iceberg is currently being used by developers at Netflix and a growing
> number of users are actively using it in production environments. Iceberg
> has received contributions from developers working at Hortonworks, WeWork,
> and Palantir. By bringing Iceberg to Apache we aim to assure current and
> future contributors that the Iceberg community is meritocratic and open, in
> order to broaden and diversify the user and developer community.
> Core Developers
>
> Iceberg was initially developed at Netflix and is under active
> development. We believe Netflix will be of interest to a broad range of
> users and developers and that incubating the project at the ASF
Re: [VOTE] Accept the Iceberg project for incubation
>
> +1
>
> From: Kenneth Knowles
> Date: Thu, Nov 15, 2018 at 10:01 AM
> Subject: Re: [VOTE] Accept the Iceberg project for incubation
> To:
>
> +1 (non-binding)
>
> On Thu, Nov 15, 2018 at 9:57 AM Michael Wall wrote:
>
> > +1 (binding)
> >
> > On Thu, Nov 15, 2018 at 3:03 AM Olivier Lamy wrote:
> >
> > > +1
Re: [VOTE] Accept the Iceberg project for incubation
+1 (non-binding)

On Thu, Nov 15, 2018 at 9:57 AM Michael Wall wrote:

> +1 (binding)
>
> On Thu, Nov 15, 2018 at 3:03 AM Olivier Lamy wrote:
>
> > +1
Re: [VOTE] Accept the Iceberg project for incubation
+1 (binding)

On Thu, Nov 15, 2018 at 3:03 AM Olivier Lamy wrote:

> +1
Re: [VOTE] Accept the Iceberg project for incubation
+1

On Wed, 14 Nov 2018 at 03:07, Ryan Blue wrote:

> The discuss thread seems to have reached consensus, so I propose accepting
> the Iceberg project for incubation.
>
> The proposal is copied below and in the wiki:
> https://wiki.apache.org/incubator/IcebergProposal
>
> Please vote on whether to accept Iceberg in the next 72 hours:
>
> [ ] +1, accept Iceberg for incubation
> [ ] -1, reject the Iceberg proposal because . . .
>
> Thank you for reviewing the proposal and voting,
>
> rb
> --
> Iceberg Proposal
> Abstract
>
> Iceberg is a table format for large, slow-moving tabular data.
>
> It is designed to improve on the de-facto standard table layout built into
> Apache Hive, Presto, and Apache Spark.
> Proposal
>
> The purpose of Iceberg is to provide SQL-like tables that are backed by
> large sets of data files. Iceberg is similar to the Hive table layout, the
> de-facto standard structure used to track files in a table, but provides
> additional guarantees and performance optimizations:
>
> - Atomicity - Each change to the table will be complete or will fail.
>   “Do or do not. There is no try.”
> - Snapshot isolation - Reads use one and only one snapshot of a table at
>   some time without holding a lock.
> - Safe schema evolution - A table’s schema can change in well-defined
>   ways, without breaking older data files.
> - Column projection - An engine may request a subset of the available
>   columns, including nested fields.
> - Predicate pushdown - An engine can push filters into read planning to
>   improve performance using partition data and file-level statistics.
>
> Iceberg does NOT define a new file format. All data is stored in Apache
> Avro, Apache ORC, or Apache Parquet files.
>
> Additionally, Iceberg is designed to work well when data files are stored
> in cloud blob stores, even when those systems provide weaker guarantees
> than a file system, including:
>
> - Eventual consistency in the namespace
> - High latency for directory listings
> - No renames of objects
> - No folder hierarchy
>
> Rationale
>
> Initial benchmarks show dramatic improvements in query planning. For
> example, in Netflix’s Atlas use case, which stores time-series metrics from
> Netflix runtime systems, 1 month of data is stored across 2.7 million files
> in 2,688 partitions:
>
> - Hive table using Parquet:
>   - 400k+ splits, not combined
>   - Explain query: 9.6 minutes wall time (planning only)
> - Iceberg table with partition filtering:
>   - 15,218 splits, combined
>   - Planning: 10 seconds
>   - Query wall time: 13 minutes
> - Iceberg table with partition and min/max filtering:
>   - 412 splits
>   - Planning: 25 seconds
>   - Query wall time: 42 seconds
>
> These performance gains combined with the cross-engine compatibility are a
> very compelling story.
> Initial Goals
>
> The initial goal will be to move the existing codebase to Apache and
> integrate with the Apache development process and infrastructure. A primary
> goal of incubation will be to grow and diversify the Iceberg community. We
> are well aware that the project community is largely comprised of
> individuals from a single company. We aim to change that during incubation.
> Current Status
>
> As previously mentioned, Iceberg is under active development at Netflix,
> and is being used in processing large volumes of data in Amazon EC2.
>
> Iceberg license documentation is already based on Apache guidelines for
> LICENSE and NOTICE content.
> Meritocracy
>
> We value meritocracy and we understand that it is the basis for an open
> community that encourages multiple companies and individuals to contribute
> and be invested in the project’s future. We will encourage and monitor
> participation and make sure to extend privileges and responsibilities to
> all contributors.
> Community
>
> Iceberg is currently being used by developers at Netflix and a growing
> number of users are actively using it in production environments. Iceberg
> has received contributions from developers working at Hortonworks, WeWork,
> and Palantir. By bringing Iceberg to Apache we aim to assure current and
> future contributors that the Iceberg community is meritocratic and open, in
> order to broaden and diversify the user and developer community.
> Core Developers
>
> Iceberg was initially developed at Netflix and is under active development.
> We believe Netflix will be of interest to a broad range of users and
> developers and that incubating the project at the ASF will help us build a
> diverse, sustainable community.
> Alignment
>
> Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive, ORC,
> Parquet, Pig, and Spark. We anticipate integration with additional Apache
> projects as the Iceberg community and interest in the project grows.
> Known Risks
> Orphaned Products
>
> Netflix is committed to the future development of Iceberg and understands
> that
Re: [VOTE] Accept the Iceberg project for incubation
Quick update: James Taylor has offered to mentor the project as well, so
I've added him to the list. Thanks, James!
Re: [VOTE] Accept the Iceberg project for incubation
+1 (binding)

Regards
JB
Re: [VOTE] Accept the Iceberg project for incubation
+1 (binding)

On Wed, Nov 14, 2018 at 7:48 AM Uwe L. Korn wrote:

> +1 (non-binding)
>
> Great to see this here!
>
> > On 14.11.2018 at 04:07, James Taylor wrote:
> >
> > +1 (binding)
> >
> >> On Tue, Nov 13, 2018 at 4:15 PM Willem Jiang wrote:
> >>
> >> +1 (binding)
> >>
> >> Willem Jiang
> >>
> >> Twitter: willemjiang
> >> Weibo: 姜宁willem
> >>
> >>> On Wed, Nov 14, 2018 at 1:07 AM Ryan Blue wrote:
> >>>
> >>> The discuss thread seems to have reached consensus, so I propose
> >>> accepting the Iceberg project for incubation.
Re: [VOTE] Accept the Iceberg project for incubation
+1 (non-binding)

Great to see this here!

> On 14.11.2018 at 04:07, James Taylor wrote:
>
> +1 (binding)
>
>> On Tue, Nov 13, 2018 at 4:15 PM Willem Jiang wrote:
>>
>> +1 (binding)
>>
>> Willem Jiang
>>
>> Twitter: willemjiang
>> Weibo: 姜宁willem
>>
>>> On Wed, Nov 14, 2018 at 1:07 AM Ryan Blue wrote:
>>>
>>> The discuss thread seems to have reached consensus, so I propose
>>> accepting the Iceberg project for incubation.
Re: [VOTE] Accept the Iceberg project for incubation
+1 (binding)

On Tue, Nov 13, 2018 at 4:15 PM Willem Jiang wrote:

> +1 (binding)
>
> Willem Jiang
>
> Twitter: willemjiang
> Weibo: 姜宁willem
>
> On Wed, Nov 14, 2018 at 1:07 AM Ryan Blue wrote:
> >
> > The discuss thread seems to have reached consensus, so I propose
> > accepting the Iceberg project for incubation.
Re: [VOTE] Accept the Iceberg project for incubation
+1 (binding)

Willem Jiang

Twitter: willemjiang
Weibo: 姜宁willem

On Wed, Nov 14, 2018 at 1:07 AM Ryan Blue wrote:
>
> The discuss thread seems to have reached consensus, so I propose accepting
> the Iceberg project for incubation.
Re: [VOTE] Accept the Iceberg project for incubation
+1 (binding)

On 11/13/2018 12:40 PM, Julian Hyde wrote:
> +1 (binding)
>
> Julian
>
>> On Nov 13, 2018, at 9:28 AM, Arthur Wiedmer wrote:
>>
>> +1
>>
>> (Non-binding)
>>
>> Best,
>> Arthur
>>
>> On Tue, Nov 13, 2018, 09:24 Hugo Louro wrote:
>>
>>> +1 (non-binding)
>>>
>>>> On Nov 13, 2018, at 9:19 AM, Owen O'Malley wrote:
>>>>
>>>> +1 (binding)
>>>>
>>>>> On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher wrote:
>>>>>
>>>>> +1 (binding)
>>>>>
>>>>>> On Nov 13, 2018, at 9:10 AM, Matt Sicker wrote:
>>>>>>
>>>>>> +1 binding
>>>>>>
>>>>>>> On Tue, 13 Nov 2018 at 11:09, Ryan Blue wrote:
>>>>>>>
>>>>>>> +1 (binding)
>>>>>>>
>>>>>>>> On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue wrote:
>>>>>>>>
>>>>>>>> The discuss thread seems to have reached consensus, so I propose
>>>>>>>> accepting the Iceberg project for incubation.
Re: [VOTE] Accept the Iceberg project for incubation
+1 (binding)

Julian

> On Nov 13, 2018, at 9:28 AM, Arthur Wiedmer wrote:
>
> +1
>
> (Non-binding)
>
> Best,
> Arthur
>
> On Tue, Nov 13, 2018, 09:24 Hugo Louro wrote:
>
>> +1 (non-binding)
>>
>>> On Nov 13, 2018, at 9:19 AM, Owen O'Malley wrote:
>>>
>>> +1 (binding)
>>>
>>>> On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher wrote:
>>>>
>>>> +1 (binding)
>>>>
>>>>> On Nov 13, 2018, at 9:10 AM, Matt Sicker wrote:
>>>>>
>>>>> +1 binding
>>>>>
>>>>>> On Tue, 13 Nov 2018 at 11:09, Ryan Blue wrote:
>>>>>>
>>>>>> +1 (binding)
>>>>>>
>>>>>>> On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue wrote:
>>>>>>>
>>>>>>> The discuss thread seems to have reached consensus, so I propose
>>>>>>> accepting the Iceberg project for incubation.
Re: [VOTE] Accept the Iceberg project for incubation
+1

(Non-binding)

Best,
Arthur

On Tue, Nov 13, 2018, 09:24 Hugo Louro wrote:

> +1 (non-binding)
>
> > On Nov 13, 2018, at 9:19 AM, Owen O'Malley wrote:
> >
> > +1 (binding)
> >
> >> On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher wrote:
> >>
> >> +1 (binding)
> >>
> >>> On Nov 13, 2018, at 9:10 AM, Matt Sicker wrote:
> >>>
> >>> +1 binding
> >>>
> >>>> On Tue, 13 Nov 2018 at 11:09, Ryan Blue wrote:
> >>>>
> >>>> +1 (binding)
> >>>>
> >>>>> On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue wrote:
> >>>>>
> >>>>> The discuss thread seems to have reached consensus, so I propose
> >>>>> accepting the Iceberg project for incubation.
Re: [VOTE] Accept the Iceberg project for incubation
+1 (non-binding)

> On Nov 13, 2018, at 9:19 AM, Owen O'Malley wrote:
>
> +1 (binding)
Re: [VOTE] Accept the Iceberg project for incubation
+1 (binding)

On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher wrote:
> +1 (binding)
Re: [VOTE] Accept the Iceberg project for incubation
+1 (binding)

> On Nov 13, 2018, at 9:10 AM, Matt Sicker wrote:
>
> +1 binding
Re: [VOTE] Accept the Iceberg project for incubation
+1 binding

On Tue, 13 Nov 2018 at 11:09, Ryan Blue wrote:
> +1 (binding)
Re: [VOTE] Accept the Iceberg project for incubation
+1 (non-binding)

Awesome to see this taken forward to the incubator, and looking forward
to collaborating with the community!

On Tue, Nov 13, 2018 at 9:09 AM Ryan Blue wrote:
> +1 (binding)
Re: [VOTE] Accept the Iceberg project for incubation
+1 (binding)

On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue wrote:
> The discuss thread seems to have reached consensus, so I propose accepting
> the Iceberg project for incubation.
[VOTE] Accept the Iceberg project for incubation
The discuss thread seems to have reached consensus, so I propose accepting
the Iceberg project for incubation.

The proposal is copied below and in the wiki:
https://wiki.apache.org/incubator/IcebergProposal

Please vote on whether to accept Iceberg in the next 72 hours:

[ ] +1, accept Iceberg for incubation
[ ] -1, reject the Iceberg proposal because . . .

Thank you for reviewing the proposal and voting,

rb

--
Iceberg Proposal

Abstract

Iceberg is a table format for large, slow-moving tabular data.

It is designed to improve on the de-facto standard table layout built into
Apache Hive, Presto, and Apache Spark.

Proposal

The purpose of Iceberg is to provide SQL-like tables that are backed by
large sets of data files. Iceberg is similar to the Hive table layout, the
de-facto standard structure used to track files in a table, but provides
additional guarantees and performance optimizations:

- Atomicity - Each change to the table will either complete or fail.
  “Do or do not. There is no try.”
- Snapshot isolation - Reads use one and only one snapshot of a table
  at some time without holding a lock.
- Safe schema evolution - A table’s schema can change in well-defined
  ways, without breaking older data files.
- Column projection - An engine may request a subset of the available
  columns, including nested fields.
- Predicate pushdown - An engine can push filters into read planning
  to improve performance using partition data and file-level statistics.

Iceberg does NOT define a new file format. All data is stored in Apache
Avro, Apache ORC, or Apache Parquet files.

Additionally, Iceberg is designed to work well when data files are stored
in cloud blob stores, even when those systems provide weaker guarantees
than a file system, including:

- Eventual consistency in the namespace
- High latency for directory listings
- No renames of objects
- No folder hierarchy

Rationale

Initial benchmarks show dramatic improvements in query planning.
For example, in Netflix’s Atlas use case, which stores time-series metrics
from Netflix runtime systems, 1 month of data is stored across 2.7 million
files in 2,688 partitions:

- Hive table using Parquet:
  - 400k+ splits, not combined
  - Explain query: 9.6 minutes wall time (planning only)
- Iceberg table with partition filtering:
  - 15,218 splits, combined
  - Planning: 10 seconds
  - Query wall time: 13 minutes
- Iceberg table with partition and min/max filtering:
  - 412 splits
  - Planning: 25 seconds
  - Query wall time: 42 seconds

These performance gains, combined with the cross-engine compatibility, are
a very compelling story.

Initial Goals

The initial goal will be to move the existing codebase to Apache and
integrate with the Apache development process and infrastructure. A primary
goal of incubation will be to grow and diversify the Iceberg community. We
are well aware that the project community is largely composed of
individuals from a single company. We aim to change that during incubation.

Current Status

As previously mentioned, Iceberg is under active development at Netflix,
and is being used in processing large volumes of data in Amazon EC2.

Iceberg license documentation is already based on Apache guidelines for
LICENSE and NOTICE content.

Meritocracy

We value meritocracy and we understand that it is the basis for an open
community that encourages multiple companies and individuals to contribute
and be invested in the project’s future. We will encourage and monitor
participation and make sure to extend privileges and responsibilities to
all contributors.

Community

Iceberg is currently being used by developers at Netflix and a growing
number of users are actively using it in production environments. Iceberg
has received contributions from developers working at Hortonworks, WeWork,
and Palantir.
By bringing Iceberg to Apache we aim to assure current and future
contributors that the Iceberg community is meritocratic and open, in
order to broaden and diversify the user and developer community.

Core Developers

Iceberg was initially developed at Netflix and is under active
development. We believe Iceberg will be of interest to a broad range of
users and developers and that incubating the project at the ASF will help
us build a diverse, sustainable community.

Alignment

Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive, ORC,
Parquet, Pig, and Spark. We anticipate integration with additional Apache
projects as the Iceberg community and interest in the project grow.

Known Risks

Orphaned Products

Netflix is committed to the future development of Iceberg and understands
that graduation to a TLP, while preferable, is not the only positive
outcome of incubation. Should the Iceberg project be accepted by the
Incubator, the prospective PPMC would be willing to agree to a target
incubation period of 2 years or less, knowing that every