Re: [RESULT] [VOTE] Accept Hudi into the Apache Incubator

2019-01-18 Thread Vinoth Govindarajan
+1


On 2019/01/17 19:05:08, Thomas Weise wrote:

[RESULT] [VOTE] Accept Hudi into the Apache Incubator

2019-01-17 Thread Thomas Weise
The vote for accepting Hudi into the Apache Incubator passes with 11
binding +1 votes, 5 non-binding +1 votes and no other votes.

Thanks for voting!

+1 votes:

Luciano Resende*
Pierre Smits
Suneel Marthi*
Felix Cheung*
Kenneth Knowles*
Mohammad Islam
Mayank Bansal
Jakob Homan*
Akira Ajisaka*
Gosling Von*
Matt Sicker*
Brahma Reddy Battula
Hongtao Gao
Vinayakumar B*
Furkan Kamaci*
Thomas Weise*

* = binding


On Sun, Jan 13, 2019 at 2:34 PM Thomas Weise wrote:

> Hi all,
>
> Following the discussion of the Hudi proposal in [1], this is a vote
> on accepting Hudi into the Apache Incubator,
> per the ASF policy [2] and voting rules [3].
>
> A vote for accepting a new Apache Incubator podling is a
> majority vote. Everyone is welcome to vote, only
> Incubator PMC member votes are binding.
>
> This vote will run for at least 72 hours. Please VOTE as
> follows:
>
> [ ] +1 Accept Hudi into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept Hudi into the Apache Incubator because ...
>
> The proposal is included below, but you can also access it on
> the wiki [4].
>
> Thanks for reviewing and voting,
> Thomas
>
> [1]
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
>
> [2]
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
>
> [3] http://www.apache.org/foundation/voting.html
>
> [4] https://wiki.apache.org/incubator/HudiProposal
>
>
>
> = Hudi Proposal =
>
> == Abstract ==
>
> Hudi is a big-data storage library that provides atomic upserts and
> incremental data streams.
>
> Hudi manages data stored in Apache Hadoop and other API-compatible
> distributed file systems/cloud stores.
>
> == Proposal ==
>
> Hudi provides the ability to atomically upsert datasets with new values in
> near-real time, making data available quickly to existing query engines
> like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> sequence of changes to a dataset from a given point in time, enabling
> incremental data pipelines that yield greater efficiency & lower latency
> than their typical batch counterparts. By carefully managing the number &
> size of files, Hudi greatly aids both query engines (e.g., always providing
> well-sized files) and underlying storage (e.g., HDFS NameNode memory
> consumption).
>
> Hudi is largely implemented as an Apache Spark library that reads/writes
> data from/to a Hadoop-compatible filesystem. SQL queries on Hudi datasets
> are supported via specialized Apache Hadoop input formats that understand
> Hudi’s storage layout. Currently, Hudi manages datasets using a combination
> of the Apache Parquet & Apache Avro file/serialization formats.
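The record-level upsert primitive described above can be illustrated with a plain-Python sketch. This is a conceptual illustration of upsert semantics only, not Hudi's actual API; the `id` record key and the `fare` field are hypothetical:

```python
def upsert(dataset, updates, key="id"):
    """Merge `updates` into `dataset`: records whose key already exists are
    replaced, records with a new key are inserted (upsert semantics)."""
    merged = {row[key]: row for row in dataset}  # index existing records by key
    for row in updates:
        merged[row[key]] = row  # overwrite existing key or insert a new one
    return list(merged.values())


existing = [{"id": 1, "fare": 10.0}, {"id": 2, "fare": 20.0}]
changes = [{"id": 2, "fare": 25.0}, {"id": 3, "fare": 30.0}]
result = upsert(existing, changes)
# ids 1, 2, 3 all present; id 2 now carries the updated fare of 25.0
```

In Hudi the same merge happens against Parquet files on a Hadoop-compatible filesystem rather than in memory, which is what makes the operation atomic and visible to query engines.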
>
> == Background ==
>
> The Apache Hadoop Distributed File System (HDFS) & other compatible cloud
> storage systems (e.g., Amazon S3, Google Cloud, Microsoft Azure) serve as
> longer-term analytical storage for thousands of organizations. Typical
> analytical datasets are built by reading data from a source (e.g., upstream
> databases, messaging buses, or other datasets), transforming the data,
> writing the results back to storage, & making them available for analytical
> queries--all of this typically accomplished in batch jobs that operate in
> bulk on partitions of datasets. Such a style of processing typically
> incurs large delays in making data available to queries, as well as a lot
> of complexity in carefully partitioning datasets to guarantee latency SLAs.
>
> The need for fresher/faster analytics has increased enormously in the past
> few years, as evidenced by the popularity of stream processing systems like
> Apache Spark and Apache Flink, and messaging systems like Apache Kafka. By
> using an updateable state store to incrementally compute & instantly
> reflect new results to queries, and a “tailable” messaging bus to publish
> these results to other downstream jobs, such systems employ a different
> approach to building analytical datasets. Even though this approach yields
> low latency, the amount of data managed in such real-time data marts is
> typically limited in comparison to the aforementioned longer-term storage
> options. As a result, the overall data architecture has become more
> complex, with more moving parts and specialized systems, leading to
> duplication of data and a strain on usability.
>
> Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
> to streaming systems, we simply add the streaming primitives (upserts &
> incremental consumption) onto existing batch processing technologies. We
> believe that by adding some missing blocks to an existing Hadoop stack, we
> can provide similar capabilities right on top of Hadoop at reduced cost
> and increased efficiency, greatly simplifying the overall architecture in
> the process.
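Incremental consumption, the second streaming primitive named above, can likewise be sketched in plain Python. This is a conceptual illustration, not Hudi's API; the `commit_time` field and checkpoint value are hypothetical:

```python
def incremental_read(records, last_commit_time):
    """Return only records committed after the consumer's checkpoint,
    mimicking incremental-consumption semantics: each downstream job
    remembers the last commit it saw and pulls only newer changes."""
    return [r for r in records if r["commit_time"] > last_commit_time]


log = [
    {"id": 1, "commit_time": 100},
    {"id": 2, "commit_time": 200},
    {"id": 3, "commit_time": 300},
]
new_rows = incremental_read(log, last_commit_time=150)
# only ids 2 and 3 are returned; id 1 was already consumed
```

This is what lets a downstream pipeline process only the changed records since its last run, instead of rescanning entire partitions the way a typical batch job would.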
>
> Hudi was originally developed at Uber (original name “Hoodie”) to address
> such broad inefficiencies in ingest, ETL, & ML pipelines across