The vote for accepting Hudi into the Apache Incubator passes with 11
binding +1 votes, 5 non-binding +1 votes and no other votes.
Thanks for voting!
+1 votes:
Luciano Resende*
Pierre Smits
Suneel Marthi*
Felix Cheung*
Kenneth Knowles*
Mohammad Islam
Mayank Bansal
Jakob Homan*
Akira Ajisaka*
Gosling Von*
Matt Sicker*
Brahma Reddy Battula
Hongtao Gao
Vinayakumar B*
Furkan Kamaci*
Thomas Weise*
* = binding
On Sun, Jan 13, 2019 at 2:34 PM Thomas Weise wrote:
> Hi all,
>
> Following the discussion of the Hudi proposal in [1], this is a vote
> on accepting Hudi into the Apache Incubator,
> per the ASF policy [2] and voting rules [3].
>
> A vote for accepting a new Apache Incubator podling is a
> majority vote. Everyone is welcome to vote, only
> Incubator PMC member votes are binding.
>
> This vote will run for at least 72 hours. Please VOTE as
> follows:
>
> [ ] +1 Accept Hudi into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept Hudi into the Apache Incubator because ...
>
> The proposal is included below, but you can also access it on
> the wiki [4].
>
> Thanks for reviewing and voting,
> Thomas
>
> [1]
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
>
> [2]
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
>
> [3] http://www.apache.org/foundation/voting.html
>
> [4] https://wiki.apache.org/incubator/HudiProposal
>
>
>
> = Hudi Proposal =
>
> == Abstract ==
>
> Hudi is a big-data storage library, that provides atomic upserts and
> incremental data streams.
>
> Hudi manages data stored in Apache Hadoop and other API compatible
> distributed file systems/cloud stores.
>
> == Proposal ==
>
> Hudi provides the ability to atomically upsert datasets with new values in
> near-real time, making data available quickly to existing query engines
> like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> sequence of changes to a dataset from a given point-in-time to enable
> incremental data pipelines that yield greater efficiency & latency than
> their typical batch counterparts. By carefully managing number of files &
> sizes, Hudi greatly aids both query engines (e.g: always providing
> well-sized files) and underlying storage (e.g: HDFS NameNode memory
> consumption).
>
> Hudi is largely implemented as an Apache Spark library that reads/writes
> data from/to Hadoop compatible filesystem. SQL queries on Hudi datasets are
> supported via specialized Apache Hadoop input formats, that understand
> Hudi’s storage layout. Currently, Hudi manages datasets using a combination
> of Apache Parquet & Apache Avro file/serialization formats.
>
> == Background ==
>
> Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as
> longer term analytical storage for thousands of organizations. Typical
> analytical datasets are built by reading data from a source (e.g: upstream
> databases, messaging buses, or other datasets), transforming the data,
> writing results back to storage, & making it available for analytical
> queries--all of this typically accomplished in batch jobs which operate in
> a bulk fashion on partitions of datasets. Such a style of processing
> typically incurs large delays in making data available to queries as well
> as lot of complexity in carefully partitioning datasets to guarantee
> latency SLAs.
>
> The need for fresher/faster analytics has increased enormously in the past
> few years, as evidenced by the popularity of Stream processing systems like
> Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> using updateable state store to incrementally compute & instantly reflect
> new results to queries and using a “tailable” messaging bus to publish
> these results to other downstream jobs, such systems employ a different
> approach to building analytical dataset. Even though this approach yields
> low latency, the amount of data managed in such real-time data-marts is
> typically limited in comparison to the aforementioned longer term storage
> options. As a result, the overall data architecture has become more complex
> with more moving parts and specialized systems, leading to duplication of
> data and a strain on usability.
>
> Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
> to streaming systems, we simply add the streaming primitives (upserts &
> incremental consumption) onto existing batch processing technologies. We
> believe that by adding some missing blocks to an existing Hadoop stack, we
> are able to a provide similar capabilities right on top of Hadoop at a
> reduced cost and with an increased efficiency, greatly simplifying the
> overall architecture in the process.
>
> Hudi was originally developed at Uber (original name “Hoodie”) to address
> such broad inefficiencies in ingest & ETL & ML pipelines across