Re: [ANNOUNCE] Apache Hudi 0.9.0 released

Pratyaksh Sharma Wed, 01 Sep 2021 11:49:51 -0700

Great news! This one really feels like a major release with so many good
features getting added. :)


On Wed, Sep 1, 2021 at 7:19 AM Udit Mehrotra <udi...@apache.org> wrote:

> The Apache Hudi team is pleased to announce the release of Apache Hudi
> 0.9.0.
>
> This release comes almost 5 months after 0.8.0. It includes 387 resolved
> issues, comprising new features as well as
> general improvements and bug-fixes. Here are a few quick highlights:
>
> *Spark SQL DML and DDL Support*
> We have added experimental support for DDL/DML using Spark SQL, taking a
> huge step towards making Hudi more
> easily accessible and operable by all personas (non-engineers, analysts,
> etc.). Users can now use SQL statements like
> "CREATE TABLE ... USING HUDI" and "CREATE TABLE ... AS SELECT" to
> create/manage tables in catalogs like Hive,
> and "INSERT", "INSERT OVERWRITE", "UPDATE", "MERGE INTO" and "DELETE"
> statements to manipulate data.
> For more information, check out our docs here
> <https://hudi.apache.org/docs/quick-start-guide> and click on the SparkSQL
> tab.
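A minimal sketch of the statement shapes named above, written as plain Python strings so it runs without a Spark session. The table name, columns, values and the OPTIONS keys (type, primaryKey, preCombineField) are assumptions made for illustration and are not taken from the announcement; check them against the quick start guide before relying on them.

```python
from typing import List

# Hypothetical table name; only the statement keywords come from the announcement.
TABLE = "hudi_demo"


def hudi_sql_statements(table: str = TABLE) -> List[str]:
    """Return example DDL/DML statements in the order they would be run."""
    return [
        # DDL: create a Hudi table (OPTIONS keys are assumed, see lead-in).
        f"CREATE TABLE {table} (id INT, name STRING, price DOUBLE, ts BIGINT) "
        "USING hudi OPTIONS (type = 'cow', primaryKey = 'id', preCombineField = 'ts')",
        # DML: insert, update, merge and delete.
        f"INSERT INTO {table} VALUES (1, 'a1', 10.0, 1000)",
        f"UPDATE {table} SET price = price * 2 WHERE id = 1",
        f"MERGE INTO {table} AS t "
        "USING (SELECT 1 AS id, 'a1' AS name, 30.0 AS price, 1001 AS ts) AS s "
        "ON t.id = s.id "
        "WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *",
        f"DELETE FROM {table} WHERE id = 1",
    ]


if __name__ == "__main__":
    for stmt in hudi_sql_statements():
        # With a Hudi-enabled Spark session this would be: spark.sql(stmt)
        print(stmt)
```

Each string would be passed to spark.sql(...) in a session started with the Hudi bundle on the classpath; that part is left out here because it depends on the local Spark setup.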
>
> *Query Side Improvements*
> Hudi tables are now registered with Hive as Spark datasource tables,
> meaning Spark SQL on these tables now uses the
> datasource as well, instead of relying on the Hive fallbacks within Spark,
> which are poorly maintained and cumbersome. This
> unlocks many optimizations such as the use of Hudi's own FileIndex
> <https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala#L46>
> implementation for optimized caching and the use
> of the Hudi metadata table for faster listing of large tables. We have
> also added support for time travel queries
> <https://hudi.apache.org/docs/quick-start-guide#time-travel-query> for the
> Spark datasource.
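A small sketch of how a time travel read might be configured from PySpark. It assumes the reader option is named "as.of.instant", as in the linked quick start section; the helper name and the instant value are hypothetical, and the Spark call itself is shown only as a comment so the snippet runs on its own.

```python
from typing import Dict


def time_travel_options(instant: str) -> Dict[str, str]:
    """Build reader options for a point-in-time query.

    `instant` is a commit time string such as "20210728141108"
    (hypothetical value used only for illustration).
    """
    if not instant.strip():
        raise ValueError("instant must be a non-empty commit time string")
    # Option name assumed from the quick start guide's time travel section.
    return {"as.of.instant": instant.strip()}


if __name__ == "__main__":
    opts = time_travel_options("20210728141108")
    # With a Hudi-enabled session and a table at base_path:
    #   spark.read.format("hudi").options(**opts).load(base_path)
    print(opts)
```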
>
> *Writer Side Improvements*
> This release has several major writer side improvements. Virtual key
> support has been added to avoid populating meta
> fields and leverage existing fields to populate record keys and partition
> paths.
> The bulk insert operation using the row writer is now enabled by default
> for faster inserts.
> Hudi's automatic cleaning of uncommitted data has been enhanced to be
> performant over cloud stores. You can learn
> more about this new centrally coordinated marker mechanism in this blog
> <https://hudi.apache.org/blog/2021/08/18/improving-marker-mechanism/>.
> Async Clustering support has been added to both DeltaStreamer and Spark
> Structured Streaming Sink. More on this
> can be found in this blog
> <https://hudi.apache.org/blog/2021/08/23/async-clustering/>.
> Users can choose to drop fields used to generate partition paths.
> Added support for a new write operation, "delete_partition", in Spark.
> Users can leverage this to delete older partitions in
> bulk, in addition to record level deletes.
> Added support for Huawei Cloud Object Storage, Baidu AFS storage format
> and Baidu BOS storage in Hudi.
> A pre-commit validator framework
> <https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SparkPreCommitValidator.java>
> has been added for the Spark engine, which can be used for DeltaStreamer and
> Spark
> Datasource writers. Users can leverage this to add any validations to be
> executed before committing writes to Hudi.
> A few out-of-the-box validators are available, such as
> SqlQueryEqualityPreCommitValidator
> <https://github.com/apache/hudi/blob/release-0.9.0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryEqualityPreCommitValidator.java>,
> SqlQueryInequalityPreCommitValidator
> <https://github.com/apache/hudi/blob/release-0.9.0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryInequalityPreCommitValidator.java>
> and SqlQuerySingleResultPreCommitValidator
> <https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQuerySingleResultPreCommitValidator.java>.
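A rough sketch of how one of the validators above might be switched on through writer options. The class name and package come from the linked source path; the two config keys, the "query#expected" encoding and the <TABLE_NAME> placeholder are assumptions from memory of the Hudi configuration reference and should be verified there. The helper name and the example query are hypothetical.

```python
from typing import Dict

# Package taken from the source path linked in the announcement.
VALIDATOR_PACKAGE = "org.apache.hudi.client.validator"


def single_result_validator_options(query: str, expected: str) -> Dict[str, str]:
    """Build writer options enabling SqlQuerySingleResultPreCommitValidator.

    Assumed (unverified) conventions: the query and its expected single
    value are joined with '#', and the config keys are as written below.
    """
    if "#" in query or "#" in expected:
        raise ValueError("'#' is used here as the query/result separator")
    return {
        "hoodie.precommit.validators": (
            f"{VALIDATOR_PACKAGE}.SqlQuerySingleResultPreCommitValidator"
        ),
        "hoodie.precommit.validators.single.value.sql.queries": (
            f"{query}#{expected}"
        ),
    }


if __name__ == "__main__":
    # Hypothetical check: the table must never contain a null record key.
    opts = single_result_validator_options(
        "select count(*) from <TABLE_NAME> where id is null", "0"
    )
    # With a Hudi-enabled session these would be merged into the usual
    # writer options: df.write.format("hudi").options(**opts)...
    for key, value in opts.items():
        print(f"{key}={value}")
```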
>
> *Flink Integration Improvements*
> The Flink writer now supports propagation of CDC format for MOR tables, by
> turning on the option "changelog.enabled=true".
> Hudi would then persist all change flags of each record, allowing users to
> do stateful computation based on these change logs.
> Flink writing is now close to feature parity with Spark writing, with the
> addition of write operations like "bulk_insert" and
> "insert_overwrite", support for non-partitioned tables, automatic cleanup
> of uncommitted data, global indexing support, Hive
> style partitioning and handling of partition path updates.
> Writing also supports a new log append mode, where no records are
> de-duplicated and base files are directly written for each flush.
> Flink readers now support streaming reads from COW/MOR tables. Deletions
> are emitted by default in streaming read mode; the
> downstream receives the "DELETE" message as a Hoodie record with empty
> payload.
> Hive sync has been improved by adding support for different Hive versions
> and asynchronous execution.
> The Flink Streamer tool now supports transformers.
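A sketch of what a Flink SQL table definition with the "changelog.enabled" option mentioned above might look like, built as a Python string so it can be inspected without a Flink cluster. Only "changelog.enabled" comes from the announcement; the table name, columns, path and the other WITH options ('connector', 'path', 'table.type') are assumptions based on the Hudi Flink connector documentation.

```python
def hudi_changelog_table_ddl(table: str, path: str) -> str:
    """Return a Flink SQL DDL for a MOR Hudi table with changelog mode on.

    The schema below is hypothetical and only serves the illustration.
    """
    return (
        f"CREATE TABLE {table} (\n"
        "  id INT PRIMARY KEY NOT ENFORCED,\n"
        "  name STRING,\n"
        "  ts TIMESTAMP(3)\n"
        ") WITH (\n"
        "  'connector' = 'hudi',\n"
        f"  'path' = '{path}',\n"
        "  'table.type' = 'MERGE_ON_READ',\n"
        # Option named in the announcement: keep all change flags per record.
        "  'changelog.enabled' = 'true'\n"
        ")"
    )


if __name__ == "__main__":
    # Hypothetical table name and storage path.
    print(hudi_changelog_table_ddl("orders_cdc", "file:///tmp/hudi/orders_cdc"))
```

The resulting statement would be submitted through the Flink SQL client or a TableEnvironment with the Hudi Flink bundle available.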
>
> *DeltaStreamer Improvements*
> We have enhanced the DeltaStreamer utility with 3 new sources. JDBC
> <https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JdbcSource.java>
> will help with fetching data from RDBMS sources and
> SqlSource
> <https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SqlSource.java>
> will assist in backfilling use cases. S3EventsHoodieIncrSource
> <https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java>
> and S3EventsSource
> <https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsSource.java>
> assist in reading data from S3
> reliably and efficiently ingesting it into Hudi. In addition, we have added
> support for timestamp based fetch from Kafka and added
> basic auth support for the schema registry.
>
> Please find more information about the release here:
> https://hudi.apache.org/releases/release-0.9.0
>
> For details on how to use Hudi, please look at the quick start page located
> here:
> https://hudi.apache.org/docs/quick-start-guide.html
>
> If you'd like to download the source release, you can find it here:
> https://github.com/apache/hudi/releases/tag/release-0.9.0
>
> You can read more about the release (including release notes) here:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350027
>
> We welcome your help and feedback. For more information on how to report
> problems, and to get involved, visit the project
> website at https://hudi.apache.org/
>
> Thanks to everyone involved!
>
> Udit Mehrotra
> (on behalf of the Hudi Community)
>
