The Apache Hudi team is pleased to announce the release of Apache Hudi 0.9.0.
This release comes almost 5 months after 0.8.0. It includes 387 resolved issues, comprising new features as well as general improvements and bug fixes. Here are a few quick highlights:

*Spark SQL DML and DDL Support*

We have added experimental support for DDL/DML using Spark SQL, taking a huge step towards making Hudi more easily accessible and operable by all personas (non-engineers, analysts, etc.). Users can now use SQL statements like "CREATE TABLE ... USING HUDI" and "CREATE TABLE ... AS SELECT" to create and manage tables in catalogs like Hive, and "INSERT", "INSERT OVERWRITE", "UPDATE", "MERGE INTO" and "DELETE" statements to manipulate data. For more information, check out our docs at <https://hudi.apache.org/docs/quick-start-guide> and click on the SparkSQL tab.

*Query Side Improvements*

Hudi tables are now registered with Hive as Spark datasource tables, meaning Spark SQL queries on these tables now use the datasource path as well, instead of relying on the ill-maintained and cumbersome Hive fallbacks within Spark. This unlocks many optimizations, such as Hudi's own FileIndex <https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala#L46> implementation for optimized caching, and the use of the Hudi metadata table for faster listing of large tables. We have also added support for time travel queries <https://hudi.apache.org/docs/quick-start-guide#time-travel-query> for the Spark datasource.

*Writer Side Improvements*

This release has several major writer side improvements. Virtual key support has been added to avoid populating meta fields, leveraging existing fields to populate record keys and partition paths. The bulk insert operation using the row writer is now enabled by default for faster inserts. Hudi's automatic cleaning of uncommitted data has been enhanced to be performant over cloud stores.
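As a quick sketch of the new SQL paths, creating a Hudi table and upserting into it might look like the following (table names, schema, and key configuration are illustrative; see the quick start guide for the exact syntax supported):

```sql
-- Create a Hudi table through Spark SQL (schema and names are illustrative)
CREATE TABLE IF NOT EXISTS hudi_trips (
  uuid  STRING,
  rider STRING,
  fare  DOUBLE,
  ts    BIGINT
) USING HUDI
OPTIONS (
  type = 'cow',
  primaryKey = 'uuid',
  preCombineField = 'ts'
);

-- Upsert staged changes: update matching rows, insert the rest
MERGE INTO hudi_trips AS target
USING trip_updates AS source
ON target.uuid = source.uuid
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

These statements run inside a Spark session with the Hudi Spark bundle on the classpath; the "MERGE INTO" path is what makes record-level upserts expressible directly in SQL.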
You can learn more about this new centrally coordinated marker mechanism in this blog <https://hudi.apache.org/blog/2021/08/18/improving-marker-mechanism/>. Async clustering support has been added to both DeltaStreamer and the Spark Structured Streaming sink; more on this can be found in this blog <https://hudi.apache.org/blog/2021/08/23/async-clustering/>. Users can now choose to drop the fields used to generate partition paths. A new write operation, "delete_partition", has been added in Spark; users can leverage it to delete older partitions in bulk, in addition to record-level deletes. We have added support for Huawei Cloud Object Storage, Baidu AFS, and Baidu BOS storage in Hudi. A pre-commit validator framework <https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SparkPreCommitValidator.java> has been added for the Spark engine, usable with both the DeltaStreamer and Spark datasource writers; users can leverage it to run any validations before committing writes to Hudi. A few out-of-the-box validators are available, such as SqlQueryEqualityPreCommitValidator <https://github.com/apache/hudi/blob/release-0.9.0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryEqualityPreCommitValidator.java>, SqlQueryInequalityPreCommitValidator <https://github.com/apache/hudi/blob/release-0.9.0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryInequalityPreCommitValidator.java> and SqlQuerySingleResultPreCommitValidator <https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQuerySingleResultPreCommitValidator.java>.

*Flink Integration Improvements*

The Flink writer now supports propagation of the CDC format for MOR tables, by turning on the option "changelog.enabled=true".
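As a sketch, turning on changelog mode from Flink SQL might look like the following (the "changelog.enabled" option is the one named above; the table schema, path, and remaining options are illustrative):

```sql
-- Illustrative Flink SQL DDL for a MOR table with changelog mode enabled
CREATE TABLE hudi_orders (
  uuid   VARCHAR(20),
  amount DOUBLE,
  ts     TIMESTAMP(3),
  PRIMARY KEY (uuid) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_orders',
  'table.type' = 'MERGE_ON_READ',
  -- persist +I/-U/+U/-D change flags instead of merging them away
  'changelog.enabled' = 'true'
);
```

With changelog mode on, downstream streaming readers see the individual change flags of each record rather than only the merged final state.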
Hudi will then persist all change flags for each record, allowing users to do stateful computation based on these changelogs. Flink writing is now close to feature parity with Spark writing, with the addition of write operations like "bulk_insert" and "insert_overwrite", support for non-partitioned tables, automatic cleanup of uncommitted data, global indexing support, Hive-style partitioning, and handling of partition path updates. Writing also supports a new log append mode, where no records are de-duplicated and base files are written directly on each flush. Flink readers now support streaming reads from COW/MOR tables. Deletions are emitted by default in streaming read mode; the downstream receives each "DELETE" message as a Hoodie record with an empty payload. Hive sync has been improved with support for different Hive versions and asynchronous execution. The Flink Streamer tool now supports transformers.

*DeltaStreamer Improvements*

We have enhanced the DeltaStreamer utility with 3 new sources. JdbcSource <https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JdbcSource.java> helps with fetching data from RDBMS sources, while SqlSource <https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SqlSource.java> assists in backfilling use cases. S3EventsHoodieIncrSource <https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java>, together with S3EventsSource <https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsSource.java>, assists in reliably reading data from S3 and efficiently ingesting it into Hudi. In addition, we have added support for timestamp-based fetch from Kafka and basic auth support for the schema registry.
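As an illustrative sketch, a properties file wiring the new JdbcSource into DeltaStreamer might look like this (property keys follow JdbcSource's configuration; the connection values, table names, and column names are placeholders):

```properties
# Illustrative JdbcSource configuration for DeltaStreamer
# (pass via --props, together with
#  --source-class org.apache.hudi.utilities.sources.JdbcSource)
hoodie.deltastreamer.jdbc.url=jdbc:mysql://localhost:3306/shop
hoodie.deltastreamer.jdbc.driver.class=com.mysql.cj.jdbc.Driver
hoodie.deltastreamer.jdbc.user=hudi
hoodie.deltastreamer.jdbc.password=secret
hoodie.deltastreamer.jdbc.table.name=orders
# Pull incrementally, checkpointing on a monotonically increasing column
hoodie.deltastreamer.jdbc.incr.pull=true
hoodie.deltastreamer.jdbc.table.incr.column.name=updated_at
```

With incremental pull enabled, each DeltaStreamer run fetches only rows whose checkpoint column advanced past the last committed checkpoint, rather than re-reading the whole table.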
Please find more information about the release here: https://hudi.apache.org/releases/release-0.9.0

For details on how to use Hudi, please look at the quick start page located here: https://hudi.apache.org/docs/quick-start-guide.html

If you'd like to download the source release, you can find it here: https://github.com/apache/hudi/releases/tag/release-0.9.0

You can read more about the release (including release notes) here: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350027

We welcome your help and feedback. For more information on how to report problems, and to get involved, visit the project website at https://hudi.apache.org/

Thanks to everyone involved!

Udit Mehrotra
(on behalf of the Hudi Community)