The Apache Hudi team is pleased to announce the release of Apache Hudi
0.9.0.

This release comes almost 5 months after 0.8.0. It includes 387 resolved
issues, comprising new features as well as general improvements and bug
fixes. Here are a few quick highlights:

*Spark SQL DML and DDL Support*
We have added experimental support for DDL/DML using Spark SQL, taking a
huge step towards making Hudi more easily accessible and operable by all
personas (non-engineers, analysts, etc.). Users can now use SQL statements
like "CREATE TABLE ... USING HUDI" and "CREATE TABLE ... AS SELECT" to
create and manage tables in catalogs like Hive, and "INSERT", "INSERT
OVERWRITE", "UPDATE", "MERGE INTO" and "DELETE" statements to manipulate
data. For more information, check out our docs here
<https://hudi.apache.org/docs/quick-start-guide>, clicking on the SparkSQL
tab.
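
For illustration, here is a minimal sketch of these statements issued from
the spark-shell (the table name, columns, and option spellings are
illustrative; see the quick start guide for the exact syntax):

    // Assumes a Spark session with the Hudi Spark bundle and the Hudi SQL
    // extensions configured.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS hudi_trips (
        uuid STRING, rider STRING, fare DOUBLE, ts BIGINT
      ) USING hudi
      OPTIONS (primaryKey = 'uuid', preCombineField = 'ts')
    """)
    spark.sql("INSERT INTO hudi_trips VALUES ('id-1', 'rider-A', 19.10, 1000)")
    spark.sql("UPDATE hudi_trips SET fare = 25.0 WHERE uuid = 'id-1'")
    spark.sql("DELETE FROM hudi_trips WHERE uuid = 'id-1'")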

*Query Side Improvements*
Hudi tables are now registered with Hive as Spark datasource tables,
meaning Spark SQL on these tables now uses the datasource as well, instead
of relying on the Hive fallbacks within Spark, which are ill-maintained and
cumbersome. This unlocks many optimizations, such as the use of Hudi's own
FileIndex
<https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala#L46>
implementation for optimized caching, and the use of the Hudi metadata
table for faster listing of large tables. We have also added support for
time travel queries
<https://hudi.apache.org/docs/quick-start-guide#time-travel-query> for the
Spark datasource.
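
For example, a time travel read with the Spark datasource looks roughly
like this (the path and instant are illustrative; "as.of.instant" also
accepts a "yyyy-MM-dd HH:mm:ss" style timestamp):

    // Query the table as of a past commit instant.
    val df = spark.read.
      format("hudi").
      option("as.of.instant", "20210728141108").
      load("/tmp/hudi_trips")
    df.show()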

*Writer Side Improvements*
This release has several major writer-side improvements. Virtual key
support has been added to avoid populating meta fields, leveraging existing
fields to populate record keys and partition paths instead.
The bulk insert operation using the row writer is now enabled by default
for faster inserts.
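
As a hedged sketch, these writer-side knobs look like this in the
spark-shell (config key spellings follow the 0.9.0 configuration docs; the
DataFrame, fields, and path are illustrative):

    // df is an input DataFrame with uuid/region columns (illustrative).
    df.write.format("hudi").
      option("hoodie.table.name", "hudi_trips").
      option("hoodie.datasource.write.recordkey.field", "uuid").
      option("hoodie.datasource.write.partitionpath.field", "region").
      // virtual keys: do not materialize Hudi meta fields in data files
      option("hoodie.populate.meta.fields", "false").
      // row-writer path for bulk_insert (now the default, shown explicitly)
      option("hoodie.datasource.write.row.writer.enable", "true").
      option("hoodie.datasource.write.operation", "bulk_insert").
      mode("overwrite").
      save("/tmp/hudi_trips")
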
Hudi's automatic cleaning of uncommitted data has been enhanced to be
performant over cloud stores. You can learn
more about this new centrally coordinated marker mechanism in this blog
<https://hudi.apache.org/blog/2021/08/18/improving-marker-mechanism/>.
Async Clustering support has been added to both DeltaStreamer and Spark
Structured Streaming Sink. More on this
can be found in this blog
<https://hudi.apache.org/blog/2021/08/23/async-clustering/>.
Users can choose to drop fields used to generate partition paths.
Added support for a new "delete_partition" write operation in Spark. Users
can leverage this to delete older partitions in bulk, in addition to
record-level deletes.
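
A minimal sketch of a partition delete via the Spark datasource, assuming
the option keys from the datasource writer configs (the partition values
and path are illustrative):

    // An empty DataFrame suffices; the partitions to drop are passed as
    // options.
    spark.emptyDataFrame.write.format("hudi").
      option("hoodie.table.name", "hudi_trips").
      option("hoodie.datasource.write.operation", "delete_partition").
      option("hoodie.datasource.write.partitions.to.delete",
        "region=EU,region=US").
      mode("append").
      save("/tmp/hudi_trips")
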
Added support for Huawei Cloud Object Storage, Baidu AFS, and Baidu BOS
storage in Hudi.
A pre-commit validator framework
<https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SparkPreCommitValidator.java>
has been added for the Spark engine, which can be used with DeltaStreamer
and Spark datasource writers. Users can leverage this to add any
validations to be executed before committing writes to Hudi. A few
out-of-the-box validators are available, such as
SqlQueryEqualityPreCommitValidator
SqlQueryEqualityPreCommitValidator
<https://github.com/apache/hudi/blob/release-0.9.0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryEqualityPreCommitValidator.java>,
SqlQueryInequalityPreCommitValidator
<https://github.com/apache/hudi/blob/release-0.9.0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryInequalityPreCommitValidator.java>
and SqlQuerySingleResultPreCommitValidator
<https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQuerySingleResultPreCommitValidator.java>
.
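
As a hedged example, wiring the equality validator into a Spark datasource
write might look like this (the query text and the <TABLE_NAME> placeholder
follow the validator javadocs; treat them as illustrative):

    // df is the batch of incoming records (illustrative). The validator
    // fails the commit if the query result differs before vs. after the
    // write.
    df.write.format("hudi").
      option("hoodie.table.name", "hudi_trips").
      option("hoodie.precommit.validators",
        "org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator").
      option("hoodie.precommit.validators.equality.sql.queries",
        "select count(*) from <TABLE_NAME> where fare is null").
      mode("append").
      save("/tmp/hudi_trips")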

*Flink Integration Improvements*
The Flink writer now supports propagation of the CDC format for MOR
tables, enabled by the option "changelog.enabled=true". Hudi then persists
all change flags of each record, allowing users to do stateful computation
based on these change logs.
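
A hedged Flink SQL sketch of turning this on for a MOR table (the schema,
path, and environment setup are illustrative; requires the hudi-flink
bundle):

    import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

    val tEnv = TableEnvironment.create(
      EnvironmentSettings.newInstance().inStreamingMode().build())
    tEnv.executeSql(
      """CREATE TABLE hudi_mor_cdc (
        |  uuid STRING,
        |  fare DOUBLE,
        |  ts TIMESTAMP(3),
        |  PRIMARY KEY (uuid) NOT ENFORCED
        |) WITH (
        |  'connector' = 'hudi',
        |  'path' = 'file:///tmp/hudi_mor_cdc',
        |  'table.type' = 'MERGE_ON_READ',
        |  'changelog.enabled' = 'true'
        |)""".stripMargin)
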
Flink writing is now close to feature parity with Spark writing, with the
addition of write operations like "bulk_insert" and "insert_overwrite",
support for non-partitioned tables, automatic cleanup of uncommitted data,
global indexing support, Hive-style partitioning, and handling of partition
path updates.
Writing also supports a new log append mode, where no records are
de-duplicated and base files are written directly on each flush.
Flink readers now support streaming reads from COW/MOR tables. Deletions
are emitted by default in streaming read mode; the downstream receives the
"DELETE" message as a Hoodie record with an empty payload.
Hive sync has been improved with support for different Hive versions and
asynchronous execution.
The Flink Streamer tool now supports transformers.

*DeltaStreamer Improvements*
We have enhanced the DeltaStreamer utility with three new sources.
JdbcSource
<https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JdbcSource.java>
will help with fetching data from RDBMS sources, and SqlSource
<https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SqlSource.java>
will assist with backfilling use cases. S3EventsHoodieIncrSource
<https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java>
and S3EventsSource
<https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsSource.java>
together assist in reliably reading data from S3 and efficiently ingesting
it into Hudi. In addition, we have added support for timestamp-based
fetching from Kafka and basic auth support for the schema registry.

Please find more information about the release here:
https://hudi.apache.org/releases/release-0.9.0

For details on how to use Hudi, please look at the quick start page located
here:
https://hudi.apache.org/docs/quick-start-guide.html

If you'd like to download the source release, you can find it here:
https://github.com/apache/hudi/releases/tag/release-0.9.0

You can read more about the release (including release notes) here:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350027

We welcome your help and feedback. For more information on how to report
problems, and to get involved, visit the project
website at https://hudi.apache.org/

Thanks to everyone involved!

Udit Mehrotra
(on behalf of the Hudi Community)
