Hi Hudi experts,

Spark 4 is coming out soon, and upgrading and supporting Spark has always
been a crucial aspect of Hudi. However, current Hudi Spark upgrades are not
smooth: they require at least one contributor to dedicate time to fixing
compilation and test failures whenever a new Spark version is released.
This work can take 1-3 weeks, depending on the complexity of the Spark API
changes.

Here's a potential optimization: We can add a new module,
*hudi-spark-master*, and set up a new GitHub CI pipeline to continuously
build changes from PRs using Spark master and hudi-spark-master. Ideally,
whenever Spark finalizes a release, we can "check out" hudi-spark-master
to get the new Hudi-Spark support immediately. Note that
this new CI pipeline is for detecting issues and should not block us from
merging new PRs. If the CI fails, we should create a JIRA ticket with
details of the failure, and then proceed with the merge.

This new process provides many benefits:
- *Early Detection and Resolution of Compatibility Issues*: By continuously
testing against the latest Spark master, compatibility issues can be
identified and addressed early in the development cycle.
- *Better Collaboration*: Spark upgrade work can be broken down into
smaller fixes, allowing more contributors to participate in the upgrade
process.
- *Faster Spark Upgrades*: We will no longer need to wait for the Spark
release to begin fixing compatibility issues. Ideally, Hudi will support
the latest Spark as soon as the Spark release is finalized.
- *Previewing Spark Support*: Hudi users can use hudi-spark-master to try
Hudi on the latest Spark release before Hudi's official support is out.

Please feel free to share your opinion; any thoughts would be appreciated!

Best,
Shawn
