Hi Hudi experts,

Spark 4 is coming out soon, and upgrading and supporting Spark has always been a crucial aspect of Hudi. However, current Hudi Spark upgrades are not smooth: they require at least one contributor to dedicate time to fixing compilation and test failures whenever a new Spark version is released. This work can take 1-3 weeks, depending on the complexity of the Spark API changes.

Here's a potential optimization: we can add a new module, *hudi-spark-master*, and set up a new GitHub CI pipeline that continuously builds changes from PRs against Spark master and hudi-spark-master. Ideally, whenever Spark finalizes a release, we can "checkout" hudi-spark-master to get the new Hudi-Spark support immediately.

It's important to note that this new CI pipeline is for detecting issues and should not block us from merging new PRs. If the CI fails, we should create a JIRA ticket with details of the failure and then proceed with the merge.

This new process provides several benefits:

- *Early Detection and Resolution of Compatibility Issues*: By continuously testing against the latest Spark master, compatibility issues can be identified and addressed early in the development cycle.
- *Better Collaboration*: Spark upgrade work can be broken down into smaller fixes, allowing more contributors to participate in the upgrade process.
- *Faster Spark Upgrades*: We will no longer need to wait for the Spark release to begin fixing compatibility issues. Ideally, Hudi will support the latest Spark as soon as the Spark release is finalized.
- *Previewing Spark Support*: Hudi users can use hudi-spark-master to try Hudi on the latest Spark release before Hudi's official support is out.

Please feel free to share your opinion. Any thoughts would be appreciated!

Best,
Shawn