Hi Hudi developers,

I am writing to discuss the current code structure of the existing hudi-spark-datasource and to propose a more scalable approach for supporting multiple Spark versions. The current structure relies on common code shared by several Spark versions, such as hudi-spark-common, hudi-spark3-common, hudi-spark3.2plus-common, etc. (a detailed description can be found in the readme here: https://github.com/apache/hudi/blob/master/hudi-spark-datasource/README.md). This setup aims to minimize duplicate code in Hudi. Hudi currently uses the SparkAdapter to dispatch to version-specific code, so that each Spark version can trigger different logic.
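For readers less familiar with that mechanism, here is a rough sketch of the general shape of the adapter pattern: a common trait with one implementation per Spark version, selected at runtime from the Spark version on the classpath. The trait, class, and method names below are hypothetical and far simpler than Hudi's actual SparkAdapter, which exposes a much larger API surface; it only assumes a Spark dependency is available.

```scala
// Illustrative sketch of version-based dispatch (NOT Hudi's real SparkAdapter API;
// all names here are hypothetical).
import org.apache.spark.SPARK_VERSION

trait VersionAdapter {
  // Each supported Spark version provides its own implementation of
  // version-sensitive logic behind this common trait.
  def describeSupport(): String
}

class Spark32Adapter extends VersionAdapter {
  override def describeSupport(): String = "logic compiled against Spark 3.2.x APIs"
}

class Spark33Adapter extends VersionAdapter {
  override def describeSupport(): String = "logic compiled against Spark 3.3.x APIs"
}

object VersionAdapter {
  // The adapter is chosen once, based on the Spark version found at runtime.
  lazy val instance: VersionAdapter =
    if (SPARK_VERSION.startsWith("3.3")) new Spark33Adapter()
    else if (SPARK_VERSION.startsWith("3.2")) new Spark32Adapter()
    else throw new UnsupportedOperationException(s"Unsupported Spark version: $SPARK_VERSION")
}
```

In Hudi, each Spark-specific module supplies its own adapter implementation, and the shared code only programs against the common interface.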
However, this code structure has proven to be complex, and it hampers the process of adding support for newer Spark versions. The current approach involves the following steps:

1) Identify the breaking changes introduced by the new Spark version and patch the affected Hudi classes.
2) Separate the affected Hudi classes into different folders, so that older Spark versions keep using the existing logic while the new Spark version works with the updated classes.
3) Connect SparkAdapter to these Hudi classes, so that Hudi picks the correct code for the Spark version in use.
4) Collect common code and place it in a new folder, such as hudi-spark3.2plus-common, to reduce duplicate code.

This convoluted process has significantly slowed down the pace of adding support for newer Spark versions in Hudi. Fortunately, there is a simpler alternative that can streamline the process. I propose removing the common modules and keeping exactly one folder per Spark version. For example:

  hudi-spark-datasource/
    hudi-spark2.4.0/
    hudi-spark3.2.0/
    hudi-spark3.3.0/
    ...

With this revised code structure, each Spark version has its own corresponding Hudi module, and adding support for a new Spark version is simplified to:

1) Copy the latest existing hudi-spark module to a new module, hudi-spark<new_Spark_version>.
2) Identify the breaking changes introduced by the new Spark version and patch the affected Hudi classes.

Let's consider some pros and cons of this new code structure:

*Pros:*
- A more readable codebase, with each Spark version having its own module.
- Easier addition of support for new Spark versions: duplicate the most recent module and make the necessary modifications.
- Simpler implementation of improvements specific to a particular Spark version.

*Cons:*
- More duplicate code (though this should not affect the Hudi jar size at runtime, since the jar still contains support for only one Spark version).
- A general fix that applies to multiple Spark versions has to be applied to each Spark module instead of a single common codebase.

Please feel free to share your opinion; any feedback would be welcome! Thank you.

Best,
Shawn