wombatu-kun commented on PR #12772: URL: https://github.com/apache/hudi/pull/12772#issuecomment-2942585871
> @wombatu-kun I see a lot of complexities are brought by the `InternalRow` variant data type and `Utf8String`, it's great if we can limit the changes just in the `hudi-spark-datasource/hudi-spark4.0.x` module (by copying the referenced utility class/method or maybe maintaining a separate module for these incompatible classes) so we have enough confidence to land it quickly, some compatibility issues can be addressed by the `Sparkx_xAdapter` I guess.

All these complexities come from just the Spark 4.0.0**-preview1** version; with the released Spark **4.0.0** the situation is even worse, because there are many breaking changes: several often-used classes were moved to different packages (e.g. `SparkSession`, `SQLContext`, and `Dataset`, all used in Hudi, are now located in the `org.apache.spark.sql.classic` package), new arguments were added to some constructors and `unapply` methods (e.g. `LogicalRDD`, `LogicalRelation`), etc. These changed classes are the basic APIs for integrating with Spark and are used frequently even in `hudi-spark-client` (the fundamental common module for all Spark versions, as you know).

So, if we want to avoid the complexity brought by the Spark 4.0.0 changes, and avoid any risk of breaking compatibility or regressing performance on Spark 3.x, we have to make the `hudi-spark4.0.x` module essentially self-contained: copy the code of `hudi-spark-client` and `hudi-spark-common` into `hudi-spark4.0.x` (and remove the `hudi-spark4.0.x` module's dependencies on them), then make all classes in this 'super' module compatible with the Spark 4.0.0 release. There would be a lot of copy-paste in `hudi-spark4.0.x`, but no **Spark 3.x** code would change at all, and we would have working Spark 4 support.

@yihua says it's unmaintainable to copy classes as suggested, but I don't see any better way to get Spark 4 support without complicating the existing Spark 3.x code. @danny0405 @yihua let's make a decision here and now.
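For context, a minimal sketch of the adapter-pattern idea mentioned above. All class and method names here are illustrative assumptions, not Hudi's actual `SparkAdapter` API: version-agnostic modules program against an interface, and each version-specific module supplies an implementation that hides incompatibilities such as the `SparkSession` package move in Spark 4.0.0.

```java
// Hypothetical sketch of the adapter pattern ("Sparkx_xAdapter" above).
// These names are illustrative, not Hudi's real SparkAdapter interface.

// Version-agnostic code would call only this interface.
interface SparkVersionAdapter {
    // Fully-qualified name of the SparkSession class for this Spark version.
    String sparkSessionClassName();
}

// Spark 3.x: core SQL classes live in org.apache.spark.sql.
class Spark3Adapter implements SparkVersionAdapter {
    public String sparkSessionClassName() {
        return "org.apache.spark.sql.SparkSession";
    }
}

// Spark 4.0.0: SparkSession (with SQLContext, Dataset, etc.) moved
// to the org.apache.spark.sql.classic package.
class Spark4Adapter implements SparkVersionAdapter {
    public String sparkSessionClassName() {
        return "org.apache.spark.sql.classic.SparkSession";
    }
}

class AdapterDemo {
    public static void main(String[] args) {
        // Common code stays identical; only the adapter binding differs
        // per build module.
        SparkVersionAdapter adapter = new Spark4Adapter();
        System.out.println(adapter.sparkSessionClassName());
    }
}
```

The catch, as described above, is that this only works for differences that can be funneled through an interface; wholesale package moves and changed constructor signatures leak into every module that touches those classes directly.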
I can create this self-contained Spark 4.0.x module in a new PR if you decide that's the way to go. Btw, Apache Iceberg organizes support for all Spark versions exactly like that: one Spark version = one iceberg-spark module, with no common Spark-related code shared between the modules, and they don't seem to have significant maintenance problems.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
