wombatu-kun commented on PR #12772: URL: https://github.com/apache/hudi/pull/12772#issuecomment-2942585871
> @wombatu-kun I see a lot of complexities are brought by the `InternalRow` variant data type and `Utf8String`, it's great if we can limit the changes just in the `hudi-spark-datasource/hudi-spark4.0.x` module (by copying the referenced utility class/method or maybe maintaining a separate module for these incompatible classes) so we have enough confidence to land it quickly, some compatibility issues can be addressed by the `Sparkx_xAdapter` I guess.

All these complexities come from just the Spark 4.0.0**-preview1** version; with the released Spark **4.0.0** the situation is even worse, because there are many breaking changes: several often-used classes were moved to different packages (e.g. `SparkSession`, `SQLContext`, and `Dataset`, all used in Hudi, are now located in the `org.apache.spark.sql.classic` package), new arguments were added to some constructors and `unapply` methods (e.g. `LogicalRDD`, `LogicalRelation`), etc. These changed classes are the basic APIs for integrating with Spark and are used frequently even in `hudi-spark-client` (the fundamental common module for all Spark versions, as you know).

So, if we want to avoid the complexity brought by the Spark 4.0.0 changes, and avoid any risk of breaking compatibility or regressing performance on Spark 3.x, we have to make the `hudi-spark4.0.x` module essentially self-contained: copy the code of `hudi-spark-client` and `hudi-spark-common` into `hudi-spark4.0.x` (and remove the `hudi-spark4.0.x` module's dependencies on them), then make all classes in this 'super' module compatible with the Spark 4.0.0 release. There would be a lot of copy-paste in `hudi-spark4.0.x`, but no **Spark 3.x** code would change at all, and we would have working Spark 4 support.

@yihua says it's unmaintainable to copy classes as suggested, but I don't see any better way to get Spark 4 support without complicating the existing Spark 3.x code. @danny0405 @yihua let's make a decision here and now.
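For context, a minimal sketch of the adapter-pattern idea mentioned above. All class and method names here are illustrative assumptions, not Hudi's actual `SparkAdapter` API: version-agnostic modules program against an interface, and each version-specific module supplies an implementation that hides incompatibilities such as the `SparkSession` package move in Spark 4.0.0.

```java
// Hypothetical sketch of the adapter pattern ("Sparkx_xAdapter" above).
// These names are illustrative, not Hudi's real SparkAdapter interface.

// Version-agnostic code would call only this interface.
interface SparkVersionAdapter {
    // Fully-qualified name of the SparkSession class for this Spark version.
    String sparkSessionClassName();
}

// Spark 3.x: core SQL classes live in org.apache.spark.sql.
class Spark3Adapter implements SparkVersionAdapter {
    public String sparkSessionClassName() {
        return "org.apache.spark.sql.SparkSession";
    }
}

// Spark 4.0.0: SparkSession (with SQLContext, Dataset, etc.) moved
// to the org.apache.spark.sql.classic package.
class Spark4Adapter implements SparkVersionAdapter {
    public String sparkSessionClassName() {
        return "org.apache.spark.sql.classic.SparkSession";
    }
}

class AdapterDemo {
    public static void main(String[] args) {
        // Common code stays identical; only the adapter binding differs
        // per build module.
        SparkVersionAdapter adapter = new Spark4Adapter();
        System.out.println(adapter.sparkSessionClassName());
    }
}
```

The catch, as described above, is that this only works for differences that can be funneled through an interface; wholesale package moves and changed constructor signatures leak into every module that touches those classes directly.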
I can create this self-contained Spark 4.0.x module in a new PR if you decide that's the way to go. Btw, Apache Iceberg organizes support for all Spark versions exactly like that: one Spark version = one iceberg-spark module, with no common Spark-related code shared between the modules, and they don't seem to have significant maintenance problems.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
