Hi Hudi community!

From my point of view, there are two major Data Lakehouse scenarios where performance is critical: Flink streaming writes and Spark batch reads. I'd like to thank @cshuo and @danny0405 for their work on improving the performance of Flink writes to Hudi tables by eliminating the Avro coupling. A huge amount of effort has gone into [HUDI-9075] (https://issues.apache.org/jira/browse/HUDI-9075).
Regarding Spark, achieving comparable improvements requires completing the Datasource V2 integration. The first steps were taken in RFC-38 (https://github.com/apache/hudi/pull/3964), which is now marked as completed. However, advertising Hudi tables as V2 without fully implementing the necessary APIs, and relying on a custom resolution rule to fall back to V1, caused several issues, including a performance regression that was later addressed in [HUDI-4178] (https://github.com/apache/hudi/pull/5737). As a result, the current implementations of `HoodieCatalog` and `Spark3DefaultSource` return a `V1Table` instead of a `HoodieInternalV2Table`, so we do not actually support Datasource V2 yet.

I propose restarting this effort and implementing full Datasource V2 integration, and I'm ready to provide an initial design for community discussion. For this purpose, I've opened a PR with the corresponding RFC claim: https://github.com/apache/hudi/pull/13609. Attached is a diagram of the current Spark integration with the missing parts highlighted; a rough sketch of the main missing piece is also included below my signature.

If you think we can move forward in this direction, please let me know - I'd be happy to contribute.

--
Sincerely,
Geser Dugarov
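P.S. To make the gap concrete, here is a minimal sketch, assuming Spark 3's connector catalog APIs, of the kind of table implementation that full Datasource V2 support implies. The class name `HoodieV2TableSketch` and the stubbed method bodies are hypothetical illustrations for discussion, not actual Hudi code:

import java.util

import scala.collection.JavaConverters._

import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability}
import org.apache.spark.sql.connector.read.ScanBuilder
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical sketch of a Hudi table exposed as a genuine V2 table.
// Today HoodieCatalog and Spark3DefaultSource hand Spark a V1Table,
// which pins reads to the V1 code path.
class HoodieV2TableSketch(tableName: String, tableSchema: StructType)
  extends Table with SupportsRead {

  override def name(): String = tableName

  override def schema(): StructType = tableSchema

  // Advertise only what is actually implemented; further capabilities
  // (e.g. BATCH_WRITE) would be added as the integration progresses.
  override def capabilities(): util.Set[TableCapability] =
    Set(TableCapability.BATCH_READ).asJava

  // The missing piece: a native ScanBuilder, so Spark plans reads
  // through the V2 path instead of falling back to a V1 relation.
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    throw new UnsupportedOperationException(
      "to be designed as part of the proposed RFC")
}

If a table like this were returned from `HoodieCatalog`'s loadTable, Spark would resolve queries to a V2 relation directly, and the custom V1 fallback rule would no longer be needed for reads.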