Hi Hudi community!

From my point of view, there are two major Data Lakehouse scenarios where
performance is critical: Flink streaming write and Spark batch read. I’d
like to thank @cshuo and @danny0405 for their work on improving the
performance of Flink writes to Hudi tables by eliminating the Avro
coupling. A huge amount of effort has gone into [HUDI-9075] (
https://issues.apache.org/jira/browse/HUDI-9075).

Regarding Spark, achieving comparable performance improvements requires
completing the Datasource V2 integration. The first steps were taken in
RFC-38 (https://github.com/apache/hudi/pull/3964), which is now marked as
completed. However, advertising Hudi tables as V2 without fully
implementing the necessary APIs, and relying on a custom resolution rule
to fall back to V1, caused several issues, including a performance
regression that was addressed in [HUDI-4178] (
https://github.com/apache/hudi/pull/5737). As a result, the current
implementation of `HoodieCatalog` and `Spark3DefaultSource` returns a
`V1Table` instead of `HoodieInternalV2Table`, so we do not actually support
Datasource V2 yet.
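
To make "fully implementing the necessary APIs" concrete, below is a
minimal sketch of what a V2 table looks like under Spark's Datasource V2
interfaces (`Table`, `SupportsRead`). The class name and the stubbed scan
are mine for illustration, not Hudi's actual code; a real implementation
would wire `newScanBuilder` into Hudi's file index and readers instead of
returning a stub:

    import java.util
    import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.connector.read.{Scan, ScanBuilder}
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Hypothetical name; not the existing HoodieInternalV2Table.
    class HoodieV2TableSketch(tableSchema: StructType) extends Table with SupportsRead {

      override def name(): String = "hudi_v2_table_sketch"

      override def schema(): StructType = tableSchema

      override def partitioning(): Array[Transform] = Array.empty

      // Capabilities Spark consults during planning; BATCH_READ is the
      // minimum needed to keep reads on the V2 path instead of falling
      // back to a V1 relation.
      override def capabilities(): util.Set[TableCapability] =
        util.EnumSet.of(TableCapability.BATCH_READ)

      // The piece the current V1 fallback bypasses: a ScanBuilder that
      // should push column pruning and filters down into Hudi's file index.
      override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
        new ScanBuilder {
          override def build(): Scan = new Scan {
            override def readSchema(): StructType = tableSchema
            // toBatch() etc. would plug in Hudi's actual readers here.
          }
        }
    }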

I propose restarting the effort and implementing full Datasource V2
integration, and I'm ready to provide an initial design for community
discussion. For this purpose, I've opened a PR with a corresponding RFC
claim: https://github.com/apache/hudi/pull/13609. Attached is a diagram of
the current Spark integration with the missing parts highlighted. If you
think we can move forward in this direction, please let me know - I’d be
happy to contribute.

--
Sincerely,
Geser Dugarov
