Re: DISCUSS: Spark read optimization

Ethan Guo Fri, 25 Jul 2025 07:49:59 -0700

Hi Geser,

Thanks for bringing this up!


We should definitely revisit the Spark Datasource V2 integration.

Previously, the V1 fallback was introduced so that Spark Datasource reads
perform well with all optimization rules applied, whereas V2 reads
incurred a performance regression. When you mentioned "performance
improvements", which specific improvements are you referring to?

Looking forward to the full RFC.

Thanks,
- Ethan

On Thu, Jul 24, 2025 at 8:37 PM Geser Dugarov <geserduga...@gmail.com>
wrote:

> Hi Hudi community!
>
> From my point of view, there are two major Data Lakehouse scenarios where
> performance is critical: Flink streaming write and Spark batch read. I’d
> like to thank @cshuo and @danny0405 for their work on improving performance
> of Flink write to Hudi tables by eliminating Avro coupling. A huge amount
> of effort has been made under [HUDI-9075] (
> https://issues.apache.org/jira/browse/HUDI-9075).
>
> Regarding Spark, in order to achieve performance improvements, we need to
> complete Datasource V2 integration. First steps in this integration were
> taken in RFC-38 (https://github.com/apache/hudi/pull/3964), which is now
> marked as completed. However, there are still several issues with
> advertising Hudi tables as V2 without fully implementing the necessary
> APIs, and with using a custom resolution rule to fall back to V1. This led
> to a performance regression, which was addressed in [HUDI-4178] (
> https://github.com/apache/hudi/pull/5737). As a result, the current
> implementation of `HoodieCatalog` and `Spark3DefaultSource` returns a
> `V1Table` instead of `HoodieInternalV2Table`, so we do not actually support
> Datasource V2 yet.
>
> I propose restarting the effort and implementing full Datasource V2
> integration, and I'm ready to provide an initial design for community
> discussion. For this purpose, I've opened a PR with a corresponding RFC
> claim: https://github.com/apache/hudi/pull/13609. There is an attachment
> with the current Spark integration schema with the missing parts
> highlighted. If you think we can move forward in this direction, please let
> me know - I’d be happy to contribute.
>
>     --
>     Sincerely,
>     Geser Dugarov
>
