Hi Ethan,

When I mentioned "performance improvements", I was referring to better integration with the Spark optimizer and enhancements in use cases that leverage the additional functionality of Datasource V2. For instance, Datasource V2 provides interfaces like `SupportsPushDownLimit` and `SupportsPushDownAggregates`, which can significantly improve the performance of queries involving filters, aggregations, and limits, compared to Datasource V1.
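To make that concrete, here is a minimal sketch of a DSv2 `ScanBuilder` opting into these push-downs. The class name and the choice of what gets pushed are hypothetical, and it assumes the connector APIs of Spark 3.4+ (e.g. `Aggregation.groupByExpressions`), not an actual Hudi implementation:

```scala
import org.apache.spark.sql.connector.expressions.aggregate.{Aggregation, CountStar}
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownAggregates, SupportsPushDownLimit}
import org.apache.spark.sql.types.StructType

// Hypothetical builder; a real one would carry the Hudi table state.
class HoodieV2ScanBuilder extends ScanBuilder
    with SupportsPushDownLimit
    with SupportsPushDownAggregates {

  private var pushedLimit: Option[Int] = None
  private var pushedAggregation: Option[Aggregation] = None

  // Spark calls this during optimization; returning true tells Spark the
  // source will enforce the LIMIT itself, so fewer rows are read.
  override def pushLimit(limit: Int): Boolean = {
    pushedLimit = Some(limit)
    true
  }

  // As an example, accept only an ungrouped COUNT(*); such queries can
  // often be answered from file-level metadata without scanning data.
  override def pushAggregation(aggregation: Aggregation): Boolean = {
    val onlyCountStar = aggregation.aggregateExpressions.forall(_.isInstanceOf[CountStar])
    if (onlyCountStar && aggregation.groupByExpressions.isEmpty) {
      pushedAggregation = Some(aggregation)
      true
    } else {
      false
    }
  }

  override def build(): Scan = new Scan {
    // A real implementation would plan batches that honor
    // pushedLimit / pushedAggregation.
    override def readSchema(): StructType = StructType(Nil)
  }
}
```

With the current V1 fallback, none of these hooks are ever reached, which is exactly why completing the V2 integration matters.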
If you're okay with it, could we land the corresponding RFC claim first?
https://github.com/apache/hudi/pull/13609

--
Best regards,
Geser

On Fri, Jul 25, 2025 at 9:50 PM Y Ethan Guo <yi...@apache.org> wrote:

> Hi Geser,
>
> Thanks for bringing this up!
>
> We should definitely revisit the Spark Datasource V2 integration.
>
> Previously, the V1 fallback was done so that Spark Datasource reads
> provide good performance with all optimization rules applied, whereas V2
> reads incurred a performance regression on the read side. When you
> mentioned "performance improvements", what are the specific improvements
> you're referring to?
>
> Looking forward to the full RFC.
>
> Thanks,
> - Ethan
>
> On Thu, Jul 24, 2025 at 8:37 PM Geser Dugarov <geserduga...@gmail.com>
> wrote:
>
> > Hi Hudi community!
> >
> > From my point of view, there are two major Data Lakehouse scenarios
> > where performance is critical: Flink streaming write and Spark batch
> > read. I’d like to thank @cshuo and @danny0405 for their work on
> > improving the performance of Flink writes to Hudi tables by eliminating
> > Avro coupling. A huge amount of effort has been made under [HUDI-9075]
> > (https://issues.apache.org/jira/browse/HUDI-9075).
> >
> > Regarding Spark, in order to achieve performance improvements, we need
> > to complete the Datasource V2 integration. The first steps in this
> > integration were taken in RFC-38
> > (https://github.com/apache/hudi/pull/3964), which is now marked as
> > completed. However, there are still several issues with advertising
> > Hudi tables as V2 without fully implementing the necessary APIs, and
> > with using a custom resolution rule to fall back to V1. This led to a
> > performance regression, which was addressed in [HUDI-4178]
> > (https://github.com/apache/hudi/pull/5737). As a result, the current
> > implementation of `HoodieCatalog` and `Spark3DefaultSource` returns a
> > `V1Table` instead of `HoodieInternalV2Table`, so we do not actually
> > support Datasource V2 yet.
> >
> > I propose restarting the effort and implementing full Datasource V2
> > integration, and I'm ready to provide an initial design for community
> > discussion. For this purpose, I've opened a PR with a corresponding RFC
> > claim: https://github.com/apache/hudi/pull/13609. There is an
> > attachment with the current Spark integration schema with the missing
> > parts highlighted. If you think we can move forward in this direction,
> > please let me know - I’d be happy to contribute.
> >
> > --
> > Sincerely,
> > Geser Dugarov
> >