Got it, that makes sense to me.

We should ensure that the same performance optimizations that take effect
on Spark Datasource V1 also apply to Hudi on Spark Datasource V2.

I've approved and landed the RFC claim.

Thanks,
- Ethan

On Tue, Jul 29, 2025 at 3:20 AM Geser Dugarov <geserduga...@gmail.com>
wrote:

> Hi Ethan,
>
> When I mentioned "performance improvements", I was referring to better
> integration with the Spark optimizer and enhancements in use cases that
> leverage the additional functionality of Datasource V2. For instance,
> Datasource V2 provides interfaces such as `SupportsPushDownLimit` and
> `SupportsPushDownAggregates`, which can significantly improve the
> performance of queries involving filters, aggregations, and limits
> compared to Datasource V1.
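>
> To make this concrete, below is a minimal sketch of the API shape (not
> Hudi code; the class name and bookkeeping are hypothetical, assuming the
> Spark 3.3+ DSv2 interfaces):
>
> ```scala
> import org.apache.spark.sql.connector.expressions.aggregate.Aggregation
> import org.apache.spark.sql.connector.read.{Scan, ScanBuilder,
>   SupportsPushDownAggregates, SupportsPushDownLimit}
>
> // Hypothetical ScanBuilder showing how a V2 source accepts pushdown
> // from the Spark optimizer.
> class ExampleScanBuilder extends ScanBuilder
>     with SupportsPushDownLimit
>     with SupportsPushDownAggregates {
>
>   private var pushedLimit: Option[Int] = None
>
>   // Returning true tells Spark this source will honor the limit itself,
>   // so the scan can stop early instead of reading everything.
>   override def pushLimit(limit: Int): Boolean = {
>     pushedLimit = Some(limit)
>     true
>   }
>
>   // Return true only when the aggregate can be answered cheaply, e.g.
>   // from metadata such as column stats; otherwise Spark keeps its own
>   // aggregation on top of the scan.
>   override def pushAggregation(aggregation: Aggregation): Boolean = false
>
>   override def build(): Scan =
>     throw new UnsupportedOperationException("illustration only")
> }
> ```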
>
> If you're okay with it, could we land the corresponding RFC claim first?
> https://github.com/apache/hudi/pull/13609
>
>     --
>     Best regards,
>     Geser
>
> On Fri, Jul 25, 2025 at 9:50 PM Y Ethan Guo <yi...@apache.org> wrote:
>
> > Hi Geser,
> >
> > Thanks for bringing this up!
> >
> > We should definitely revisit the Spark Datasource V2 integration.
> >
> > Previously, the V1 fallback was introduced so that Spark Datasource reads
> > deliver good performance with all optimization rules applied, whereas V2
> > reads incurred a performance regression. When you mentioned "performance
> > improvements", what are the specific improvements you're referring to?
> >
> > Looking forward to the full RFC.
> >
> > Thanks,
> > - Ethan
> >
> > On Thu, Jul 24, 2025 at 8:37 PM Geser Dugarov <geserduga...@gmail.com>
> > wrote:
> >
> > > Hi Hudi community!
> > >
> > > From my point of view, there are two major Data Lakehouse scenarios
> > > where performance is critical: Flink streaming writes and Spark batch
> > > reads. I’d like to thank @cshuo and @danny0405 for their work on
> > > improving the performance of Flink writes to Hudi tables by eliminating
> > > the Avro coupling. A huge amount of effort has been put into [HUDI-9075]
> > > (https://issues.apache.org/jira/browse/HUDI-9075).
> > >
> > > Regarding Spark, achieving performance improvements requires completing
> > > the Datasource V2 integration. The first steps were taken in RFC-38
> > > (https://github.com/apache/hudi/pull/3964), which is now marked as
> > > completed. However, there are still several issues with advertising
> > > Hudi tables as V2 without fully implementing the necessary APIs, and
> > > with using a custom resolution rule to fall back to V1. This led to a
> > > performance regression, which was addressed in [HUDI-4178]
> > > (https://github.com/apache/hudi/pull/5737). As a result, the current
> > > implementation of `HoodieCatalog` and `Spark3DefaultSource` returns a
> > > `V1Table` instead of a `HoodieInternalV2Table`, so we do not actually
> > > support Datasource V2 yet.
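> > >
> > > For reference, the current fallback shape is roughly as follows (a
> > > simplified, hypothetical sketch rather than the actual Hudi code; note
> > > that Spark's `V1Table` is `private[sql]`, so real code must live under
> > > the `org.apache.spark.sql` package):
> > >
> > > ```scala
> > > package org.apache.spark.sql.hudi.example // V1Table is private[sql]
> > >
> > > import org.apache.spark.sql.catalyst.catalog.CatalogTable
> > > import org.apache.spark.sql.connector.catalog.{Table, V1Table}
> > >
> > > // Wrapping the catalog table in V1Table makes Spark plan reads and
> > > // writes through the V1 code path, so the V2 interfaces (pushdown,
> > > // etc.) never come into play.
> > > object V1FallbackSketch {
> > >   def loadTable(catalogTable: CatalogTable): Table = V1Table(catalogTable)
> > > }
> > > ```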
> > >
> > > I propose restarting the effort and implementing full Datasource V2
> > > integration, and I'm ready to provide an initial design for community
> > > discussion. For this purpose, I've opened a PR with the corresponding
> > > RFC claim: https://github.com/apache/hudi/pull/13609. It includes an
> > > attachment showing the current Spark integration diagram with the
> > > missing parts highlighted. If you think we can move forward in this
> > > direction, please let me know - I’d be happy to contribute.
> > >
> > >     --
> > >     Sincerely,
> > >     Geser Dugarov
> > >
> >
>
