Hi Sudha,

Thank you for your detailed response. Your input is invaluable as we plan
support for a Hudi catalog.

Our initial step will be to create a read-only Hudi catalog in Gravitino
that integrates with the Hive Metastore (HMS) catalog Hudi syncs to. This is
currently the fastest option and the most widely adopted, since most engines
that can query Hudi, such as Presto and Trino, rely on HMS.

Once the Hudi community supports the Spark DataSource V2 interface, we'll
focus on adding Hudi support to Gravitino's Spark plugin. Because the plugin
is already built on the V2 interface, this should significantly reduce
development costs.

I appreciate your suggestion of a Gravitino sync tool implementation and
will look into it.

Thank you again for your reply; I look forward to further communication and
collaboration as our integration plan progresses!

Best regards,
Minghuang Li

On 2024/09/06 05:21:08 Bhavani Sudha wrote:
> Hi, thanks for starting this thread. Gravitino seems very useful and
> valuable to the community. We’d love to engage and get Hudi supported out
> there as well. I have answered some of your questions inline.
> 
> >>    1. Hudi does not currently offer a unified catalog interface
> specification (for instance, a unified interface for Table metadata. The
> existing HoodieTable seems designed for table data read/write, not
> metadata).
> 
> This is a little broad, so I’ll overshare. A Hudi table broadly tracks
> metadata in the timeline (Avro schemas, versioned, maintained inside
> hudi-common) and in the metadata table (described in some detail in the
> tech specs, but the details are in the implementation classes). IIUC
> Gravitino is also Java based? If so, the HoodieTableMetaClient and
> HoodieMetadata classes can give you ways to read metadata. Hudi stores
> metadata (file listings, column stats, indices) as another table,
> written/read through HoodieTable with a different SSTable-like file
> format.
> 
> >> 2. Hudi provides various sync tools that can sync metadata to an
> external catalog after a data write. Although these tools implement the
> HoodieMetaSyncOperations interface, that interface does not offer Hudi
> database and table abstractions, and it seems unable to guarantee
> consistency (e.g., a data write succeeds but the metadata sync fails).
> 
> We think of Hudi as an external storage system built on cloud storage. So,
> we separate writing transactionally (achieved via distributed locks/lock
> providers) from catalogs (which maintain pointers to a table on storage).
> Any read through the catalog ultimately reads the metadata on storage (the
> source of truth), so this is not a practical issue. For example, the
> catalog can hand an older commit point to the query engine, and the
> engine can issue a time-travel query. This discussion is orthogonal to
> consistency. On the other hand, like you pointed out, Hudi supports
> syncing to multiple catalogs, quite simply because users routinely want a
> single physical Hudi table shared across multiple engines, each with its
> own catalog. So ultimately, even if we picked one catalog and wrote
> through it for all writes from one engine (say AWS Glue), we need a
> catalog sync mechanism for another engine (say Snowflake) to read that
> table as well.
> 
> >>  Catalog Interface: Is there a stable and unified catalog interface in
> Hudi that we can use to ensure compatibility across different Hudi
> versions? If such an interface exists, could you point me towards some
> documentation or examples? If not, what approach would you recommend for
> unifying access to Hudi metadata?
> 
> A Gravitino sync tool implementation would be a good, easy thing to look
> into. We already have support for, e.g., AWS Glue, BigQuery, DataHub, etc.
> There are also engine-specific (Spark, Flink) catalog implementations, and
> we are happy to take contributions there (examples in the Hudi
> quickstart). I think we need a few more details to understand your needs
> specifically, and then we can lean in and help.
> 
> >> Future Developments: Are there any plans for official catalog management
> features in Hudi? We want to ensure our implementation is future-proof and
> would appreciate any details on upcoming enhancements that might impact
> catalog management.
> 
> There is some interest in investing more in the Hudi metaserver. That’s
> about all I know at this point. I can try to find out more.
> 
> >> Engine Support: Gravitino supports Spark versions 3.3, 3.4, and 3.5.
> Currently, only the latest version of Hudi (0.15) supports Spark 3.5. I am
> concerned that developing on this version might introduce stability and
> compatibility issues. Additionally, Gravitino’s Spark plugin is based on
> the Spark v2 interface, while Hudi’s Spark support uses the v1 interface.
> I’ve seen plans in the community about supporting Spark v2; could you
> provide a timeline for this? This will also determine how Gravitino’s Spark
> plugin will implement Hudi querying moving forward.
> 
> On Spark v2, we have pushed it back due to performance issues between
> Spark v1 and v2. We guess this will happen in the 1.x release line, Q4
> 2024 or Q1 2025, given it’s not considered a pressing move (my
> interpretation) in recent times.
> 
> Thanks,
> Sudha
> 
> On Sun, Sep 1, 2024 at 8:20 PM Minghuang Li <mcha...@apache.org> wrote:
> 
> > Hi He,
> >
> > Thank you for your reminder.
> >
> > Apache Gravitino is a high-performance, geo-distributed, and federated
> > metadata lake. It manages metadata directly in different sources, types,
> > and regions and provides users with unified metadata access for data and AI
> > assets. It was donated to the ASF by Datastrato[1] in June 2024.
> >
> > Gravitino currently supports metadata management for Apache Hive, Apache
> > Iceberg, Apache Paimon, Apache Doris, Apache Kafka, etc. As the first
> > official version of Gravitino since its donation to the ASF is still
> > being released, its documentation has not been fully migrated to the
> > ASF. More detailed information about Gravitino can be found in the
> > documentation for version 0.5.1 [2].
> >
> > Best regards,
> > Minghuang Li
> >
> > [1] https://datastrato.ai
> > [2] https://datastrato.ai/docs/latest
> >
> > On 2024/08/30 10:08:43 He Qi wrote:
> > > Maybe you can give more background about Gravitino.
> > >
> > > On 2024/08/30 07:50:31 Minghuang Li wrote:
> > > > Hello Hudi Devs,
> > > >
> > > > First and foremost, I would like to express my admiration for the
> > Apache Hudi project. The innovation and robust features you've brought to
> > data lake technology management are truly impressive and are greatly valued
> > by the developer community.
> > > >
> > > > I'm currently integrating Apache Hudi into the Apache Gravitino[1]
> > project to more efficiently manage data lake metadata. We plan to
> > implement a Hudi catalog[2] in Gravitino, and I am reaching out for
> > advice to ensure we align with Hudi's best practices and future
> > direction.
> > > >
> > > > Through my research into the Hudi project, I have noted the current
> > state of metadata management (please correct me if I am wrong):
> > > >
> > > >     1. Hudi does not currently offer a unified catalog interface
> > specification (for instance, a unified interface for Table metadata. The
> > existing HoodieTable seems designed for table data read/write, not
> > metadata).
> > > >     2. Hudi provides various sync tools that can sync metadata to an
> > external catalog after a data write. Although these tools implement the
> > HoodieMetaSyncOperations interface, that interface does not offer Hudi
> > database and table abstractions, and it seems unable to guarantee
> > consistency (e.g., a data write succeeds but the metadata sync fails).
> > > >
> > > > Based on these observations, a couple of things I’m hoping to get your
> > insights on:
> > > >
> > > > Catalog Interface: Is there a stable and unified catalog interface in
> > Hudi that we can use to ensure compatibility across different Hudi
> > versions? If such an interface exists, could you point me towards some
> > documentation or examples? If not, what approach would you recommend for
> > unifying access to Hudi metadata?
> > > >
> > > > Future Developments: Are there any plans for official catalog
> > management features in Hudi? We want to ensure our implementation is
> > future-proof and would appreciate any details on upcoming enhancements that
> > might impact catalog management.
> > > >
> > > > Engine Support: Gravitino supports Spark versions 3.3, 3.4, and 3.5.
> > Currently, only the latest version of Hudi (0.15) supports Spark 3.5. I am
> > concerned that developing on this version might introduce stability and
> > compatibility issues. Additionally, Gravitino's Spark plugin is based on
> > the Spark v2 interface, while Hudi's Spark support uses the v1 interface.
> > I've seen plans in the community about supporting Spark v2; could you
> > provide a timeline for this? This will also determine how Gravitino's Spark
> > plugin will implement Hudi querying moving forward.
> > > >
> > > > I would greatly appreciate any guidance and support the Hudi community
> > can offer. Your insights would be invaluable in ensuring the successful
> > integration of Hudi into our project. Thank you very much for your time and
> > assistance!
> > > >
> > > > Best regards,
> > > > Minghuang Li
> > > >
> > > > [1] https://github.com/apache/gravitino
> > > > [2] https://lists.apache.org/thread/bmz4xsv2ogpccy5wtopyy9hp1cot317b
> > > >
> > > >
> > >
> >
> 
