Re: Polaris persistence refactor POC

Yufei Gu Tue, 18 Feb 2025 09:29:42 -0800

>
> DAO is rather specific to JPA - so it is in practice specific to
> relational databases.


That's not true. DAO could be a pure Java interface without JPA as I
mentioned above. Also my POC and design intentionally avoid any JPA related
APIs. In fact, both implementers in the POC are non-JPA ones.

Yufei


On Tue, Feb 18, 2025 at 5:02 AM Robert Stupp <sn...@snazy.de> wrote:

> Reminder: not all databases, especially NoSQL databases, are usable with
> JPA. DAO is rather specific to JPA - so it is in practice specific to
> relational databases. Further, DAOs, if implemented _like_ JPA, require
> a ton of logic to be implemented.
>
> I strongly prefer simple and database agnostic interfaces and not
> something that only works for relational databases or FDB.
>
> For catalog operations (as proposed in this gdoc comment
>
> https://docs.google.com/document/d/1LlNhEy4cBjjE_um694fcsnizqd3rDm5pbewXkLxvu1o/edit?disco=AAABcK1NcYk
> )
>
> interface CatalogPersistence {
>    CatalogState getCurrentState();
>    R commitTransactional(Updater<R> updater, Class<R> resultClass);
> }
>
> The first function gives you a consistent state (think: snapshot of
> everything) of the catalog as of "now" and the second one allows you to
> update the state.
>
> CatalogState could look like this:
>
> interface CatalogState {
>      Stream<Identifier> listChildren(Identifier parent);
>      Optional<NamespaceObject> getNamespace(Identifier identifier);
>      C getContent(Identifier identifier, Class<C> clazz);
>      Optional<ResolvedCatalogPath<OBJ>> resolveCatalogPath(
>          Identifier identifier, Class<OBJ> expectedLeafType)
> }
>
> `Updater` could a functional interface with a single "RESULT
> update(UpdateState state)" function.
>
> `CatalogUpdateState` could have with function to record intended changes:
>
> interface CatalogUpdateState extends CatalogState {
>      <C> void recordChange(
>          Identifier identifier,
>          C content,
>          Class<C> clazz,
>          Change change);
> }
>
> If `update(CatalogUpdateState)` returns successfully, all changes to the
> catalog are persisted atomically and the result propagated to the
> caller. If `update()` throws, it's an error and the catalog state
> doesn't change.
>
> Combined with a retry-mechanism, `update()` could also retry in case of
> (resolvable) conflicts.
>
>
> For other use cases we can get away with an interface like this:
>
> public interface Persistence {
>     <T>  T fetch(ObjId id, Class<T> clazz);
>      // write w/ immedite consistency guarantee
>      <T> T write(T obj, Class<T> clazz);
>      // write new row if not yet exists
>      <T> T conditionalInsert(T obj, Class<T> clazz);
>      //  update row if "expected state marker" matches
>      <T> T conditionalUpdate(T expected, T update, Class<T> clazz);
>
>      // plus lower-level functions to enable atomic catalog state updates
> }
>
>
> On 18.02.25 07:41, Jean-Baptiste Onofré wrote:
> > Yes I know (remember I proposed to use DAO so I know what it is :)).
> > If it’s clear for everyone, ok for DAO (just wanted to be sure that it’s
> > clear for the implementers).
> >
> > So ok to have CatalogDao for operations and CatalogRecord for carriage
> data
> > (for instance).
> >
> > Thanks
> > Regards
> > JB
> >
> > Le mar. 18 févr. 2025 à 02:11, Yufei Gu <flyrain...@gmail.com> a écrit :
> >
> >> Hi JB,
> >>
> >> DAO is often defined as an interface, used exactly for operations. Why
> do
> >> we need another name for that? See the examples here
> >> <https://www.geeksforgeeks.org/data-access-object-pattern/>, it can be
> >> with
> >> JPA or non-JPA. In our case, it has to be compatible with non-JPA
> >> implementation.
> >>
> >> interface DeveloperDao {
> >>      public List<Developer> getAllDevelopers();
> >>      public Developer getDeveloper(int DeveloperId);
> >>      public void updateDeveloper(Developer Developer);
> >>      public void deleteDeveloper(Developer Developer);
> >> }
> >>
> >> public interface ProductDAO extends JpaRepository<Product, Long> {
> >>      // Additional custom queries or methods can be declared here
> >> }
> >>
> >> Yufei
> >>
> >>
> >> On Mon, Feb 17, 2025 at 10:58 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> >> wrote:
> >>
> >>> Hi Yufei
> >>>
> >>> I agree with you. My only concern with CatalogDAO and NamespaceDAO in
> the
> >>> PoC is that it could not be obvious for implementer.
> >>> Being very abstract force us to clearly decouple service and
> persistence.
> >>> One interface or multiple interfaces is totally fine for me (for the
> >>> storage operations). I just think that we have to be clear in the DAO.
> >>>
> >>> It’s what I proposed and discussed with Jack to also get his expertise
> >> for
> >>> DynamoDB.
> >>>
> >>> So, in the PoC, I would:
> >>> 1. Clearly define DAO, with enforcement on the data transport (like
> using
> >>> Java records). Maybe it’s more a naming clarification. We can have
> >>> CatalogRecord (Java records) and CatalogStoreOperations (instead of
> >>> CatalogDAO and using CatalogRecord)
> >>> 2. Split store operations for a backend in separate modules (one for
> FDB,
> >>> one for Postgre)
> >>>
> >>> Thoughts ?
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> Le lun. 17 févr. 2025 à 18:36, Yufei Gu <flyrain...@gmail.com> a
> écrit :
> >>>
> >>>> Hi JB,
> >>>>
> >>>> The Java objects you mentioned should either belong to the
> >>> service/business
> >>>> layer or be unnecessary. Apart from the business entities in Polaris,
> >>> each
> >>>> backend will maintain its own set of persistence-layer entities. The
> >>>> backend implementations (whether Postgres, FDB, or MongoDB) will be
> >>>> responsible for mapping between these persistence-layer entities and
> >> the
> >>>> business entities. Currently, the business entity set is tightly
> >> coupled
> >>>> with FDB, which is not ideal for supporting multiple backends. While
> we
> >>> can
> >>>> refactor them, creating an entirely new set of business entities is
> not
> >>>> required.
> >>>>
> >>>> Regarding the PolarisStore you mentioned, it functions similarly to
> the
> >>> DAO
> >>>> layer in the POC. It'd make more sense to split its responsibilities
> >> into
> >>>> separate interfaces rather than combining them into a single one. A
> >>> single
> >>>> interface violates the Single Responsibility Principle and may lead to
> >>>> maintenance challenges as we introduce new data access patterns. For
> >>>> instance, we would be forced to continuously add new methods to
> >>> accommodate
> >>>> additional entities like Policy and Volume, or to support new query
> >>>> patterns for existing entities.
> >>>>
> >>>> As for choosing between record or immutable classes, either is fine.
> >> But
> >>>> I'd prefer to use any native Java feature if possible.
> >>>>
> >>>>
> >>>>
> >>>> Yufei
> >>>>
> >>>>
> >>>> On Mon, Feb 17, 2025 at 9:10 AM Michael Collado <
> >> collado.m...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> I prefer native Java constructs over third party libraries and
> >> compiler
> >>>>> plugins, whenever possible. I’m a fan of Java records.
> >>>>>
> >>>>> Mike
> >>>>>
> >>>>> On Mon, Feb 17, 2025 at 6:55 AM Jean-Baptiste Onofré <
> >> j...@nanthrax.net>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Robert,
> >>>>>>
> >>>>>> Yes, my intention about Java records (e.g.
> >>>>>> https://openjdk.org/jeps/395) is to leverage:
> >>>>>> - Devise an object-oriented construct that expresses a simple
> >>>>>> aggregation of values.
> >>>>>> - Focus on modeling immutable data rather than extensible behavior.
> >>>>>> - Automatically implement data-driven methods such as equals and
> >>>>> accessors.
> >>>>>> - Preserve long-standing Java principles such as nominal typing and
> >>>>>> migration compatibility.
> >>>>>> - Provide inner builder, optionally with technical validation
> >>>>>> (Objects.NotNull, etc), for instance:
> >>>>>>
> >>>>>> public record CatalogDAO(String id, String name, ...) {
> >>>>>>
> >>>>>>    public static final class Builder {
> >>>>>>        String id;
> >>>>>>        String name;
> >>>>>>        public Builder() {}
> >>>>>>        public Builder id(String id) {
> >>>>>>         this.id = id;
> >>>>>>         return this;
> >>>>>>       }
> >>>>>>       public Builder name(String name) {
> >>>>>>         this.name = name;
> >>>>>>         return this;
> >>>>>>       }
> >>>>>>       public CatalogDAO build() {
> >>>>>>         return new CatalogDAO(id, name);
> >>>>>>       }
> >>>>>>    }
> >>>>>>
> >>>>>> }
> >>>>>>   NB: that's just a "raw" example :)
> >>>>>>
> >>>>>> That could help us for the "DAO" layer, with clean isolation and
> >> data
> >>>>>> "transport". The "conversion" from Polaris Entity to DAO could be
> >>>>>> performed by the intermediate logic layer (e.g.
> >>>>>> PolarisMetaStoreManager), the pure persistence layer can deal with
> >>> DAO
> >>>>>> only (e.g. PolarisStore).
> >>>>>>
> >>>>>> Good point about immutable collections. In some projects, I "force"
> >>>>>> the copy of a collection to ensure immutability, something like:
> >>>>>>
> >>>>>> record NamespaceDAO(Set<String> children) {
> >>>>>>      public NamespaceDAO {
> >>>>>>          children = Set.copyOf(children);
> >>>>>>      }
> >>>>>> }
> >>>>>>
> >>>>>> OK, that's not super elegant :) but it does the trick ;)
> >>>>>>
> >>>>>> Regards
> >>>>>> JB
> >>>>>>
> >>>>>> On Mon, Feb 17, 2025 at 5:28 PM Robert Stupp <sn...@snazy.de>
> >> wrote:
> >>>>>>> Agree with JB, except I think "immutables" serves the need much
> >>>> better.
> >>>>>>> Java records are okay, but do not ensure that all fields are
> >>>> immutable
> >>>>> -
> >>>>>>> especially collections. The other pro of immutables is that there
> >>> are
> >>>>>>> proper, descriptive builders - hence no need to constructors
> >> with a
> >>>>>>> gazillion parameters.
> >>>>>>>
> >>>>>>> On 17.02.25 11:42, Jean-Baptiste Onofré wrote:
> >>>>>>>> Hi Yufei
> >>>>>>>>
> >>>>>>>> I left comments in the PR.
> >>>>>>>>
> >>>>>>>> Thanks for that. That's an interesting approach but slightly
> >>>>> different
> >>>>>>>> to what I had in mind.
> >>>>>>>>
> >>>>>>>> My proposal was:
> >>>>>>>>
> >>>>>>>> 1. The DAOs are Java records clearly descriptive, without
> >>> "storage
> >>>>>> operations".
> >>>>>>>> For instance, we can imagine:
> >>>>>>>>
> >>>>>>>> public record CatalogDao(String id, String name, String
> >> location,
> >>>>> ...)
> >>>>>> {
> >>>>>>>> ...
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> public record NamespaceDao(String id, String name, String
> >> parent,
> >>>>> ...)
> >>>>>> {}
> >>>>>>>> public record TableDao(String id, String name, ...) {}
> >>>>>>>>
> >>>>>>>> etc
> >>>>>>>>
> >>>>>>>> The advantages are:
> >>>>>>>> - very simple and descriptive
> >>>>>>>> - common to any backend implementation
> >>>>>>>>
> >>>>>>>> 2. The PolarisStore defines the DAO backend operations and
> >>> mapping
> >>>> to
> >>>>>>>> "internal" Polaris entities. Each persistence adapter
> >> implements
> >>>> the
> >>>>>>>> PolarisStore.
> >>>>>>>> For the persistence adapter implementer, he just has to
> >> implement
> >>>> the
> >>>>>>>> DAO persistence (no need to understand other Polaris parts).
> >>>>>>>> Each persistence adapter is in its own module (for isolation
> >> and
> >>>>>> dependencies).
> >>>>>>>> During the Polaris Persistence Meeting last week, we already
> >> got
> >>>>>>>> consensus on the approach proposed by Dennis and I. I propose
> >> to
> >>>> do a
> >>>>>>>> new sync/review with Dennis.
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>> JB
> >>>>>>>>
> >>>>>>>> On Mon, Feb 17, 2025 at 9:40 AM Yufei Gu <flyrain...@gmail.com
> >>>>> wrote:
> >>>>>>>>> Hi Folks:
> >>>>>>>>>
> >>>>>>>>> Here is a POC for persistence layer refactor. Please check it
> >>> out
> >>>>> and
> >>>>>> let
> >>>>>>>>> me know what you think. Please note this is a POC, we still
> >>> need a
> >>>>>> lot of
> >>>>>>>>> effort to complete the refactor.
> >>>>>>>>>
> >>>>>>>>> PR: https://github.com/apache/polaris/pull/1011.
> >>>>>>>>> Design doc:
> >>>>>>>>>
> >>
> https://docs.google.com/document/d/1Vuhw5b9-6KAol2vU3HUs9FJwcgWtiVVXMYhLtGmz53s/edit?tab=t.0
> >>>>>>>>> Experiment:
> >>>>>>>>>
> >>>>>>>>>      - Added a DAO layer for the business entity
> >> namespace(except
> >>>> the
> >>>>>> read).
> >>>>>>>>>      - Integrated with existing DAO components
> >>>>>> (PolarisMetaStoreManager and
> >>>>>>>>>      PolarisMetaStoreSession).
> >>>>>>>>>      - All tests passed successfully, including a manual local
> >>> run
> >>>>>> with Spark
> >>>>>>>>>      sql.
> >>>>>>>>>
> >>>>>>>>> Benefits:
> >>>>>>>>>
> >>>>>>>>>      - Compatible with the existing backend(FDB), as we hide it
> >>>>> behind
> >>>>>> the
> >>>>>>>>>      new DAO.
> >>>>>>>>>      - Adding new backends(Postgres/MongoDB) is much easier
> >> now,
> >>>> esp
> >>>>>> for
> >>>>>>>>>      Postgres, we could be able to use a similar model as
> >> Iceberg
> >>>>> Jdbc
> >>>>>> catalog.
> >>>>>>>>>      - Allows gradual refactoring to remove old DAO
> >> dependencies
> >>>>>>>>>      (PolarisMetaStoreManager and PolarisMetaStoreSession).
> >>>>>>>>>      - Enables parallel development of new backend
> >>> implementations.
> >>>>>>>>> Next Steps:
> >>>>>>>>>
> >>>>>>>>>      - Define business entities one by one to decouple them
> >> from
> >>>> FDB.
> >>>>>>>>>      - Create DAO interfaces for each entity to standardize
> >>>>> operations
> >>>>>> (e.g.,
> >>>>>>>>>      CRUD, ID generation).
> >>>>>>>>>      - Expand DAO implementations to support additional
> >> backends
> >>>> over
> >>>>>> time.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Yufei
> >>>>>>> --
> >>>>>>> Robert Stupp
> >>>>>>> @snazy
> >>>>>>>
> --
> Robert Stupp
> @snazy
>
>

Re: Polaris persistence refactor POC

Reply via email to