Reminder: not all databases, especially NoSQL databases, are usable with
JPA. DAO is rather specific to JPA - so it is in practice specific to
relational databases. Further, DAOs, if implemented _like_ JPA, require
a ton of logic to be implemented.
I strongly prefer simple and database agnostic interfaces and not
something that only works for relational databases or FDB.
For catalog operations (as proposed in this gdoc comment
https://docs.google.com/document/d/1LlNhEy4cBjjE_um694fcsnizqd3rDm5pbewXkLxvu1o/edit?disco=AAABcK1NcYk):
interface CatalogPersistence {
  CatalogState getCurrentState();
  <R> R commitTransactional(Updater<R> updater, Class<R> resultClass);
}
The first function gives you a consistent state (think: snapshot of
everything) of the catalog as of "now" and the second one allows you to
update the state.
CatalogState could look like this:
interface CatalogState {
  Stream<Identifier> listChildren(Identifier parent);
  Optional<NamespaceObject> getNamespace(Identifier identifier);
  <C> C getContent(Identifier identifier, Class<C> clazz);
  <OBJ> Optional<ResolvedCatalogPath<OBJ>> resolveCatalogPath(
      Identifier identifier, Class<OBJ> expectedLeafType);
}
`Updater` could be a functional interface with a single "RESULT
update(CatalogUpdateState state)" function.
`CatalogUpdateState` could have a function to record intended changes:
interface CatalogUpdateState extends CatalogState {
  <C> void recordChange(
      Identifier identifier,
      C content,
      Class<C> clazz,
      Change change);
}
If `update(CatalogUpdateState)` returns successfully, all changes to the
catalog are persisted atomically and the result is propagated to the
caller. If `update()` throws, it's an error and the catalog state
doesn't change.
Combined with a retry-mechanism, `update()` could also retry in case of
(resolvable) conflicts.
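To make the contract concrete, here is a minimal, self-contained in-memory sketch (all names are simplified stand-ins for the interfaces above; real identifiers, content classes, and conflict detection are elided). It models the snapshot as an immutable map and retries the updater on write conflicts via compare-and-set:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: the catalog state is an immutable map snapshot,
// and commitTransactional() retries the updater on (resolvable) conflicts.
public class CatalogSketch {

  interface Updater<R> {
    R update(Map<String, String> state); // stands in for CatalogUpdateState
  }

  static final class CatalogPersistence {
    private final AtomicReference<Map<String, String>> current =
        new AtomicReference<>(Map.of());

    Map<String, String> getCurrentState() {
      return current.get(); // consistent snapshot "as of now"
    }

    <R> R commitTransactional(Updater<R> updater) {
      while (true) {
        Map<String, String> before = current.get();
        Map<String, String> working = new HashMap<>(before);
        R result = updater.update(working); // may throw -> no state change
        // atomic swap; if someone else committed in between, retry
        if (current.compareAndSet(before, Map.copyOf(working))) {
          return result;
        }
      }
    }
  }

  public static void main(String[] args) {
    CatalogPersistence p = new CatalogPersistence();
    String r = p.commitTransactional(state -> {
      state.put("ns1.table1", "metadata-v1"); // recordChange() equivalent
      return "committed";
    });
    System.out.println(r + " " + p.getCurrentState());
  }
}
```

If the updater throws, the compare-and-set never happens, so the published snapshot is untouched, matching the "no change on error" semantics above.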
For other use cases we can get away with an interface like this:
public interface Persistence {
  <T> T fetch(ObjId id, Class<T> clazz);
  // write w/ immediate consistency guarantee
  <T> T write(T obj, Class<T> clazz);
  // write new row only if it does not yet exist
  <T> T conditionalInsert(T obj, Class<T> clazz);
  // update row if "expected state marker" matches
  <T> T conditionalUpdate(T expected, T update, Class<T> clazz);
  // plus lower-level functions to enable atomic catalog state updates
}
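A hypothetical in-memory sketch of those semantics, storing rows as plain strings to keep the example small (a real backend would compare an "expected state marker" such as a version column, not the whole row):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for the generic Persistence interface above.
public class PersistenceSketch {

  interface Persistence {
    String fetch(String id);
    void write(String id, String obj);                // unconditional upsert
    boolean conditionalInsert(String id, String obj); // only if absent
    boolean conditionalUpdate(String id, String expected, String update);
  }

  static final class InMemoryPersistence implements Persistence {
    private final Map<String, String> rows = new ConcurrentHashMap<>();

    public String fetch(String id) { return rows.get(id); }

    public void write(String id, String obj) { rows.put(id, obj); }

    public boolean conditionalInsert(String id, String obj) {
      // succeeds only for a new row
      return rows.putIfAbsent(id, obj) == null;
    }

    public boolean conditionalUpdate(String id, String expected, String update) {
      // succeeds only if the stored row still matches the expected state
      return rows.replace(id, expected, update);
    }
  }

  public static void main(String[] args) {
    Persistence p = new InMemoryPersistence();
    System.out.println(p.conditionalInsert("obj1", "v1"));       // true: new row
    System.out.println(p.conditionalInsert("obj1", "v2"));       // false: exists
    System.out.println(p.conditionalUpdate("obj1", "v1", "v2")); // true
    System.out.println(p.conditionalUpdate("obj1", "v1", "v3")); // false: stale
    System.out.println(p.fetch("obj1"));                         // v2
  }
}
```

Both conditional operations map naturally to relational backends (INSERT ... ON CONFLICT, UPDATE ... WHERE version = ?) as well as to key-value stores with compare-and-set.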
On 18.02.25 07:41, Jean-Baptiste Onofré wrote:
Yes I know (remember I proposed to use DAO so I know what it is :)).
If it’s clear for everyone, ok for DAO (just wanted to be sure that it’s
clear for the implementers).
So ok to have CatalogDao for operations and CatalogRecord for carrying
data (for instance).
Thanks
Regards
JB
On Tue, Feb 18, 2025 at 02:11, Yufei Gu <flyrain...@gmail.com> wrote:
Hi JB,
DAO is often defined as an interface, used exactly for operations. Why do
we need another name for that? See the examples here
<https://www.geeksforgeeks.org/data-access-object-pattern/>; it can be
used with JPA or without. In our case, it has to be compatible with
non-JPA implementations.
interface DeveloperDao {
  public List<Developer> getAllDevelopers();
  public Developer getDeveloper(int developerId);
  public void updateDeveloper(Developer developer);
  public void deleteDeveloper(Developer developer);
}
public interface ProductDAO extends JpaRepository<Product, Long> {
  // Additional custom queries or methods can be declared here
}
Yufei
On Mon, Feb 17, 2025 at 10:58 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
Hi Yufei
I agree with you. My only concern with CatalogDAO and NamespaceDAO in the
PoC is that it might not be obvious for implementers.
Being very abstract forces us to clearly decouple service and persistence.
One interface or multiple interfaces is totally fine for me (for the
storage operations). I just think that we have to be clear in the DAO.
It’s what I proposed and discussed with Jack to also get his expertise
for DynamoDB.
So, in the PoC, I would:
1. Clearly define DAO, with enforcement on the data transport (like using
Java records). Maybe it’s more a naming clarification. We can have
CatalogRecord (Java records) and CatalogStoreOperations (instead of
CatalogDAO and using CatalogRecord)
2. Split store operations for a backend in separate modules (one for FDB,
one for Postgres)
Thoughts ?
Regards
JB
On Mon, Feb 17, 2025 at 18:36, Yufei Gu <flyrain...@gmail.com> wrote:
Hi JB,
The Java objects you mentioned should either belong to the
service/business layer or be unnecessary. Apart from the business
entities in Polaris, each backend will maintain its own set of
persistence-layer entities. The backend implementations (whether
Postgres, FDB, or MongoDB) will be responsible for mapping between these
persistence-layer entities and the business entities. Currently, the
business entity set is tightly coupled with FDB, which is not ideal for
supporting multiple backends. While we can refactor them, creating an
entirely new set of business entities is not required.
Regarding the PolarisStore you mentioned, it functions similarly to the
DAO layer in the POC. It'd make more sense to split its responsibilities
into separate interfaces rather than combining them into a single one. A
single interface violates the Single Responsibility Principle and may
lead to maintenance challenges as we introduce new data access patterns.
For instance, we would be forced to continuously add new methods to
accommodate additional entities like Policy and Volume, or to support
new query patterns for existing entities.
As for choosing between records or immutable classes, either is fine.
But I'd prefer to use any native Java feature if possible.
Yufei
On Mon, Feb 17, 2025 at 9:10 AM Michael Collado <collado.m...@gmail.com> wrote:
I prefer native Java constructs over third party libraries and compiler
plugins, whenever possible. I’m a fan of Java records.
Mike
On Mon, Feb 17, 2025 at 6:55 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
Hi Robert,
Yes, my intention about Java records (e.g. https://openjdk.org/jeps/395)
is to leverage:
- Devise an object-oriented construct that expresses a simple
aggregation of values.
- Focus on modeling immutable data rather than extensible behavior.
- Automatically implement data-driven methods such as equals and
accessors.
- Preserve long-standing Java principles such as nominal typing and
migration compatibility.
- Provide an inner builder, optionally with technical validation
(Objects.requireNonNull, etc), for instance:
public record CatalogDAO(String id, String name, ...) {
  public static final class Builder {
    String id;
    String name;

    public Builder() {}

    public Builder id(String id) {
      this.id = id;
      return this;
    }

    public Builder name(String name) {
      this.name = name;
      return this;
    }

    public CatalogDAO build() {
      return new CatalogDAO(id, name);
    }
  }
}
NB: that's just a "raw" example :)
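A self-contained, compilable variant of that raw sketch (two fields only, wrapped in a made-up BuilderDemo class, with the Objects.requireNonNull validation mentioned in the bullets above):

```java
import java.util.Objects;

public class BuilderDemo {

  public record CatalogDAO(String id, String name) {
    public CatalogDAO { // compact constructor: technical validation
      Objects.requireNonNull(id, "id");
      Objects.requireNonNull(name, "name");
    }

    public static final class Builder {
      private String id;
      private String name;

      public Builder id(String id) { this.id = id; return this; }

      public Builder name(String name) { this.name = name; return this; }

      public CatalogDAO build() { return new CatalogDAO(id, name); }
    }
  }

  public static void main(String[] args) {
    CatalogDAO dao = new CatalogDAO.Builder().id("c1").name("catalog1").build();
    System.out.println(dao); // CatalogDAO[id=c1, name=catalog1]
  }
}
```

Since records generate equals, hashCode, and toString automatically, only the builder and validation need to be written by hand.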
That could help us for the "DAO" layer, with clean isolation and data
"transport". The "conversion" from Polaris Entity to DAO could be
performed by the intermediate logic layer (e.g. PolarisMetaStoreManager);
the pure persistence layer can deal with DAO only (e.g. PolarisStore).
Good point about immutable collections. In some projects, I "force"
the copy of a collection to ensure immutability, something like:
record NamespaceDAO(Set<String> children) {
  public NamespaceDAO {
    children = Set.copyOf(children);
  }
}
OK, that's not super elegant :) but it does the trick ;)
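A tiny runnable check of that trick (the enclosing Demo class is made up): mutating the caller's set after construction does not leak into the record, because the compact constructor took an immutable copy.

```java
import java.util.HashSet;
import java.util.Set;

public class Demo {
  // same defensive-copy trick as above: the compact constructor
  // replaces the incoming collection with an immutable copy
  record NamespaceDAO(Set<String> children) {
    NamespaceDAO {
      children = Set.copyOf(children);
    }
  }

  public static void main(String[] args) {
    Set<String> input = new HashSet<>();
    input.add("a");
    NamespaceDAO dao = new NamespaceDAO(input);
    input.add("b");                     // mutating the caller's set afterwards...
    System.out.println(dao.children()); // ...does not affect the record: [a]
  }
}
```

Note that Set.copyOf also rejects null elements and returns an unmodifiable set, so callers cannot mutate dao.children() either.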
Regards
JB
On Mon, Feb 17, 2025 at 5:28 PM Robert Stupp <sn...@snazy.de> wrote:
Agree with JB, except I think "immutables" serves the need much better.
Java records are okay, but do not ensure that all fields are immutable -
especially collections. The other pro of immutables is that there are
proper, descriptive builders - hence no need for constructors with a
gazillion parameters.
On 17.02.25 11:42, Jean-Baptiste Onofré wrote:
Hi Yufei
I left comments in the PR.
Thanks for that. That's an interesting approach but slightly different
to what I had in mind.
My proposal was:
1. The DAOs are Java records, clearly descriptive, without "storage
operations".
For instance, we can imagine:
public record CatalogDao(String id, String name, String location, ...) {
  ...
}
public record NamespaceDao(String id, String name, String parent, ...) {}
public record TableDao(String id, String name, ...) {}
etc.
The advantages are:
- very simple and descriptive
- common to any backend implementation
2. The PolarisStore defines the DAO backend operations and mapping to
"internal" Polaris entities. Each persistence adapter implements the
PolarisStore.
For the persistence adapter implementer, he just has to implement the
DAO persistence (no need to understand other Polaris parts).
Each persistence adapter is in its own module (for isolation and
dependencies).
During the Polaris Persistence Meeting last week, we already got
consensus on the approach proposed by Dennis and me. I propose to do a
new sync/review with Dennis.
Regards
JB
On Mon, Feb 17, 2025 at 9:40 AM Yufei Gu <flyrain...@gmail.com> wrote:
Hi Folks:
Here is a POC for the persistence layer refactor. Please check it out
and let me know what you think. Please note this is a POC; we still need
a lot of effort to complete the refactor.
PR: https://github.com/apache/polaris/pull/1011.
Design doc:
https://docs.google.com/document/d/1Vuhw5b9-6KAol2vU3HUs9FJwcgWtiVVXMYhLtGmz53s/edit?tab=t.0
Experiment:
- Added a DAO layer for the business entity namespace (except the read).
- Integrated with existing DAO components (PolarisMetaStoreManager and
PolarisMetaStoreSession).
- All tests passed successfully, including a manual local run with
Spark SQL.
Benefits:
- Compatible with the existing backend (FDB), as we hide it behind the
new DAO.
- Adding new backends (Postgres/MongoDB) is much easier now, especially
for Postgres, where we could use a similar model to the Iceberg JDBC
catalog.
- Allows gradual refactoring to remove old DAO dependencies
(PolarisMetaStoreManager and PolarisMetaStoreSession).
- Enables parallel development of new backend implementations.
Next Steps:
- Define business entities one by one to decouple them from FDB.
- Create DAO interfaces for each entity to standardize operations
(e.g., CRUD, ID generation).
- Expand DAO implementations to support additional backends over time.
Yufei
--
Robert Stupp
@snazy