adam-christian-software commented on code in PR #1189:
URL: https://github.com/apache/polaris/pull/1189#discussion_r2201442502


##########
persistence/nosql/persistence/README.md:
##########
@@ -0,0 +1,224 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+ 
+   http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Database agnostic persistence framework
+
+This persistence API and its functional implementations are based on the assumption that all databases targeted as backing stores for Polaris support "compare and swap" (CAS) operations on a single row. These CAS operations are the only requirement.
+
+Some databases enforce hard size limits: DynamoDB has a hard 400kB row size limit, and MariaDB/MySQL has a default 512kB packet size limit. Other databases have row-size recommendations in a similar range. Polaris persistence respects those limits and recommendations by applying a common hard limit of 350kB.
+
+Objects exposed via the `Persistence` interface are typed Java objects that must be immutable and serializable using Jackson. Each type is described via an implementation of the `ObjType` interface, carrying a name, which must be unique within Polaris, and a target Java type - the Jackson-serializable Java type. Types are registered via the Java service-loader mechanism using `ObjType`. The actual Java target types must extend the `Obj` interface. The (logical) key of each `Obj` is a composite of the `ObjType.id()` and a `long` ID (64-bit signed integer), combined in the `ObjId` composite type.
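
The type/key model above can be pictured with a small sketch. The interfaces below are simplified stand-ins for the real Polaris `ObjType`/`Obj`/`ObjId` types (the actual interfaces carry more methods, and `keyOf` is a hypothetical helper):

```java
// Simplified stand-ins for the ObjType/Obj/ObjId contract described above;
// the real Polaris interfaces differ in detail.
interface ObjType {
  String id();            // type name, must be unique within Polaris
  Class<?> targetClass(); // the Jackson-serializable Java type
}

interface Obj {
  ObjType type();
  long id(); // 64-bit signed ID
}

// Composite logical key: the type name plus the long ID.
record ObjId(String typeId, long id) {
  // Hypothetical helper building the logical key of an Obj.
  static ObjId keyOf(Obj obj) {
    return new ObjId(obj.type().id(), obj.id());
  }
}
```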
+
+The "primary key" of each object in a database is always _realm-ID + object-ID_, where the realm-ID is a string and the object-ID is a 64-bit integer. This allows, but does not enforce, storing multiple realms in one backend database.
+
+Data of each Polaris realm (think: _tenant_) is isolated using the realm's string ID. The base `Persistence` API interface is always scoped to exactly one realm ID.
+
+## Supporting more databases
+
+The code to support a particular database is isolated in a project, for example `polaris-persistence-nosql-inmemory` and `polaris-persistence-nosql-mongodb`.
+
+When adding another database, it must also be wired up to Quarkus in `polaris-persistence-nosql-cdi-quarkus`, preferably using Quarkus extensions, added to the `polaris-persistence-correctness` tests, and made available in `polaris-persistence-nosql-benchmark` for low-level benchmarks.
+
+## Named pointers
+
+Polaris represents a catalog for data lakehouses, which means that the information of and for catalog entities like Iceberg tables, views and namespaces must be consistent, even if multiple catalog entities are changed in a single atomic operation.
+
+Polaris leverages a concept called "named pointers". The state of the whole catalog is referenced via the so-called HEAD (think: Git HEAD), which _points to_ all catalog entities. This state is persisted as an `Obj` with an index of the catalog entities; the ID of that "current catalog state `Obj`" is maintained in one named pointer.
+
+Named pointers are also used for purposes other than catalog entities, for example to maintain realms or configurations.
+
+## Committing changes
+
+Changes are persisted using a commit mechanism, providing atomic changes across multiple entities against one named pointer. The implementation ensures that even high-frequency concurrent changes neither cause clients to fail nor cause timeouts. The behavior and achievable throughput depend on the database being used; some databases perform _much_ better than others.
+
+A use-case agnostic "committer" abstraction exists to ease implementing commit operations. For catalog operations there is a more specialized abstraction.
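
A minimal sketch of the commit idea, using an in-memory `AtomicReference` as a stand-in for the database's single-row CAS (the real committer adds retries with backoff, conflict detection, and durable storage):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Simplified sketch of committing against a named pointer via CAS.
// The AtomicReference stands in for the database's compare-and-swap
// on the named-pointer row; real backends persist the pointer durably.
final class NamedPointer<T> {
  private final AtomicReference<T> current = new AtomicReference<>();

  T commit(UnaryOperator<T> change) {
    while (true) {
      T expected = current.get();          // read current HEAD
      T updated = change.apply(expected);  // build the new state
      if (current.compareAndSet(expected, updated)) {
        return updated;                    // CAS succeeded
      }
      // CAS failed: another writer won; re-read and retry.
    }
  }
}
```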
+
+## `long` IDs
+
+Polaris persistence uses so-called snowflake IDs: 64-bit integers composed of a timestamp, a node ID, and a sequence number. The epoch of these timestamps is 2025-03-01T00:00:00.0 GMT. Timestamps occupy 41 bits at millisecond precision, which lasts for about 69 years. Node IDs are 10 bits, which allows 1024 concurrently active "JVMs running Polaris". 12 bits are used for the sequence number, which allows each node to generate 4096 IDs per millisecond. One bit is reserved for future use.
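
The 41/10/12(+1 reserved) split can be illustrated with plain bit packing; the field order below (timestamp in the high bits, then node ID, then sequence) is an assumption for illustration, not necessarily the exact layout used by Polaris:

```java
// Illustrative bit packing for the 41-bit timestamp / 10-bit node ID /
// 12-bit sequence layout described above. With a 41-bit timestamp the
// topmost (sign) bit stays 0 - that is the reserved bit.
final class SnowflakeId {
  static final int NODE_BITS = 10, SEQ_BITS = 12;

  static long pack(long millisSinceEpoch, int nodeId, int sequence) {
    return (millisSinceEpoch << (NODE_BITS + SEQ_BITS))
        | ((long) nodeId << SEQ_BITS)
        | sequence;
  }

  static long timestamp(long id) { return id >>> (NODE_BITS + SEQ_BITS); }
  static int nodeId(long id) { return (int) ((id >>> SEQ_BITS) & ((1 << NODE_BITS) - 1)); }
  static int sequence(long id) { return (int) (id & ((1 << SEQ_BITS) - 1)); }
}
```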
+
+Node IDs are leased by every "JVM running Polaris" for a period of time. The ID generator implementation guarantees that no IDs are generated for a timestamp that exceeds the lease time. Leases can be extended. The implementation leverages atomic database operations (CAS) for leasing.
+
+ID generators must not use timestamps before or after the lease period, nor may they re-use an older timestamp. This requirement is satisfied using a monotonic clock implementation.
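
A monotonic millisecond clock can be sketched like this (an illustrative implementation, not the one used by Polaris):

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative monotonic millisecond clock: never returns a smaller value
// than a previously returned one, even if the wall clock jumps backwards.
final class MonotonicClock {
  private final AtomicLong last = new AtomicLong(Long.MIN_VALUE);

  long currentTimeMillis() {
    long wall = System.currentTimeMillis();
    // Keep the maximum of the wall clock and the last returned value.
    return last.updateAndGet(prev -> Math.max(prev, wall));
  }
}
```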
+
+## Caching
+
+Since most `Obj`s are by default assumed to be immutable, caching is very straightforward and does not require any coordination, which simplifies the design and implementation quite a bit.
+
+## Strong vs eventual consistency
+
+Polaris persistence offers two ways to persist `Obj`s: strongly consistent and eventually consistent. The former is slower than the latter.
+
+Since Polaris persistence respects the hard size limitations mentioned above, it cannot persist the serialized representation of objects that exceed those limits in a single database row. Some objects, however, legitimately exceed those limits. Polaris persistence allows such "big object serializations" and persists them across multiple database rows, with

Review Comment:
   Do we have any typical use cases that we have seen that cause these "big 
object serializations"?



##########
persistence/nosql/persistence/README.md:
##########
@@ -0,0 +1,224 @@
+## Strong vs eventual consistency
+
+Polaris persistence offers two ways to persist `Obj`s: strongly consistent and eventually consistent. The former is slower than the latter.
+
+Since Polaris persistence respects the hard size limitations mentioned above, it cannot persist the serialized representation of objects that exceed those limits in a single database row. Some objects, however, legitimately exceed those limits. Polaris persistence allows such "big object serializations" and persists them across multiple database rows, with the restriction that this is only supported for eventually consistent writes. The serialized representation for strongly consistent writes must always fit within the hard limit.
+
+## Indexes
+
+The state of a data lakehouse catalog can contain many thousands, potentially a few hundred thousand, tables/views/namespaces. Even a space-efficient serialization of an index for that many entries exceeds the "common hard 350kB limit". New changes end up in the index that is "embedded" in the "current catalog state `Obj`". When the size limit of this "embedded" index is approached, the index is spilled out to separate rows in the database. The implementation is built to split and combine these spilled index parts as needed.
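
The spill-over idea can be sketched as splitting an oversized embedded index into parts; for illustration the "size" below is simply an entry count, whereas the real limit applies to the serialized representation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified sketch of spilling an embedded index into separately stored
// parts once it approaches a size limit. Real Polaris indexes are
// binary-serialized; the entry count here is purely illustrative.
final class SpillingIndex {
  static List<TreeMap<String, Long>> split(TreeMap<String, Long> embedded, int maxEntriesPerPart) {
    List<TreeMap<String, Long>> parts = new ArrayList<>();
    TreeMap<String, Long> part = new TreeMap<>();
    for (Map.Entry<String, Long> e : embedded.entrySet()) {
      if (part.size() == maxEntriesPerPart) {
        parts.add(part);         // this part is "full", spill it out
        part = new TreeMap<>();
      }
      part.put(e.getKey(), e.getValue());
    }
    if (!part.isEmpty()) parts.add(part);
    return parts;
  }
}
```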
+
+## Change log / events / notifications
+
+The commit mechanism described above builds a commit log. All changes can be inspected via that log in exactly the order in which they happened (think: `git log`). Since the log of changes is already present, it is possible to retrieve the changes since some point in time or commit log ID. This allows clients to receive all changes that happened since the last known commit ID, offering a mechanism to poll for changes. Since the necessary `Obj`s are immutable, such change-log requests likely hit already-cached data rather than the database.
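
Polling for changes then boils down to walking the commit log backwards from HEAD until the last-known commit ID is found. A sketch with a plain map standing in for the persisted commit `Obj`s (`changesSince` is a hypothetical helper, not a Polaris API):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Sketch of "changes since commit X": walk the commit log backwards from
// HEAD until the last-known commit ID is reached. `parents` stands in for
// the persisted commit Objs (commit ID -> parent ID, 0 meaning "no parent").
final class ChangeLog {
  static Deque<Long> changesSince(long head, long lastKnown, Map<Long, Long> parents) {
    Deque<Long> newer = new ArrayDeque<>();
    for (long id = head; id != lastKnown && id != 0L; id = parents.getOrDefault(id, 0L)) {
      newer.addFirst(id); // oldest-first order, like `git log --reverse`
    }
    return newer;
  }
}
```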
+
+## Cleanup old commits / unused data
+
+Despite the beauty of having a "commit log" and all metadata representations in the backing database, the size of that database would otherwise grow forever.
+
+Purging unused table/view metadata memoized in the database is one part. Purging old commit log entries is the second part. Purging (then) unreferenced `Obj`s is the third part.
+
+See [maintenance service](#maintenance-service) below.
+
+## Realms (aka tenants)
+
+Bootstrapping and, more importantly, deleting/purging a realm are non-trivial operations that require their own lifecycle. Bootstrapping is a straightforward operation, as the necessary information can be validated and enhanced if necessary.
+
+Both the logical and the physical process of realm deletion are more complex. From a logical point of view, users want to disable a realm for a while before they are eventually okay with deleting its information.
+
+The process to delete a realm's data from the database can be quite time-consuming, and how that happens is database specific. While some databases can do bulk deletions, which "just" take some time (RDBMS, BigTable), other databases require that deleting a realm happens during a full scan of the database (for example RocksDB and Apache Cassandra). Scanning the whole database can itself take quite long, and no more than one instance should scan the database at any time.
+
+Realms have a status to reflect their lifecycle. The initial status of a realm is `CREATED`, which effectively only means that the realm ID has been reserved and that the necessary data still needs to be populated (bootstrap). Once a realm has been fully bootstrapped, its status is changed to `ACTIVE`. Only `ACTIVE` realms can serve user requests.
+
+Between `CREATED` and `ACTIVE`/`INACTIVE` there are two mutually exclusive states. `INITIALIZING` means that Polaris will initialize the realm as a fresh, new realm. `LOADING` means that realm data, which has been exported from another Polaris instance, is to be imported.
+
+Realm deletion is a multi-step process as well: realms are first put into the `INACTIVE` state, which can be reverted to `ACTIVE` or advanced to `PURGING`. `PURGING` means that the realm's data is being deleted from the database; once purging has started, the realm's information in the database is inconsistent and cannot be restored. Once the realm's data has been purged, the realm is put into the `PURGED` state. Only realms in the `PURGED` state can be deleted.
+
+The multi-state approach also ensures that a realm can only be used when the system knows that all necessary information is present.
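
The lifecycle described above can be summarized as a small state machine. The transition table below is a sketch derived from this section; the exact transitions in the (not yet fully implemented) code may differ:

```java
import java.util.Map;
import java.util.Set;

// Sketch of the realm lifecycle transitions described above.
enum RealmStatus { CREATED, INITIALIZING, LOADING, ACTIVE, INACTIVE, PURGING, PURGED }

final class RealmLifecycle {
  static final Map<RealmStatus, Set<RealmStatus>> TRANSITIONS = Map.of(
      RealmStatus.CREATED, Set.of(RealmStatus.INITIALIZING, RealmStatus.LOADING),
      RealmStatus.INITIALIZING, Set.of(RealmStatus.ACTIVE),
      RealmStatus.LOADING, Set.of(RealmStatus.ACTIVE),
      RealmStatus.ACTIVE, Set.of(RealmStatus.INACTIVE),
      RealmStatus.INACTIVE, Set.of(RealmStatus.ACTIVE, RealmStatus.PURGING),
      RealmStatus.PURGING, Set.of(RealmStatus.PURGED),
      RealmStatus.PURGED, Set.of()); // PURGED realms can only be deleted

  static boolean canTransition(RealmStatus from, RealmStatus to) {
    return TRANSITIONS.getOrDefault(from, Set.of()).contains(to);
  }
}
```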
+
+**Note**: the realm state machine is not fully implemented yet.

Review Comment:
   Do we expect that the Realm state machine would need to be exposed outside 
of the persistence layer? I could see a use case for it, but I'd like to try to 
keep this as targeted as possible.



##########
persistence/nosql/persistence/README.md:
##########
@@ -0,0 +1,224 @@
+
+## `::system::` realm
+
+Polaris persistence uses a system realm for node-ID leases and realm management. Realm IDs starting with two colons (`::`) are reserved for system use.
+
+### Named pointers in the `::system::` realm
+
+| Named pointer | Meaning         |
+|---------------|-----------------|
+| `realms`      | Realms, by name |
+
+## "User" realms
+
+### Named pointers in the user realms
+
+| Named pointer     | Meaning                      |
+|-------------------|------------------------------|
+| `root`            | Pointer to the "root" entity |
+| `catalogs`        | Catalogs                     |
+| `principals`      | Principals                   |
+| `principal-roles` | Principal roles              |
+| `grants`          | All grants                   |
+| `immediate-tasks` | Immediately scheduled tasks  |
+| `policy-mappings` | Policy mappings              |
+
+Per-catalog named pointers, where `%d` refers to the catalog's integer ID:
+
+| Named pointer       | Meaning                                          |
+|---------------------|--------------------------------------------------|
+| `cat/%d/roles`      | Catalog roles                                    |
+| `cat/%d/heads/main` | Catalog content (namespaces, tables, views, etc) |
+| `cat/%d/grants`     | Catalog related grants (*)                       |
+
+(*) = currently not used; these grants are stored in the realm-level `grants` pointer instead.
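
Since `%d` is just the catalog's decimal ID, composing these pointer names is plain string formatting (an illustrative helper, not an actual Polaris API):

```java
// Illustrative helper composing per-catalog named-pointer names;
// not an actual Polaris API.
final class CatalogPointers {
  static String roles(long catalogId)    { return "cat/" + catalogId + "/roles"; }
  static String mainHead(long catalogId) { return "cat/" + catalogId + "/heads/main"; }
  static String grants(long catalogId)   { return "cat/" + catalogId + "/grants"; }
}
```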
+
+## Maintenance Service
+
+**Note**: maintenance service not yet in the code base.

Review Comment:
   Just checking, is this not currently implemented? I see a fair amount of 
code in the maintenance sub-directory.



##########
persistence/nosql/persistence/README.md:
##########
@@ -0,0 +1,224 @@
+## Named pointers
+
+Polaris represents a catalog for data lakehouses, which means that the 
information of and for catalog entities like
+Iceberg tables, views and namespaces must be consistent, even if multiple 
catalog entities are changes in a single

Review Comment:
   Nit: "even if multiple catalog entities are changes in a single" -> "even if 
multiple catalog entities are changed in a single" 



##########
persistence/nosql/persistence/README.md:
##########
@@ -0,0 +1,224 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+ 
+   http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Database agnostic persistence framework
+
+This persistence API and functional implementations are based on the 
assumption that all databases targeted as backing
+stores for Polaris support "compare and swap" operations on a single row. 
These CAS operations are the only requirement.
+
+Some databases enforce hard size limits: for example, DynamoDB has a hard 400kB row size limit and MariaDB/MySQL
+has a default 512kB packet size limit. Other databases have row-size recommendations of similar magnitude. Polaris
+persistence respects those limits and recommendations using a common hard limit of 350kB.
+
+Objects exposed via the `Persistence` interface are typed Java objects that must be immutable and serializable using
+Jackson. Each type is described via an implementation of the `ObjType` interface using a name, which must be unique
+in Polaris, and a target Java type, i.e. the Jackson-serializable Java type. Types are registered via the Java service
+loader mechanism using `ObjType`. The actual Java target types must extend the `Obj` interface. The (logical) key for
+each `Obj` is a composite of the `ObjType.id()` and a `long` ID (a 64-bit signed integer), combined using the `ObjId`
+composite type.
+
+The "primary key" of each object in a database is always _realmId + object-ID_, where the realm ID is a string and the
+object ID is a 64-bit integer. This allows, but does not enforce, storing multiple realms in one backend database.
+
+Data of each Polaris realm (think: _tenant_) is isolated using the realm's ID (string). The base `Persistence`
+API interface is always scoped to exactly one realm ID.
+
+## Supporting more databases
+
+The code to support a particular database is isolated in a project, for 
example `polaris-persistence-nosql-inmemory` and
+`polaris-persistence-nosql-mongodb`.
+
+When adding another database, it must also be wired up to Quarkus in `polaris-persistence-nosql-cdi-quarkus`,
+preferably using Quarkus extensions, added to the `polaris-persistence-correctness` tests, and made available in
+`polaris-persistence-nosql-benchmark` for low-level benchmarks.
+
+## Named pointers
+
+Polaris represents a catalog for data lakehouses, which means that the information of and for catalog entities like
+Iceberg tables, views and namespaces must be consistent, even if multiple catalog entities are changed in a single
+atomic operation.
+
+Polaris leverages a concept called "named pointers". The state of the whole catalog is referenced via the so-called
+HEAD (think: Git HEAD), which _points to_ all catalog entities. This state is persisted as an `Obj` with an index
+of the catalog entities; the ID of that "current catalog state `Obj`" is maintained in one named pointer.
+
+Named pointers are also used for purposes other than catalog entities, for example to maintain realms or
+configurations.
+
+## Committing changes
+
+Changes are persisted using a commit mechanism that provides atomic changes across multiple entities against one named
+pointer. The implementation ensures that even high-frequency concurrent changes neither cause clients to fail nor cause
+timeouts. The behavior and achievable throughput depend on the database being used; some databases perform
+_much_ better than others.
+
+A use-case agnostic "committer" abstraction exists to ease implementing commit operations. For catalog operations
+there is a more specialized abstraction.
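The heart of such a commit mechanism is a compare-and-swap retry loop against the named pointer. A minimal in-memory sketch, assuming hypothetical names (the real committer abstraction works against the database's CAS operation, not an `AtomicReference`):

```java
// Illustrative CAS-based commit loop against a named pointer.
// Types and method names are assumptions, not the actual Polaris API.
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

final class NamedPointer<T> {
  private final AtomicReference<T> current = new AtomicReference<>();

  NamedPointer(T initial) { current.set(initial); }

  T get() { return current.get(); }

  // Retry until the compare-and-swap succeeds: read the current state,
  // apply the change, and swap only if nobody else committed in between.
  T commit(UnaryOperator<T> change) {
    while (true) {
      T expected = current.get();
      T updated = change.apply(expected);
      if (current.compareAndSet(expected, updated)) {
        return updated;
      }
      // CAS failed: another commit won the race; re-read and retry.
    }
  }
}
```

Because the change function is re-applied on every retry, concurrent committers never overwrite each other's changes; they simply rebase onto the latest state.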
+
+## `long` IDs
+
+Polaris persistence uses so-called Snowflake IDs: 64-bit integers that encode a timestamp, a node ID,
+and a sequence number. The epoch of these timestamps is 2025-03-01 00:00:00.0 GMT. Timestamps occupy 41 bits at
+millisecond precision, which lasts for about 69 years. Node IDs are 10 bits, which allows 1024 concurrently active
+"JVMs running Polaris". 12 bits are used for the sequence number, which allows each node to generate 4096 IDs per
+millisecond. One bit is reserved for future use.
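The bit layout above (1 reserved bit, 41 timestamp bits, 10 node-ID bits, 12 sequence bits) could be packed as follows; the field order is an assumption for illustration and may differ from the actual implementation:

```java
// Sketch of the Snowflake layout: timestamp (41 bits) | node ID (10 bits) |
// sequence (12 bits), with the sign bit reserved. Field order is an
// assumption for illustration.
final class SnowflakeIds {
  static final int NODE_BITS = 10;
  static final int SEQUENCE_BITS = 12;

  static long pack(long millisSinceEpoch, int nodeId, int sequence) {
    return (millisSinceEpoch << (NODE_BITS + SEQUENCE_BITS))
        | ((long) nodeId << SEQUENCE_BITS)
        | sequence;
  }

  static long timestampOf(long id) {
    return id >>> (NODE_BITS + SEQUENCE_BITS);
  }

  static int nodeIdOf(long id) {
    return (int) ((id >>> SEQUENCE_BITS) & ((1 << NODE_BITS) - 1));
  }

  static int sequenceOf(long id) {
    return (int) (id & ((1 << SEQUENCE_BITS) - 1));
  }
}
```
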
+
+Node IDs are leased by every "JVM running Polaris" for a period of time. The ID generator implementation guarantees
+that no IDs are generated with a timestamp that exceeds the lease time. Leases can be extended. The implementation
+leverages atomic database operations (CAS) for leasing.
+
+ID generators must not use timestamps outside the lease period, nor may they re-use an older timestamp. This
+requirement is satisfied using a monotonic clock implementation.
+
+## Caching
+
+Since most `Obj`s are assumed to be immutable by default, caching is very straightforward and does not require any
+coordination, which simplifies the design and implementation quite a bit.
+
+## Strong vs eventual consistency
+
+Polaris persistence offers two ways to persist `Obj`s: strongly consistent and 
eventually consistent. The former is
+slower than the latter.
+
+Since Polaris persistence respects the hard size limits mentioned above, it cannot persist the serialized
+representation of objects that exceed those limits in a single database row. Some objects, however, legitimately exceed
+those limits. Polaris persistence allows such "big object serializations" and persists them across multiple database
+rows, with the restriction that this is only supported for eventually consistent writes. The serialized representation
+for strongly consistent writes must always be within the hard limit.
+
+## Indexes
+
+The state of a data lakehouse catalog can contain many thousands, potentially a few hundred thousand,
+tables/views/namespaces. Even a space-efficient serialization of an index for that many entries exceeds the
+"common hard 350kB limit". New changes end up in the index, which is "embedded" in the "current catalog state `Obj`".
+When the size limit of this "embedded" index is approached, the index is spilled out to separate rows in the database.
+The implementation is built to split and combine index segments as needed.
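The splitting step can be pictured as partitioning a sorted index into bounded-size segments, each of which would fit into its own database row. A hypothetical sketch (names and the entry-count limit are illustrative assumptions; the real implementation splits by serialized size, not entry count):

```java
// Hypothetical sketch of spilling a sorted index into bounded-size segments.
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

final class IndexSpill {
  // Split a sorted key -> object-ID index into segments of at most
  // maxEntries each, mimicking how an embedded index could be spilled
  // out to separate database rows.
  static List<SortedMap<String, Long>> split(SortedMap<String, Long> index, int maxEntries) {
    List<SortedMap<String, Long>> segments = new ArrayList<>();
    SortedMap<String, Long> current = new TreeMap<>();
    for (var e : index.entrySet()) {
      current.put(e.getKey(), e.getValue());
      if (current.size() == maxEntries) {
        segments.add(current);
        current = new TreeMap<>();
      }
    }
    if (!current.isEmpty()) {
      segments.add(current);
    }
    return segments;
  }
}
```
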
+
+## Change log / events / notifications
+
+The commit mechanism described above builds a commit log. All changes can be inspected via that log in exactly the
+order in which they happened (think: `git log`). Since the log of changes is already present, it is possible to
+retrieve the changes since some point in time or commit log ID; this allows clients to receive all changes that
+happened since the last known commit ID, offering a mechanism to poll for changes. Since the necessary `Obj`s are
+immutable, such change-log requests likely hit already cached data rather than the database.
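The polling idea can be sketched by walking a parent-linked commit log back from HEAD until the last known commit ID is reached; all names here are illustrative assumptions, not the actual Polaris API:

```java
// Hypothetical sketch of "changes since commit X" over a parent-linked log.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

record Commit(long id, long parentId, String message) {}

final class CommitLog {
  private static final long NO_PARENT = -1L;

  private final Map<Long, Commit> commits = new HashMap<>();
  private long head = NO_PARENT;

  void append(Commit c) {
    commits.put(c.id(), c);
    head = c.id();
  }

  // Return the commits after 'sinceId', oldest first, by walking parent
  // links back from HEAD until the known commit (or the log start) is hit.
  List<Commit> changesSince(long sinceId) {
    Deque<Commit> result = new ArrayDeque<>();
    long cursor = head;
    while (cursor != sinceId && cursor != NO_PARENT) {
      Commit c = commits.get(cursor);
      result.addFirst(c);
      cursor = c.parentId();
    }
    return List.copyOf(result);
  }
}
```
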
+
+## Cleanup old commits / unused data
+
+Despite the beauty of having a "commit log" and all metadata representations in the backing database, the size of that
+database would otherwise grow indefinitely. Cleanup consists of three parts:
+
+* Purging unused table/view metadata memoized in the database.
+* Purging old commit log entries.
+* Purging (then) unreferenced `Obj`s.
+
+See [maintenance service](#maintenance-service) below.
+
+## Realms (aka tenants)
+
+Bootstrapping, but more importantly deleting/purging, a realm is a non-trivial operation that requires its own
+lifecycle. Bootstrapping is a straightforward operation, as the necessary information can be validated and enhanced
+if necessary.
+
+Both the logical and the physical process of realm deletion are more complex. From a logical point of view,
+users want to disable a realm for a while before they are eventually okay with deleting its information.
+
+The process to delete a realm's data from the database can be quite time-consuming, and how that happens is database
+specific. While some databases can do bulk deletions, which "just" take some time (RDBMS, BigTable), other databases
+require that deleting a realm happens during a full scan of the database (for example RocksDB and Apache Cassandra).
+Scanning the whole database can itself take quite long, and no more than one instance should scan the database at any
+time.
+
+Realms have a status to reflect their lifecycle. The initial status of a realm is `CREATED`, which effectively only
+means that the realm ID has been reserved and that the necessary data still needs to be populated (bootstrap). Once a
+realm has been fully bootstrapped, its status changes to `ACTIVE`. Only `ACTIVE` realms can serve user requests.
+
+Between `CREATED` and `ACTIVE`/`INACTIVE` there are two mutually exclusive states: `INITIALIZING` means that
+Polaris will initialize the realm as a fresh, new realm; `LOADING` means that realm data, which has been exported
+from another Polaris instance, is to be imported.
+
+Realm deletion is a multi-step process as well: realms are first put into the `INACTIVE` state, from which they can be
+reverted to `ACTIVE` or advanced to `PURGING`. `PURGING` means that the realm's data is being deleted from the
+database; once purging has started, the realm's information in the database is inconsistent and cannot be restored.
+Once the realm's data has been purged, the realm is put into the `PURGED` state. Only realms in the `PURGED` state
+can be deleted.
+
+The multi-state approach also ensures that a realm can only be used once the system knows that all necessary
+information is present.
+
+**Note**: the realm state machine is not fully implemented yet.
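The lifecycle described above could be sketched as a state machine; the transition table below is inferred from the prose and is an assumption, not the actual (not yet fully implemented) realm state machine:

```java
// Sketch of the realm lifecycle; the transition set is inferred from the
// description above and is an assumption.
import java.util.Map;
import java.util.Set;

enum RealmStatus {
  CREATED, INITIALIZING, LOADING, ACTIVE, INACTIVE, PURGING, PURGED;

  private static final Map<RealmStatus, Set<RealmStatus>> TRANSITIONS = Map.of(
      CREATED, Set.of(INITIALIZING, LOADING),  // mutually exclusive paths
      INITIALIZING, Set.of(ACTIVE),            // fresh, new realm
      LOADING, Set.of(ACTIVE),                 // import of exported realm data
      ACTIVE, Set.of(INACTIVE),
      INACTIVE, Set.of(ACTIVE, PURGING),       // INACTIVE can be reverted
      PURGING, Set.of(PURGED),                 // irreversible once started
      PURGED, Set.of());                       // terminal; realm can be deleted

  boolean canTransitionTo(RealmStatus next) {
    return TRANSITIONS.get(this).contains(next);
  }
}
```
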
+
+## `::system::` realm
+
+Polaris persistence uses a system realm for node ID leases and realm management. Realm IDs starting
+with two colons (`::`) are reserved for system use.
+
+### Named pointers in the `::system::` realm
+
+| Named pointer | Meaning         |
+|---------------|-----------------|
+| `realms`      | Realms, by name |
+
+## "User" realms
+
+### Named pointers in the user realms
+
+| Named pointer     | Meaning                      |
+|-------------------|------------------------------|
+| `root`            | Pointer to the "root" entity |

Review Comment:
   What is the root entity in this case? The realm?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@polaris.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

