Re: Discussion: Re-evaluating Realm Modeling in Polaris

Yufei Gu Fri, 18 Apr 2025 13:38:28 -0700

Hi Folks,

As we discussed, option 1 provides the strongest isolation, which should
work particularly well for dynamically created data sources. Another
significant benefit is that it's less complicated overall.


I'm not convinced we need both option 1 and option 3. For scenarios
involving only a single realm, the concept of a realm becomes unnecessary.
In that case, there's no need for any additional options, including option
3.

Yufei


On Tue, Apr 15, 2025 at 11:19 AM Dmitri Bourlatchkov <di...@apache.org>
wrote:

> Going with options 1 and 3 initially sounds good to me. This should
> simplify current JDBC PRs too.
>
> We can certainly add capabilities later, because having realm ID in the PR
> does not preclude other deployment choices.
>
> Cheers,
> Dmitri.
>
> On Tue, Apr 15, 2025 at 1:49 PM Michael Collado <collado.m...@gmail.com>
> wrote:
>
> > My $.02 is that Option 1 is entirely possible using a DataSource that
> > dynamically creates Connections as needed. Option 1 is nice because, as
> > Pierre said, it gives admins the ability to dynamically allocate
> resources
> > to different clients as needed.
> >
> > Personally, I'm less inclined to option 3 just because it means
> potentially
> > larger blast radius if database credentials are ever leaked. But if most
> > end users are expecting to only manage a single realm, it's probably the
> > easiest and solves the most common use case.
> >
> > I like the option of combining 1 and 3 - by default, a single tenant
> > deployment writes to a single end database, but admins have the ability
> to
> > configure dynamic connections to different database endpoints if multiple
> > realms are supported.
> >
> > Mike
> >
> > On Tue, Apr 15, 2025 at 9:32 AM Alex Dutra <alex.du...@dremio.com.invalid
> >
> > wrote:
> >
> > > Hi all,
> > >
> > > I'm in agreement with Pierre, JB and Dmitri's points. I’d like to add
> > some
> > > context from the Quarkus configuration angle:
> > >
> > > Option 1, which involves distinct datasources, presents a challenge.
> > > Quarkus requires all datasources to be present and fully configured at
> > > build time. This requirement could be quite cumbersome for end users,
> > > making this option less user-friendly in practice.
> > >
> > > Regarding Option 2, while it's theoretically possible to manage
> multiple
> > > schemas with a single datasource, implementing this can be complex. To
> > > effectively work with different schemas in PostgreSQL, you would need
> to
> > > either qualify all table identifiers or adjust the `search_path` URL
> > > parameter. Additionally, other JDBC backends like MySQL don't support
> > > multiple schemas per database, which would make Option 2 less portable
> > > across different JDBC databases.
> > >
> > > That's why I think Option 3 is the most portable one, and the easiest
> for
> > > users or administrators to configure. As Pierre noted, it is subject to
> > > noisy neighbor interferences – but to some extent, I think
> interferences
> > > could also happen with separate schemas like in option 2.
> > >
> > > Just my 2 cents.
> > >
> > > Thanks,
> > >
> > > Alex
> > >
> > >
> > > On Tue, Apr 15, 2025 at 4:00 PM Dmitri Bourlatchkov <di...@apache.org>
> > > wrote:
> > >
> > > > Thanks for your perspective, Pierre! You make good points and I agree
> > > with
> > > > them.
> > > >
> > > > From my POV, I'd add that we probably need to take deployment
> concerns
> > > into
> > > > account too.
> > > >
> > > > If the deployment uses the database per realm approach (option 1)
> then
> > > > someone has to provide database connection parameters (including
> > > secrets).
> > > > If that is the deployment administrator, then the admin necessarily
> has
> > > to
> > > > be aware of all realms and effectively has control of the data in all
> > > > realms. Isolation is achieved only for end users.
> > > >
> > > > That said, even with option 3 the deployment owner has control over
> all
> > > > realms and end users are isolated as far as their access to APIs is
> > > > concerned. End users cannot discover each other's data (barring
> coding
> > > > mistakes in Polaris). The same goes for option 2 as it's the middle
> > > ground.
> > > >
> > > > I do not see any material difference between options 1, 2 and 3 from
> > the
> > > > end user's perspective.
> > > >
> > > > If, however, the database connection parameters are not controlled by
> > the
> > > > administrator, but by the end user who wants to define a realm, then
> > > > Polaris needs to expose managing database connections and secrets.
> This
> > > may
> > > > be a valuable feature, but I believe it is far beyond current Polaris
> > > > backend capabilities. I do not think going this way is justified at
> > this
> > > > time.
> > > >
> > > > I'd like to propose a hybrid approach where Polaris provides
> > capabilities
> > > > (and config) for the administrators to choose between options 1, 2, 3
> > > > according to their specific deployment concerns.
> > > >
> > > > This means that the primary key has to include the realm ID, because
> if
> > > the
> > > > Polaris code does not provide it then the admin will not be able to
> > > choose
> > > > option 3 at runtime.
> > > >
> > > > WDYT?
> > > >
> > > > Thanks,
> > > > Dmitri.
> > > >
> > > > On Tue, Apr 15, 2025 at 8:35 AM Pierre Laporte <
> pie...@pingtimeout.fr>
> > > > wrote:
> > > >
> > > > > Hi Prashant
> > > > >
> > > > > I guess the answer will depend on how easy it should be for Polaris
> > to
> > > > > support multi-tenancy.
> > > > >
> > > > > A separate database per realm would allow administrators to limit
> the
> > > > > amount of resources that a realm can consume (e.g. the maximum
> number
> > > of
> > > > > database connections).  Indeed, it would be one of the strongest
> > > > isolation
> > > > > mode.  However, the code would need to support a complete database
> > > > > configuration per realm (think username and password and possibly
> IP
> > > > > address) if the goal is to match Postgres capabilities.  In terms
> of
> > > > > backup/restore, it is the most flexible option.
> > > > >
> > > > > A "one schema per realm" approach would be a simpler approach,
> > > regarding
> > > > > datasource configuration.  However, there would be less isolation
> > > between
> > > > > realms, and a resource utilization spike on one realm could impact
> > > > > performance of another realm.  It is as flexible as option #1
> > regarding
> > > > > backup and restore.
> > > > >
> > > > > A "realm as part of the primary key" approach is the most efficient
> > > way,
> > > > in
> > > > > that the cost of adding tenants is close to zero.  Like in option
> #2,
> > > > there
> > > > > is no real resource isolation between tenants and a noisy-neighbor
> > > > > situation is a possible issue.  The biggest difference is regarding
> > > > backup
> > > > > and restore.  Consider the case where data is accidentally
> > > > > wiped/corrupted/modified/... in a given tenant and administrators
> > want
> > > to
> > > > > restore it to a previous state.  With this approach, it is a much
> > more
> > > > > complex as Postgres does not (AFAIK) allow the possibility to
> restore
> > > > > tables partially.
> > > > >
> > > > > Just my 2 cents
> > > > >
> > > > > --
> > > > >
> > > > > Pierre
> > > > >
> > > > >
> > > > > On Tue, Apr 15, 2025 at 12:42 AM Prashant Singh
> > > > > <prashant.si...@snowflake.com.invalid> wrote:
> > > > >
> > > > > > Dear Polaris Community,
> > > > > >
> > > > > > This email initiates a discussion regarding the modeling of
> Realms
> > > > within
> > > > > > the Polaris project, following its recent mention in my JDBC
> > > > > implementation
> > > > > > pull request:
> > > > > > https://github.com/apache/polaris/pull/1287/files#r2040383971.
> > > > > >
> > > > > > My current understanding, based on available information, is that
> > > > Realms
> > > > > > were primarily intended for isolation. Consequently, the
> > EclipseLink
> > > > > > implementation treats each Realm as a separate database.
> > > > > >
> > > > > > As we are re-implementing this functionality, it was suggested
> that
> > > we
> > > > > > gather community feedback on the optimal approach to modeling
> > Realms.
> > > > > >
> > > > > > Based on my current understanding, here are potential modeling
> > > options:
> > > > > >
> > > > > > *1. Separate Databases per Realm:*
> > > > > >
> > > > > >    - Each Realm would correspond to a distinct database.
> > > > > >    - This could be implemented using Quarkus custom data sources,
> > > with
> > > > > one
> > > > > >    data source per Realm.
> > > > > >
> > > > > > *2. Separate Schemas per Realm:*
> > > > > >
> > > > > >    - Each Realm would correspond to a distinct database schema
> > > within a
> > > > > >    single database.
> > > > > >    - Most database systems support two-part identifiers (
> > > > > >    <schema_name>.<table_name>), allowing for data isolation.
> > > > > >
> > > > > > *3. Realm as a Primary Key:*
> > > > > >
> > > > > >    - A realm identifier would be added as a primary key (or part
> > of a
> > > > > >    composite primary key) to each Polaris table.
> > > > > >    - Data isolation would be enforced through filtering based on
> > this
> > > > key
> > > > > >    during data access.
> > > > > >
> > > > > > The optimal approach will likely depend on ease of use and
> > > > > maintainability
> > > > > > for database administrators.
> > > > > >
> > > > > > Please share your thoughts and preferences regarding these
> options.
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > Prashant Singh
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Discussion: Re-evaluating Realm Modeling in Polaris

Reply via email to