Re: [DISCUSS] Generic table delegation strategy in Polaris SparkCatalog

yun zou Tue, 17 Mar 2026 17:12:09 -0700

Hi ITing,

Thanks for bringing this up!


*>>> Should Polaris determine the provider first (via metadata) and
delegate to a single matching catalog, or should it attempt multiple
sub-catalogs in a defined order?*

*>>> If multiple sub-catalogs are supported, should there be a documented,
deterministic resolution order (Iceberg -> Paimon -> Delta -> Hudi ->
Polaris fallback)?*

As Dimitri pointed out, Polaris Catalog today is designed to support mixed
table types. In other words, a single catalog (and namespace) can contain
Iceberg, Delta, and Hudi tables, and table identifiers must be unique
across all of them.

Currently:

   - Iceberg tables are only visible through Iceberg endpoints
   - Generic tables are only visible through generic table endpoints
   - These two views are disjoint

Because of this, to get a complete view of all tables in a catalog, we need
to call listTables on both the Iceberg and generic endpoints.

For loadTable, since we only have the table identifier and don’t know the
table type upfront, we may need to try both endpoints in the worst case.
Client-side table format caching could help optimize this in the near future.

Regarding ordering, there isn’t a strict or required sequence when checking
different table types. For example, checking Generic first and then Iceberg
(or vice versa) won’t change the outcome. The current approach of
attempting Iceberg first is simply a convention, not a requirement.
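
As a rough illustration of the two points above (merging the disjoint
listings, and the lookup order not affecting the result), here is a minimal
Python sketch. It is not the Polaris Spark Client code: FakeEndpoint,
NoSuchTable, list_all_tables, and load_table are hypothetical names, and the
endpoints are in-memory stand-ins rather than real REST calls.

```python
class NoSuchTable(Exception):
    """Raised by a (hypothetical) endpoint when the identifier is unknown."""


class FakeEndpoint:
    """In-memory stand-in for one endpoint (Iceberg or generic tables)."""

    def __init__(self, tables):
        self._tables = dict(tables)  # identifier -> table format

    def list_tables(self):
        return sorted(self._tables)

    def load_table(self, ident):
        if ident not in self._tables:
            raise NoSuchTable(ident)
        return {"name": ident, "format": self._tables[ident]}


def list_all_tables(endpoints):
    """Union of the per-endpoint listings, which are assumed to be disjoint."""
    names = []
    for endpoint in endpoints:
        names.extend(endpoint.list_tables())
    return sorted(names)


def load_table(ident, endpoints):
    """Try each endpoint in turn; at most one of them knows the identifier."""
    for endpoint in endpoints:
        try:
            return endpoint.load_table(ident)
        except NoSuchTable:
            continue
    raise NoSuchTable(ident)


# Identifiers are unique across both endpoints, as described above.
iceberg = FakeEndpoint({"db.orders": "iceberg"})
generic = FakeEndpoint({"db.events": "delta", "db.clicks": "hudi"})
```

Because an identifier can live in only one of the two views, calling
load_table with [iceberg, generic] or [generic, iceberg] returns the same
table; only the number of failed probes differs.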

*>>> Should the per-format routing logic be centralised behind an
abstraction (e.g. a SubCatalogRouter interface or a provider registry), so
that adding a new format is a single registration rather than edits across
loadTable, alterTable, and dropTable?*

I think the current if/else logic mainly exists because we didn’t have a
clear understanding of how different formats would behave on the client
side at the time. Now that Delta, Hudi, and Lance appear to follow a
similar pattern, it makes sense to extract a common routing abstraction.
That would definitely simplify the code and make adding new formats a
matter of registration rather than touching multiple code paths.
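
One possible shape for such an abstraction, sketched in Python for brevity.
This is an assumption about how a registry could look, not an existing
Polaris interface: SubCatalogRouter here is only the name floated in the
question, and RecordingCatalog, register, and resolve are hypothetical.

```python
class SubCatalogRouter:
    """Hypothetical registry mapping a provider string to a sub-catalog."""

    def __init__(self, fallback):
        self._fallback = fallback  # used for unknown or missing providers
        self._catalogs = {}

    def register(self, provider, catalog):
        self._catalogs[provider.lower()] = catalog

    def resolve(self, provider):
        return self._catalogs.get((provider or "").lower(), self._fallback)


class RecordingCatalog:
    """Stand-in sub-catalog that records the operations it receives."""

    def __init__(self, name):
        self.name = name
        self.calls = []

    def load_table(self, ident):
        self.calls.append(("load", ident))
        return f"{self.name}:{ident}"

    def drop_table(self, ident):
        self.calls.append(("drop", ident))
        return True


# Every operation goes through the same resolve() call, so the routing
# policy lives in one place instead of being repeated per method.
def load_table(router, provider, ident):
    return router.resolve(provider).load_table(ident)


def drop_table(router, provider, ident):
    return router.resolve(provider).drop_table(ident)


polaris = RecordingCatalog("polaris")
router = SubCatalogRouter(fallback=polaris)
router.register("delta", RecordingCatalog("delta"))
router.register("hudi", RecordingCatalog("hudi"))
```

With this shape, supporting another format would be one more
router.register(...) call, and load_table and drop_table would not change.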

*>>> Consistency: Should all table operations (loadTable, createTable,
alterTable, dropTable, renameTable) follow the same routing strategy, or
are per-operation differences acceptable? Currently createTable has a
different branching structure from loadTable.*

In general, it would be good for most table operations (loadTable,
alterTable, dropTable, renameTable) to follow a consistent routing
strategy. However, createTable is a bit different — since we already know
the table format at creation time, we can directly route to the correct
endpoint. So I think it’s reasonable for createTable to have a different
branching structure.

*>>> Is it in scope for Polaris to act as a routing layer for multiple
table providers, or should users who need both Polaris and Paimon configure
them as separate catalogs in their Spark session and route at the session
level themselves?*

Polaris Server itself doesn’t perform routing. This responsibility lies
with the Polaris Spark Client, which should determine the correct endpoint
to call for each operation.

*>>> Since Paimon does not support a delegating catalog mode (unlike
Delta/Hudi), it cannot automatically notify Polaris of its changes.*

I may have missed this detail in the PR and will double-check. My
understanding is that Paimon’s SparkCatalog does not call into a REST
catalog as part of its table operations. In that case, it becomes the
client’s responsibility to ensure operations are executed correctly. If
needed, the client could invoke each operation twice (once against Paimon
and once against Polaris), but we’d also need to ensure proper failure
handling: if any step fails, the operation should be marked as failed and
any step that already completed rolled back correctly.
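
A minimal sketch of that failure handling, assuming a two-step create
(physical creation, then registration) with a compensating drop. All names
(create_in_both, FakeStore, StepFailed) are hypothetical, and the stores are
in-memory stand-ins for the Paimon warehouse and the Polaris catalog.

```python
class StepFailed(Exception):
    """Signals that the combined operation failed and was rolled back."""


class FakeStore:
    """In-memory stand-in for either the warehouse or the catalog."""

    def __init__(self, fail_on_register=False):
        self.tables = set()
        self.fail_on_register = fail_on_register

    def create(self, ident):
        self.tables.add(ident)

    def drop(self, ident):
        self.tables.discard(ident)

    def register(self, ident):
        if self.fail_on_register:
            raise RuntimeError("catalog unavailable")
        self.tables.add(ident)


def create_in_both(ident, warehouse, catalog):
    """Create physically, then register; undo step one if step two fails."""
    warehouse.create(ident)
    try:
        catalog.register(ident)
    except Exception as err:
        warehouse.drop(ident)  # compensating action, best effort only
        raise StepFailed(f"registering {ident} failed; creation undone") from err
```

This is compensation rather than a real transaction: if the compensating
drop itself fails, the two systems can still end up inconsistent.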


Best Regards,

Yun

On Tue, Mar 17, 2026 at 7:45 AM Dmitri Bourlatchkov <[email protected]>
wrote:

> Hi I-Ting,
>
> Unfortunately, I do not have an answer to your double registration question
> off the top of my head, but I added an item for this discussion to the
> Community Sync [1] agenda for March 19.
>
> [1] https://polaris.apache.org/community/meetings/
>
> Cheers,
> Dmitri.
>
> On Tue, Mar 17, 2026 at 10:19 AM ITing Lee <[email protected]> wrote:
>
> > Hi Dmitri,
> >
> > Thank you for your clear guidance!
> >
> >
> > I completely agree with the unified namespace tree principle.
> >
> > To ensure Polaris acts as the single source of truth and avoids
> > resolution ambiguity, I will refactor the implementation to follow a
> > lookup-then-dispatch pattern.
> >
> > Instead of speculative probing, the SparkCatalog will first resolve the
> > table entity via Polaris metadata to identify the provider, then
> > deterministically route the call or throw a "Table format mismatch"
> > error if the API mode is incompatible.
> >
> >
> > I have another question regarding table registration for non-delegating
> > formats.
> >
> > Since Paimon does not support a delegating catalog mode (unlike
> > Delta/Hudi), it cannot automatically notify Polaris of its changes.
> >
> > In my PR, I've implemented an explicit dual registration during
> > createTable (physical creation in the Paimon warehouse followed by
> > logical registration in Polaris).
> >
> > This ensures Paimon tables are visible via SHOW TABLES.
> >
> >
> > I would like to ask whether the community has better ideas for handling
> > such standalone formats. (From my perspective, the dual registration is
> > not an atomic operation across both systems. There is still a chance
> > that one service succeeds while the other fails, which would cause
> > inconsistency. However, it _seems_ this is the only way to achieve it
> > for a non-delegating format.)
> >
> >
> > The alternative, having Polaris actively scan external warehouses,
> > seems to introduce significant performance overhead.
> >
> > Is there a more elegant way to ensure catalog visibility without
> > sacrificing the goal of a single source of truth, or is this explicit
> > registration the preferred pattern for now?
> >
> >
> > Best regards,
> >
> > I-Ting
> >
> > On Mon, Mar 16, 2026 at 9:42 PM Dmitri Bourlatchkov <[email protected]> wrote:
> >
> > > Hi I-Ting,
> > >
> > > Thanks for starting this discussion. You bring up important points.
> > >
> > > From my point of view, the catalog data controlled by Polaris should
> > > form a unified namespace tree. In other words, each full table name
> > > owned by Polaris must be unique and resolve to the same table entity
> > > regardless of the API used by the client.
> > >
> > > If a name is accessed via the Iceberg REST Catalog API and happens to
> > > point to a Paimon table, I think Polaris ought to report an error to
> > > the client (something like HTTP 422 "Table format mismatch").
> > >
> > > If a name is accessed via the Generic Tables API, the response must
> > > indicate the actual table format.
> > >
> > > I do not think the client should make multiple "lookup" calls for the
> > > same table name. That creates ambiguity in the name resolution logic
> > > and could lead to different lookup results in different clients.
> > >
> > > I believe the client should select the API it wants to use (IRC or
> > > Generic Tables) at setup time and then rely on that API for all
> > > primary lookup calls.
> > >
> > > WDYT?
> > >
> > > Thanks,
> > > Dmitri.
> > >
> > > On Sat, Mar 14, 2026 at 3:34 AM 李宜頲 <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > We are adding support for Paimon inside Polaris's SparkCatalog.
> > > > Before we add more formats, we would like to get community input on
> > > > the intended architecture.
> > > >
> > > > This discussion originated from a code review conversation in PR #3820
> > > > <https://github.com/apache/polaris/pull/3820#discussion_r2865885791>
> > > >
> > > >
> > > >
> > > > *Current design*
> > > >
> > > > When SparkCatalog.loadTable is called, the routing works in three
> > > > phases:
> > > >
> > > >
> > > > 1. Try the Iceberg catalog (icebergSparkCatalog.loadTable). If it
> > > > succeeds, return immediately.
> > > >
> > > > 2. Call getTableFormat(ident), which makes a single HTTP GET to the
> > > > Polaris server to read the provider property stored in the generic
> > > > table metadata, without triggering any Spark DataSource resolution.
> > > >
> > > > 3. Route based on the provider string:
> > > >
> > > >     - "paimon"  : delegate to Paimon's SparkCatalog
> > > >
> > > >     - unknown/other : fall back to polarisSparkCatalog.loadTable,
> > > >       which performs full DataSource resolution
> > > >
> > > >
> > > > The same three-phase pattern is repeated independently in loadTable,
> > > > alterTable, and dropTable *(createTable does not follow this
> > > > pattern)*. This raises the concern that the routing logic is
> > > > intrusive: every new format requires parallel changes across all
> > > > three methods, and there is no single place that describes the full
> > > > routing policy.
> > > >
> > > >
> > > > *Questions for discussion*
> > > >
> > > >
> > > > 1. Should Polaris determine the provider first (via metadata) and
> > > > delegate to a single matching catalog, or should it attempt multiple
> > > > sub-catalogs in a defined order?
> > > >
> > > > 2. If multiple sub-catalogs are supported, should there be a
> > > > documented, deterministic resolution order (Iceberg -> Paimon ->
> > > > Delta -> Hudi -> Polaris fallback)? Who owns that order, and should
> > > > it be configurable by operators?
> > > >
> > > > 3. Should the per-format routing logic be centralised behind an
> > > > abstraction (e.g. a SubCatalogRouter interface or a provider
> > > > registry), so that adding a new format is a single registration
> > > > rather than edits across loadTable, alterTable, and dropTable?
> > > >
> > > > 4. Consistency: Should all table operations (loadTable, createTable,
> > > > alterTable, dropTable, renameTable) follow the same routing
> > > > strategy, or are per-operation differences acceptable? Currently
> > > > createTable has a different branching structure from loadTable.
> > > >
> > > > 5. Is it in scope for Polaris to act as a routing layer for multiple
> > > > table providers, or should users who need both Polaris and Paimon
> > > > configure them as separate catalogs in their Spark session and route
> > > > at the session level themselves?
> > > >
> > > >
> > > > We have a working Paimon implementation today and would like to
> > > > avoid locking in a pattern that becomes hard to extend. Any input on
> > > > the design direction, or pointers to prior discussion on this topic,
> > > > would be much appreciated.
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > I-Ting
> > > >
> > >
> >
>
