Hi all,

We are adding support for Paimon inside Polaris's SparkCatalog. Before we
add more formats, we would like to get community input on the intended
architecture.

This discussion originated from a code review conversation in PR #3820:
<https://github.com/apache/polaris/pull/3820#discussion_r2865885791>



*Current design*

When SparkCatalog.loadTable is called, the routing works in three phases:


1. Try the Iceberg catalog (icebergSparkCatalog.loadTable). If it succeeds,
return immediately.

2. Call getTableFormat(ident), which makes a single HTTP GET to the Polaris
server to read the provider property stored in the generic table metadata,
without triggering any Spark DataSource resolution.

3. Route based on the provider string:

    - "paimon"  : delegate to Paimon's SparkCatalog

    - unknown/other : fall back to polarisSparkCatalog.loadTable, which
performs full DataSource resolution
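
As a rough illustration only, the three phases above could be sketched
as follows. This is not the actual Polaris code: the SubCatalog interface,
the Optional-based return type, and the in-memory map standing in for the
HTTP GET are all assumptions made to keep the example self-contained.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a sub-catalog; NOT Spark's TableCatalog interface.
interface SubCatalog {
    // Returns a description of the table, or empty if this catalog cannot load it.
    Optional<String> loadTable(String ident);
}

final class LoadTableRoutingSketch {
    private final SubCatalog iceberg;
    private final SubCatalog paimon;
    private final SubCatalog polarisFallback;
    // Assumption: stands in for the single HTTP GET that reads the provider
    // property from the generic table metadata on the Polaris server.
    private final Map<String, String> providerByIdent;

    LoadTableRoutingSketch(SubCatalog iceberg, SubCatalog paimon,
                           SubCatalog polarisFallback,
                           Map<String, String> providerByIdent) {
        this.iceberg = iceberg;
        this.paimon = paimon;
        this.polarisFallback = polarisFallback;
        this.providerByIdent = providerByIdent;
    }

    // Phase 2 helper: returns the provider string, or null if none is recorded.
    String getTableFormat(String ident) {
        return providerByIdent.get(ident);
    }

    String loadTable(String ident) {
        // Phase 1: try Iceberg first and return immediately on success.
        Optional<String> icebergTable = iceberg.loadTable(ident);
        if (icebergTable.isPresent()) {
            return icebergTable.get();
        }
        // Phase 2: look up the provider without any DataSource resolution.
        String provider = getTableFormat(ident);
        // Phase 3: route on the provider string.
        SubCatalog target = "paimon".equals(provider) ? paimon : polarisFallback;
        return target.loadTable(ident)
                .orElseThrow(() -> new IllegalStateException("Table not found: " + ident));
    }

    public static void main(String[] args) {
        SubCatalog iceberg = id ->
                id.equals("db.ice") ? Optional.of("iceberg:" + id) : Optional.empty();
        SubCatalog paimon = id -> Optional.of("paimon:" + id);
        SubCatalog fallback = id -> Optional.of("polaris:" + id);
        LoadTableRoutingSketch router = new LoadTableRoutingSketch(
                iceberg, paimon, fallback, Map.of("db.pai", "paimon", "db.csv", "csv"));
        System.out.println(router.loadTable("db.ice"));
        System.out.println(router.loadTable("db.pai"));
        System.out.println(router.loadTable("db.csv"));
    }
}
```

In this sketch the same method body would have to be copied into alterTable
and dropTable, each with its own branch per provider.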


The same three-phase pattern is repeated independently in loadTable,
alterTable, and dropTable *(createTable does not follow this pattern)*.
One concern is that this makes the routing logic intrusive: every new
format requires parallel changes across all three methods, and there is no
single place that describes the full routing policy.


*Questions for discussion*


1. Should Polaris determine the provider first (via metadata) and delegate
to a single matching catalog, or should it attempt multiple sub-catalogs in
a defined order?

2. If multiple sub-catalogs are supported, should there be a documented,
deterministic resolution order (Iceberg -> Paimon -> Delta -> Hudi ->
Polaris fallback)? Who owns that order, and should it be configurable by
operators?

3. Should the per-format routing logic be centralised behind an abstraction
(e.g. a SubCatalogRouter interface or a provider registry), so that adding
a new format is a single registration rather than edits across loadTable,
alterTable, and dropTable?

4. Consistency: Should all table operations (loadTable, createTable,
alterTable, dropTable, renameTable) follow the same routing strategy, or
are per-operation differences acceptable? Currently createTable has a
different branching structure from loadTable.

5. Is it in scope for Polaris to act as a routing layer for multiple table
providers, or should users who need both Polaris and Paimon configure them
as separate catalogs in their Spark session and route at the session level
themselves?
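
To make question 3 more concrete, here is one possible shape for a provider
registry. It is a minimal sketch, not a proposal for the final API: the
names SubCatalogRouter, TableOps, and NamedOps are hypothetical, the
operations return plain strings for illustration, and the metadata lookup
is passed in as a function rather than being a real HTTP call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical minimal operation surface shared by every sub-catalog.
interface TableOps {
    String loadTable(String ident);
    String dropTable(String ident);
}

// Toy implementation that only reports which sub-catalog handled the call.
final class NamedOps implements TableOps {
    private final String name;

    NamedOps(String name) {
        this.name = name;
    }

    @Override
    public String loadTable(String ident) {
        return name + ":load:" + ident;
    }

    @Override
    public String dropTable(String ident) {
        return name + ":drop:" + ident;
    }
}

// Hypothetical router: the routing policy lives in resolve() and nowhere else.
final class SubCatalogRouter {
    private final Map<String, TableOps> byProvider = new HashMap<>();
    private final TableOps fallback;
    // Assumption: maps an identifier to its provider string, or null if unknown.
    private final Function<String, String> formatLookup;

    SubCatalogRouter(TableOps fallback, Function<String, String> formatLookup) {
        this.fallback = fallback;
        this.formatLookup = formatLookup;
    }

    // Adding a new format is a single registration.
    SubCatalogRouter register(String provider, TableOps ops) {
        byProvider.put(provider, ops);
        return this;
    }

    // Single place that decides which sub-catalog owns an identifier.
    TableOps resolve(String ident) {
        String provider = formatLookup.apply(ident);
        if (provider == null) {
            return fallback;
        }
        return byProvider.getOrDefault(provider, fallback);
    }

    // Every operation delegates through resolve(), so they cannot drift apart.
    String loadTable(String ident) {
        return resolve(ident).loadTable(ident);
    }

    String dropTable(String ident) {
        return resolve(ident).dropTable(ident);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = Map.of("db.pai", "paimon", "db.csv", "csv");
        SubCatalogRouter router =
                new SubCatalogRouter(new NamedOps("polaris"), metadata::get)
                        .register("paimon", new NamedOps("paimon"));
        System.out.println(router.loadTable("db.pai"));
        System.out.println(router.dropTable("db.csv"));
    }
}
```

Whether Iceberg would be registered like any other provider or kept as a
special first attempt is exactly what questions 1 and 2 ask.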


We have a working Paimon implementation today and would like to avoid
locking in a pattern that becomes hard to extend. Any input on the design
direction, or pointers to prior discussion on this topic, would be much
appreciated.


Best regards,

I-Ting
