I mostly agree with Denys's viewpoint. That is, when querying Iceberg and Hudi 
tables in HMS, engines need to implement and configure their own connectors. 
These connectors are specific to each engine and have nothing to do with HMS 
itself. HMS serves as a neutral, unified metadata management service, 
responsible only for managing the lifecycle of catalogs (such as creation and 
deletion) and providing unified metadata authorization services.




Let me add some extra information in response to lisoda's points:


1) Q1: HMS may store various types of tables (e.g., Iceberg, Hudi), and some 
engines may not be able to query certain types of tables stored in HMS.
First, this issue seems unrelated to the multi-catalog or federated catalog 
approach I proposed. This is essentially a problem where multiple table formats 
(Iceberg, Hudi, etc.) are mixed within a single HMS catalog. When a compute 
engine is configured with this HMS catalog, it may be able to see all tables 
via `SHOW TABLES`, but it may only be able to query a subset of them. This 
issue should be handled at the compute engine level. For example, the engine 
can determine whether a table should be visible or whether it can be queried 
based on table attributes like `table_type`.
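As a hypothetical sketch of that engine-side filtering (the `table_type` parameter name, the default, and the supported-format set below are illustrative assumptions, not HMS-defined constants):

```python
# Hypothetical sketch: an engine hides tables whose format it cannot read,
# based on a table attribute such as "table_type" from the HMS table
# parameters. The format names and default are illustrative only.
SUPPORTED_FORMATS = {"HIVE", "ICEBERG"}  # what this example engine can read

def visible_tables(tables, supported=SUPPORTED_FORMATS):
    """Return the names of tables this engine should expose via SHOW TABLES.

    `tables` is a list of (name, params) pairs, where params mimics the
    HMS table-parameters map, e.g. {"table_type": "ICEBERG"}.
    """
    visible = []
    for name, params in tables:
        fmt = params.get("table_type", "HIVE").upper()
        if fmt in supported:
            visible.append(name)
    return visible
```

The point is that this decision lives entirely in the engine; HMS just stores the attribute.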
For instance, StarRocks provides a catalog/connector called the Unified Catalog 
(https://docs.starrocks.io/docs/data_source/catalog/unified_catalog/), which 
can query multiple table formats (such as Iceberg and Hudi) stored in the same 
HMS.
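For reference, creating such a unified catalog in StarRocks looks roughly like this (the catalog name and metastore URI are placeholders; see the linked docs for the full property list):

```sql
-- Sketch: one StarRocks catalog over a single HMS holding mixed formats.
-- The catalog name and thrift URI below are placeholders.
CREATE EXTERNAL CATALOG hms_unified
PROPERTIES
(
    "type" = "unified",
    "unified.metastore.type" = "hive",
    "hive.metastore.uris" = "thrift://<hms_host>:9083"
);
```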


If users only want to query a specific type of table stored in the same HMS, 
such as Iceberg tables, they can create a dedicated catalog/connector, like the 
Iceberg Catalog 
(https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog/). 
This catalog/connector allows users to see only Iceberg tables when running 
`SHOW TABLES`, and any other table formats will be invisible.
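The dedicated variant is almost the same shape, just scoped to one format (again with placeholder name and URI):

```sql
-- Sketch: an Iceberg-only catalog over the same HMS; SHOW TABLES through
-- this catalog surfaces only Iceberg tables.
CREATE EXTERNAL CATALOG hms_iceberg
PROPERTIES
(
    "type" = "iceberg",
    "iceberg.catalog.type" = "hive",
    "hive.metastore.uris" = "thrift://<hms_host>:9083"
);
```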


Additionally, based on my tests, when using 
`org.apache.iceberg.spark.SparkSessionCatalog`, Spark should be able to query 
both Hive tables and Iceberg tables through the HMS catalog.
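For anyone who wants to reproduce this, the relevant Spark configuration (following the Iceberg docs; the runtime jar version is a placeholder) is roughly:

```shell
# Sketch: wrap Spark's built-in session catalog with Iceberg's
# SparkSessionCatalog so that non-Iceberg Hive tables still resolve.
# The iceberg-spark-runtime version is a placeholder.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:<version> \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive
```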


2) Q2: Regarding the concern about circular catalogs, I don't believe this issue exists.
When a compute engine is configured with an HMS catalog, that HMS catalog can 
only see its own catalog namespace (databases and tables). The engine cannot 
see information from other catalogs through this HMS catalog.




Thanks,
Butao Zhang
---- Replied Message ----
| From | lisoda<[email protected]> |
| Date | 3/20/2026 22:53 |
| To | dev<[email protected]> |
| Subject | Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive |
I understand your concern, but I may not have expressed myself clearly—I don't 
intend to tightly couple the catalog with specific engine runtime 
configurations either. What I'm suggesting is a lightweight convention 
mechanism, not deep integration.
My idea is actually quite simple: engines could report just a few boolean flags 
upon connection (e.g., `supports_iceberg: true/false`), or we could push the 
filtering logic down to the engine side via an SDK. This is less about 
"coupling" and more about a declarative contract.
From an engineering perspective, convention over configuration is generally the 
better path:

Convention (auto-reporting/filtering): The engine declares its capabilities → 
HMS or the SDK automatically masks incompatible metadata. This maintains a 
single source of truth—the physical properties of the table (format, location) 
directly determine its visibility.

Configuration (manual access control): Administrators manually maintain a 
separate set of ACL rules outside of HMS to hide certain tables. This 
essentially creates a duplicate definition—the metadata layer already defines 
"this is an Iceberg table," and then the permission layer has to define "this 
engine shouldn't see this Iceberg table." As the number of tables or engines 
scales, this manual synchronization overhead becomes unmanageable.
In other words, I'm not asking HMS to understand "what connectors Spark 3.4 has 
installed." I'm simply suggesting that the physical properties of the metadata 
(the format type) should automatically determine its distribution scope. If HMS 
remains completely agnostic and relies on external permission systems to 
retroactively hide visibility, doesn't that actually increase operational 
complexity?



---- Replied Message ----
| From | Denys Kuzmenko<[email protected]> |
| Date | 03/20/2026 19:12 |
| To | [email protected] |
| Cc | |
| Subject | Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive |
I don’t think tying catalog behavior to engine capabilities is a good 
direction. A catalog should remain engine-agnostic and focus purely on metadata 
management and discovery, not on the execution capabilities of specific query 
engines.

Hive Metastore is intentionally designed as a neutral metadata service. It 
exposes table definitions, while each engine (e.g., Apache Spark, Trino, etc.) 
decides whether it can actually process those tables based on its configured 
connectors or format support. Introducing capability negotiation would 
effectively couple the catalog to specific engines and their runtime 
configuration, which breaks that separation of concerns and makes the catalog 
responsible for execution-layer logic.

If a particular engine does not support a given format or catalog (for example, 
it does not have the appropriate client/connector installed), the cleaner 
solution is access control, not metadata filtering. In practice, permissions 
can simply be removed for users of that engine on catalogs or tables they are 
not expected to query.

Keeping the catalog engine-agnostic preserves interoperability and avoids 
embedding engine-specific behavior into the metadata layer.
