I've been looking into the pain points of integrating HMS with multiple 
catalogs, and there's an awkward gray area: HMS only stores metadata without 
checking the actual processing capabilities of the connecting engines. So when 
an Iceberg table is registered in HMS, every engine can see it, but only a few 
can actually query it.
This gets messy in production. When users (especially non-technical ones) can 
see a table but can't query it, their first reaction is usually to file a 
ticket suspecting a permissions bug or a network issue, only to find out it's 
something trivial like "this engine doesn't have the Iceberg connector 
installed."
I'm wondering if we could add some form of capability negotiation to HMS. When 
an engine connects, it declares what formats it can handle (e.g., "I can 
process Iceberg, but not Hudi"), and HMS filters the exposed tables based on 
that manifest—instead of the current "show everything and fail later" approach.
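To make the idea concrete, here is a minimal Python sketch of the filtering step. To be clear, this is purely illustrative: the capability manifest and the function name are hypothetical and nothing like them exists in HMS today.

```python
# Hypothetical sketch of capability-based table filtering in a metastore.
# "engine_capabilities" stands in for the manifest an engine would declare
# at connect time ("I can process Iceberg, but not Hudi").

def filter_tables(tables, engine_capabilities):
    """Return only the tables whose format the engine declared support for.

    tables: list of (table_name, table_format) tuples
    engine_capabilities: set of format names declared by the engine,
                         e.g. {"iceberg", "parquet"}
    """
    return [name for name, fmt in tables if fmt in engine_capabilities]

tables = [("orders", "iceberg"), ("events", "hudi"), ("users", "parquet")]
visible = filter_tables(tables, {"iceberg", "parquet"})
# The Hudi table is simply never exposed to this engine,
# instead of surfacing later as a cryptic execution-time error.
```

The interesting design questions are all in where this check lives (listing APIs only, or also direct `get_table` calls) and how engines declare the manifest, but the core filtering is this simple.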
The benefits are straightforward:

1. The UX would be much cleaner: either filter out unsupported tables 
entirely or return a clear "not supported" message, rather than throwing 
cryptic database errors.

2. The query layer doesn't waste effort fetching metadata it can't use, 
saving the optimizer from running stats collection only to discover at 
execution time that the format isn't supported.

3. From a security perspective, sensitive table names and schemas stay hidden 
from engines that can't process them, which is cleaner than relying on 
execution-time permission errors.
Bigger picture, this nudges HMS toward being a unified catalog layer. Iceberg 
is becoming the de facto standard for open table formats, but the catalog layer 
is still fragmented (Spark has spark_catalog, Trino has iceberg, Flink has 
its own). If HMS could support this capability registration mechanism, any 
compliant engine could get multi-catalog support out of the box without 
reinventing the wheel, truly decoupling storage from compute.
Obviously, this would require changes to HMS interfaces and might be a breaking 
change. But is this direction worth exploring at this stage?



---- Replied Message ----
From: Denys Kuzmenko <[email protected]>
Date: 03/20/2026 17:45
To: [email protected]
Subject: Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive
1. Catalogs are responsible for metadata management. If a table from an 
external data source is registered in a catalog, engines integrated with that 
catalog can discover the metadata, but actual query execution depends on 
whether the engine has the appropriate connector/format support.

This is already the case today. For example, when Apache Spark reads metadata 
from Hive Metastore, it may see Apache Iceberg tables. However, querying those 
tables requires the Iceberg catalog/connector to be configured in Spark.
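For reference, wiring Spark's session catalog to Iceberg tables in HMS looks roughly like this (the class and property names are from Iceberg's documented Spark integration; the metastore URI is a placeholder, and the matching iceberg-spark-runtime jar for your Spark/Scala version must also be on the classpath):

```properties
# spark-defaults.conf: route Spark's built-in catalog through Iceberg,
# backed by the Hive Metastore. Without this, Spark can still *see*
# Iceberg tables via HMS but cannot query them.
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type=hive
spark.hadoop.hive.metastore.uris=thrift://metastore-host:9083
```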

Therefore this behavior is not unhealthy or new—it is simply a consequence of 
separating metadata discovery (catalog) from execution capabilities (engine 
connectors). Engines are expected to configure the appropriate connector for 
the table formats or external catalogs they want to query.

2. The second concern, about circular catalogs, is primarily a configuration 
issue 
rather than an architectural problem. Modern query engines like Trino already 
operate with multiple catalogs and connectors. The query planner resolves a 
table reference to one catalog/connector, and that connector is responsible for 
accessing the underlying data source. The engine itself does not recursively 
resolve catalogs through other catalogs.
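As a concrete example of that resolution model, a Trino deployment exposing an Iceberg catalog backed by HMS might be configured like this (standard Trino Iceberg connector properties; the catalog name and metastore host are placeholders):

```properties
# etc/catalog/iceberg.properties — defines the "iceberg" catalog in Trino.
connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://metastore-host:9083
```

A query such as `SELECT * FROM iceberg.sales.orders` is resolved by the planner to this one catalog; the Iceberg connector then talks to HMS and the underlying storage directly, with no catalog-through-catalog recursion.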
