I've been looking into the pain points of integrating HMS with multiple catalogs, and there's an awkward gray area: HMS only stores metadata without checking the actual processing capabilities of the connecting engines. So when an Iceberg table is registered in HMS, every engine can see it, but only a few can actually query it. This gets messy in production. When users (especially non-technical ones) see a table but can't query it, their first reaction is usually to file a ticket for permission bugs or network issues, only to find out it's something trivial like "this engine doesn't have the Iceberg connector installed."

I'm wondering if we could add some form of capability negotiation to HMS. When an engine connects, it declares what formats it can handle (e.g., "I can process Iceberg, but not Hudi"), and HMS filters the exposed tables based on that manifest, instead of the current "show everything and fail later" approach.

The benefits are straightforward:
1. The UX would be much cleaner: either filter out unsupported tables entirely or give a clear "not supported" message, rather than throwing cryptic database errors.
2. The query layer doesn't waste effort fetching metadata it can't use (saving the optimizer from running stats collection only to discover at execution time that the format isn't supported).
3. From a security perspective, sensitive table names and schemas stay hidden from engines that can't process them, which is cleaner than relying on execution-time permission errors.

Bigger picture, this nudges HMS toward being a unified catalog layer. Iceberg is becoming the de facto standard for open table formats, but the catalog layer is still fragmented (Spark has spark_catalog, Trino has iceberg, Flink has its own). If HMS could support this capability registration mechanism, any compliant engine could get multi-catalog support out of the box without reinventing the wheel, truly decoupling storage from compute.
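To make the idea concrete, here is a minimal sketch of what the filtering step could look like. None of these types exist in HMS today; names like EngineCapabilities, TableDescriptor, and visibleTables are purely illustrative:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the proposed capability negotiation.
// These classes are illustrative only; they are not part of HMS.
public class CapabilityFilter {

    // A table as the metastore sees it: a name plus its table format.
    record TableDescriptor(String name, String format) {}

    // The manifest an engine would declare on connect: the table
    // formats it can actually execute queries against.
    record EngineCapabilities(Set<String> supportedFormats) {
        boolean supports(String format) {
            return supportedFormats.contains(format);
        }
    }

    // Instead of "show everything and fail later", the metastore
    // filters the table listing against the engine's manifest.
    static List<TableDescriptor> visibleTables(List<TableDescriptor> all,
                                               EngineCapabilities caps) {
        return all.stream()
                  .filter(t -> caps.supports(t.format()))
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<TableDescriptor> catalog = List.of(
            new TableDescriptor("sales", "iceberg"),
            new TableDescriptor("events", "hudi"),
            new TableDescriptor("users", "hive"));

        // An engine that handles Iceberg and native Hive tables, but not Hudi:
        // "events" is simply never exposed to it.
        EngineCapabilities engine =
            new EngineCapabilities(Set.of("iceberg", "hive"));

        for (TableDescriptor t : visibleTables(catalog, engine)) {
            System.out.println(t.name());
        }
    }
}
```

The real work would of course be in the Thrift interface and the get_tables/get_table paths, but the core contract is just this: listing is a function of (catalog contents, declared capabilities), not of catalog contents alone.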
Obviously, this would require changes to HMS interfaces and might be a breaking change. But is this direction worth exploring at this stage?
---- Replied Message ----
From: Denys Kuzmenko <[email protected]>
Date: 03/20/2026 17:45
To: [email protected]
Subject: Re: [Discuss][HIVE-28879] Federated Catalog Support in Apache Hive

1. Catalogs are responsible for metadata management. If a table from an external data source is registered in a catalog, engines integrated with that catalog can discover the metadata, but actual query execution depends on whether the engine has the appropriate connector/format support. This is already the case today. For example, when Apache Spark reads metadata from Hive Metastore, it may see Apache Iceberg tables. However, querying those tables requires the Iceberg catalog/connector to be configured in Spark. Therefore this behavior is not unhealthy or new; it is simply a consequence of separating metadata discovery (catalog) from execution capabilities (engine connectors). Engines are expected to configure the appropriate connector for the table formats or external catalogs they want to query.

2. The second concern about circular catalogs is primarily a configuration issue rather than an architectural problem. Modern query engines like Trino already operate with multiple catalogs and connectors. The query planner resolves a table reference to one catalog/connector, and that connector is responsible for accessing the underlying data source. The engine itself does not recursively resolve catalogs through other catalogs.
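For readers less familiar with point 1: the connector configuration in question, on the Spark side, is the standard Iceberg Spark runtime setup (the property names come from the Iceberg documentation; the catalog wiring shown is one common choice, not the only one):

```properties
# Enable Iceberg's SQL extensions and wrap Spark's built-in session
# catalog so that Iceberg tables registered in HMS can be queried,
# not merely listed.
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type=hive
```

Trino does the analogous thing with a per-catalog properties file (connector.name=iceberg plus a hive.metastore.uri). Without such configuration the engine can still discover the table through HMS but cannot plan or execute against it, which is exactly the gap the capability-negotiation proposal above is aimed at.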
