Hi all, I wanted to start a discussion around supporting Hive Catalog federation in Polaris. In particular, there are two primary ways we can add support for Hive federation:
*1. Support a single Hive instance per Polaris deployment*

The Hive workflow would be identical to the Hadoop catalog workflow. Polaris would invoke the Iceberg connection library, which would try to find the hive-site.xml file on (1) the CLASSPATH and (2) the default Hadoop locations: HADOOP_PATH and HADOOP_CONF_DIR. Polaris would then initialize the Hive connection using the configurations it found at these locations.

- *Drawbacks:* The primary drawback of this approach is that if Polaris finds multiple hive-site.xml files, it merges their configurations, which could leave the connection configuration in an inconsistent state. Furthermore, the order in which the configurations are applied is not clearly documented; while it is often predictable on a given OS, it is not guaranteed across environments. The other key drawback is that if a Polaris user wants to federate to multiple Hive catalogs, their only option is to deploy a separate Polaris instance for each Hive instance.

*2. Support multiple Hive instances per Polaris deployment*

The alternative (and, in my view, ideal) solution is to allow Polaris to federate with multiple Hive catalogs. To support multiple catalogs, Polaris would explicitly prevent the connection library from reading hive-site.xml files from the default paths. To pass in the configurations, Polaris can adopt one of two options:

- *Option 2a: Accept a canonical path to the target hive-site.xml* (a minimal sketch follows at the end of this message).
  - *Advantages:* This guarantees that the connection configuration is derived from a single source. It also allows Polaris to rely on the NONE/ENVIRONMENT/PROVIDER/UNMANAGED mechanism, which is especially useful when the Hive instance relies on Kerberos or custom authentication that Polaris does not natively support/manage.
  - *Drawbacks:* The user needs access (or some mechanism to upload files) to the Polaris server's file system.
- *Option 2b: Accept all the connection-specific parameters as part of the create-catalog request* (a second sketch follows below).
  - *Advantages:* Polaris can directly accept and store the configurations in a DPO instead of relying on the user having access to the server's file system (to create/update hive-site.xml).
  - *Drawbacks:* Polaris would need to manage the secrets. This is easy to support for certain authentication types (LDAP/Simple); however, it would preclude support for other authentication mechanisms, such as Kerberos or custom authentication.

I prefer option 2a, primarily because it provides the flexibility of supporting multiple federated Hive catalogs while allowing Polaris to support authentication that it does not natively manage.

Please let me know if you have any thoughts or feedback.

Thanks,
Pooja
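To make option 2a concrete, here is a minimal sketch of how a supplied hive-site.xml path could be loaded in isolation and handed to Iceberg's HiveCatalog. Configuration, addResource, setConf, and initialize are existing Hadoop/Iceberg APIs; the path, catalog name, and class name are placeholders I made up, and fully suppressing the default hive-site.xml discovery inside the Hive client would still need the explicit handling described above.

```java
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.hive.HiveCatalog;

public class HiveSitePathFederationSketch {

  public static void main(String[] args) {
    // Hypothetical canonical path supplied in the create-catalog request.
    String hiveSitePath = "/etc/polaris/federated-catalogs/analytics/hive-site.xml";

    // Build a Configuration from just the supplied file: loadDefaults=false skips
    // the default resource lookup, so the configuration comes from a single source.
    Configuration conf = new Configuration(false);
    conf.addResource(new Path(hiveSitePath));

    // Hand the isolated configuration to Iceberg's HiveCatalog.
    HiveCatalog catalog = new HiveCatalog();
    catalog.setConf(conf);
    catalog.initialize("analytics_hive", Map.of());
  }
}
```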
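And a correspondingly minimal sketch of option 2b, where the connection parameters arrive as plain key/value pairs in the create-catalog request and Polaris persists them (secrets included) in a DPO. The property keys are Iceberg's existing CatalogProperties constants; the endpoint, warehouse location, and catalog name are made-up examples, and this path only covers the Simple/LDAP-style cases noted above.

```java
import java.util.Map;

import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.hive.HiveCatalog;

public class HiveConnectionPropsFederationSketch {

  public static void main(String[] args) {
    // Hypothetical connection parameters taken from a create-catalog request body.
    Map<String, String> connectionProps = Map.of(
        CatalogProperties.URI, "thrift://hms.example.com:9083",          // becomes hive.metastore.uris
        CatalogProperties.WAREHOUSE_LOCATION, "s3://example-bucket/warehouse");

    // Iceberg's HiveCatalog copies the URI into hive.metastore.uris itself,
    // so no hive-site.xml file is needed for this basic case.
    HiveCatalog catalog = new HiveCatalog();
    catalog.initialize("analytics_hive", connectionProps);
  }
}
```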