Hi all,

I wanted to start a discussion around support for Hive Catalog federation
in Polaris. In particular, there are two primary ways we can add support
for Hive federation:

*1. Support a single Hive instance per Polaris deployment*

The Hive workflow would be identical to the Hadoop catalog workflow.
Polaris would invoke the Iceberg connection library, which would try to
find the hive-site.xml file in (1) the CLASSPATH and (2) the default Hadoop
locations: HADOOP_PATH and HADOOP_CONF_DIR. Polaris would then initialize
the Hive connection using the configurations it found at these locations.

   - *Drawbacks:* The primary drawback of this approach is that if Polaris
     finds multiple hive-site.xml files, it merges their configurations,
     which can leave the connection in an inconsistent state. Furthermore,
     there is no clear documentation of the order in which the
     configurations are applied; while that order is often predictable on a
     given OS, it is not guaranteed across environments. The other key
     drawback is that a Polaris user who wants to federate to multiple Hive
     catalogs has no option but to deploy a separate Polaris instance for
     each Hive instance.
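To make the option 1 flow a bit more concrete, here is a rough sketch of
what the Iceberg connection library side could look like. The class name,
catalog name, and thrift URI are placeholders (nothing Polaris defines
today); the key point is that the HiveConf built inside Iceberg's
HiveCatalog is what merges in whichever hive-site.xml it finds on the
classpath:

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.CatalogProperties;
    import org.apache.iceberg.CatalogUtil;
    import org.apache.iceberg.catalog.Catalog;

    public class SingleHiveFederationSketch {
      public static Catalog connect() {
        // Polaris hands an essentially empty Hadoop Configuration to the
        // Iceberg connection library. HiveCatalog wraps it in a HiveConf,
        // and that HiveConf is what picks up whichever hive-site.xml is
        // visible on the CLASSPATH / HADOOP_CONF_DIR.
        Configuration hadoopConf = new Configuration();
        Map<String, String> props = Map.of(
            // Placeholder URI; it could equally come from
            // hive.metastore.uris in the discovered hive-site.xml.
            CatalogProperties.URI, "thrift://metastore-host:9083");
        return CatalogUtil.loadCatalog(
            "org.apache.iceberg.hive.HiveCatalog", "federated_hive",
            props, hadoopConf);
      }
    }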

*2. Support multiple Hive instances per Polaris deployment*

The alternative (and, in my view, ideal) solution is to allow Polaris to
federate with multiple Hive catalogs. To support multiple catalogs, Polaris
would explicitly disallow the connection library from reading hive-site.xml
files in the default paths. To pass in the configurations, Polaris can
adopt one of two options:

   - *Option 2a: Accept a canonical path to the target hive-site.xml* (a
     rough sketch of this flow follows the list below).
      - *Advantages:* This guarantees that the connection configuration is
        derived from a single source. It also allows Polaris to rely on the
        NONE/ENVIRONMENT/PROVIDER/UNMANAGED mechanism, which is especially
        useful when the Hive instance relies on Kerberos or custom
        authentication that Polaris does not natively support/manage.
      - *Drawbacks:* The user needs access to the Polaris server's file
        system (or some mechanism to upload files to it).

   - *Option 2b: Accept all the connection-specific parameters as part of
     the create-catalog request.*
      - *Advantage:* Polaris can directly accept and store the
        configurations in a DPO instead of relying on the user having
        access to the server's file system (to create/update
        hive-site.xml).
      - *Drawback:* Polaris would need to manage the secrets. This is easy
        to support for certain authentication types (LDAP/Simple), but it
        would preclude support for other authentication mechanisms, such as
        Kerberos or Custom.
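Here is the rough sketch of option 2a mentioned above. The class, method,
and example path are placeholders rather than an existing Polaris API; the
idea is that the canonical hive-site.xml path supplied at catalog-creation
time is loaded into a per-catalog Hadoop Configuration, so each federated
Hive catalog gets its connection state from exactly one file:

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.iceberg.CatalogUtil;
    import org.apache.iceberg.catalog.Catalog;

    public class MultiHiveFederationSketch {
      // hiveSitePath is the canonical path stored with the catalog entity,
      // e.g. /etc/polaris/catalogs/<catalog-name>/hive-site.xml (placeholder).
      public static Catalog connect(String catalogName, String hiveSitePath) {
        // Build the Configuration from exactly one hive-site.xml, with
        // loadDefaults=false so no other resources are merged in. Note that
        // the HiveConf created inside the Iceberg HiveCatalog can still pick
        // up a hive-site.xml from the classpath, which is why the proposal
        // also disallows the default lookup paths.
        Configuration conf = new Configuration(false);
        conf.addResource(new Path(hiveSitePath));
        return CatalogUtil.loadCatalog(
            "org.apache.iceberg.hive.HiveCatalog", catalogName,
            Map.of(), conf);
      }
    }

With this shape, each catalog entity only needs to record its own path, so
the connection state of one federated Hive catalog stays isolated from the
others.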

I prefer Option 2a primarily because it provides the flexibility of
supporting multiple federated Hive catalogs while allowing Polaris to
support authentication that it does not natively manage. Please let me know
if you have any thoughts or feedback.

Thanks,
Pooja
