Re: [DISCUSS] Hive Catalog federation in Polaris

Yufei Gu Tue, 08 Jul 2025 09:27:12 -0700

HMS integration is a key step toward one of Polaris’s critical missions:
helping users move off HMS. It brings clear value by aligning with our
long-term direction.


I’m not too concerned about hive-site.xml; most of its configurations can be
dynamically injected at runtime. The real challenge lies in Kerberos
integration. Since krb5.conf and the keytab are globally configured per
JVM, a single JVM instance cannot support true multi-tenancy. As far as I
know, there isn’t a clean solution to this limitation.

If that's indeed the case, Option 2a becomes far less appealing to me.
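
A minimal sketch of the distinction, using only the JDK and assuming nothing about Polaris internals. The class and method names are hypothetical; `hive.metastore.uris` is the usual Hive setting for the metastore address and `java.security.krb5.conf` is the JDK system property that points at the Kerberos configuration file. Per-catalog settings can live in separate objects, while the Kerberos property is shared by everything in the JVM, so the last writer wins.

```java
import java.util.HashMap;
import java.util.Map;

public class TenantConfigSketch {
    // Hypothetical helper: each federated catalog gets its own settings map,
    // so values injected at runtime for one catalog never leak into another.
    static Map<String, String> catalogConf(String metastoreUri) {
        Map<String, String> conf = new HashMap<>();
        conf.put("hive.metastore.uris", metastoreUri);
        return conf;
    }

    // Hypothetical helper: the krb5.conf location is a JVM-wide system
    // property, so setting it for one tenant replaces it for every tenant.
    static void useKrb5Conf(String path) {
        System.setProperty("java.security.krb5.conf", path);
    }

    public static void main(String[] args) {
        Map<String, String> tenantA = catalogConf("thrift://hms-a:9083");
        Map<String, String> tenantB = catalogConf("thrift://hms-b:9083");

        // Both values survive side by side.
        System.out.println(tenantA.get("hive.metastore.uris"));
        System.out.println(tenantB.get("hive.metastore.uris"));

        useKrb5Conf("/etc/tenant-a/krb5.conf");
        useKrb5Conf("/etc/tenant-b/krb5.conf");

        // Only the second path remains visible to the whole JVM.
        System.out.println(System.getProperty("java.security.krb5.conf"));
    }
}
```

The host names and file paths are placeholders. The sketch only shows the scoping problem; it does not attempt a Kerberos login.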

Yufei


On Mon, Jul 7, 2025 at 11:18 AM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> I think having some integration with HMS is definitely a good idea. We've
> already seen users build this in the wild on top of Polaris, showing that
> there is definitely a demand. I'm still a strong believer that we should be
> helping users get to Polaris from whatever systems they are currently using.
>
> On Mon, Jul 7, 2025 at 12:59 PM Eric Maynard <eric.w.mayn...@gmail.com>
> wrote:
>
> > 1. We (Polaris) can provide end users a way to migrate off of these
> > catalogs that the Iceberg project no longer wants to invest in.
> > Implementing HMS federation is in service of the goal of removing
> > non-Iceberg catalogs, not in contradiction to it.
> >
> > 2. This does not seem like a user-centered concern, but I'm also not sure I
> > understand exactly what is being expressed here. Are you saying that the
> > current HADOOP federation does not work somehow?
> >
> > 3. Yes, please see the other thread about the IMPLICIT authentication type
> > for discussion of this topic. Note, however, that HMS federation may
> > support authentication types other than IMPLICIT.
> >
> > 4. That depends on what you mean by "depends on" -- it could also be said
> > that Iceberg itself depends on Hadoop.
> >
> > 5. This concern also applies to HADOOP federation, which already exists.
> > Moreover, it does *not* apply to HMS federation when using an authentication
> > mechanism other than IMPLICIT -- again, please see the other thread for
> > more discussion of this topic.
> >
> > On Fri, Jul 4, 2025 at 3:52 AM Robert Stupp <sn...@snazy.de> wrote:
> >
> > > I'd really prefer to not add "anything Hive" to Polaris itself, and I'd
> > > really like to see Hadoop being removed entirely from the Polaris code
> > > base.
> > >
> > > There are multiple reasons for this:
> > >
> > > 1. The Iceberg project would like to remove all catalogs except
> > > the REST catalog. (That's at least what I understood from discussions
> > > quite a while ago.)
> > >
> > > 2. Hadoop is quite behind in supporting recent Java versions. It is
> > > already impossible to run "anything Hadoop" with Java 24. Considering how
> > > long it took Hadoop to even support Java 11, it will take a long time until
> > > Hadoop is ready for Java 24+, especially since Hadoop has to refactor a
> > > lot of things. Polaris requires Java 21 and we know it works in CI with
> > > Java 22+23 (both are EOL). Hadoop only supports Java 11, not 17, not 21.
> > >
> > > 3. Hadoop (HDFS) has a very different security model, which is the
> > > reason why HDFS is not suitable for Polaris production configurations
> > > and is guarded by explicit configuration options.
> > >
> > > 4. Hive depends on Hadoop, so all concerns about Hadoop also apply to Hive.
> > >
> > > 5. Polaris is multi-tenant (realms). A _single_ instance of Hive
> > > contradicts this.
> > >
> > >
> > > My vote would be on *not* adding Hive and also on removing Hadoop entirely.
> > >
> > > If someone comes up with an Iceberg REST catalog for Hive or HDFS and
> > > Polaris can connect to it, that's fine for me, because it's outside of
> > > Polaris. But I strongly object to having Hadoop or even Hive in Polaris.
> > >
> > >
> > > On 7/1/25 20:48, Pooja Nilangekar wrote:
> > > > Hi all,
> > > >
> > > > I wanted to start a discussion around the support for Hive Catalog
> > > > federation in Polaris. In particular, there are two primary ways we can
> > > > add support for Hive federation:
> > > >
> > > > *1. Support a single Hive instance per Polaris deployment* The Hive
> > > > workflow would be identical to the Hadoop catalog workflow. Polaris
> > > > would invoke the Iceberg connection library, which would try to find the
> > > > hive-site.xml file in (1) the CLASSPATH and (2) the default Hadoop
> > > > locations: HADOOP_PATH and HADOOP_CONF_DIR. Polaris would then initialize
> > > > the Hive connection using the configurations it found at these locations.
> > > >
> > > >     - *Drawbacks:* The primary drawback of this approach is that if
> > > >       Polaris finds multiple hive-site.xml files, it would merge their
> > > >       configurations, which could lead to potentially inconsistent
> > > >       connection state. Furthermore, there is no clear documentation of
> > > >       the order in which the configuration would be applied. While this
> > > >       is often predictable on a given OS, it is not guaranteed across
> > > >       environments. The other key drawback is that if a Polaris user
> > > >       wants to federate to multiple Hive catalogs, their only option is
> > > >       to deploy a separate Polaris instance for each Hive instance.
> > > >
> > > > *2. Support multiple Hive instances per Polaris deployment* The alternate
> > > > (and in my view, ideal) solution is to allow Polaris to federate with
> > > > multiple Hive catalogs. To support multiple catalogs, Polaris would
> > > > explicitly disallow the connection library from reading hive-site.xml
> > > > files in the default paths. To pass in the configurations, Polaris can
> > > > adopt one of two options:
> > > >
> > > >     - *Option 2a: Accept a canonical path to the target hive-site.xml.*
> > > >
> > > >        - *Advantages:* This guarantees that the connection
> > > >          configurations are derived from a single source. It also allows
> > > >          Polaris to rely on the NONE/ENVIRONMENT/PROVIDER/UNMANAGED
> > > >          mechanism, making it especially useful in case the Hive instance
> > > >          relies on Kerberos or custom authentication that Polaris does
> > > >          not natively support/manage.
> > > >
> > > >        - *Drawbacks:* The user needs to have access (or some mechanism to
> > > >          upload files) to the Polaris server's file system.
> > > >
> > > >     - *Option 2b: Accept all the connection-specific parameters as a part
> > > >       of the create-catalog request.*
> > > >
> > > >        - *Advantage:* Polaris can directly accept and store the
> > > >          configurations in a DPO instead of relying on the user having
> > > >          access to the server's file system (to create/update
> > > >          hive-site.xml).
> > > >
> > > >        - *Drawback:* Polaris would need to manage the secrets. This is
> > > >          easy to support for certain authentication types (LDAP/Simple).
> > > >          However, it would preclude support for other authentication
> > > >          mechanisms, such as Kerberos or Custom.
> > > >
> > > > I prefer option 2a primarily because it provides the flexibility of
> > > > supporting multiple federated Hive catalogs while allowing Polaris to
> > > > support authentication that it does not natively manage. Please let me
> > > > know if you have any thoughts or feedback.
> > > >
> > > > Thanks,
> > > > Pooja
> > > >
> > > --
> > > Robert Stupp
> > > @snazy
> > >
> > >
> >
>
