Re: [DISCUSS][Proposal] Case-Insensitive Mode for Polaris Catalogs

Honah J. Thu, 23 Oct 2025 14:20:55 -0700

Hi all,

Thanks for all the valuable feedback and suggestions!


> What would be the default ? The
> same as today preserving "case sensitivity" ? I guess the catalog
> property about case sensitivity is set at catalog creation time and
> can't be changed later (immutable), right ?


The default behavior will still be "preserving case". The proposal only adds
an option to switch a catalog to "case-insensitive" mode. And yes, the
property will be immutable after catalog creation.

> i18n, locale-related concern

That’s a really good point. As mentioned, the safest choice here is to use
Locale.ROOT, though it can still cause issues for certain languages such as
Turkish or German. IMHO, this is an
acceptable limitation, since common identifiers produced by engines
[1][2][3] consist only of letters, digits, and underscores. This approach
also aligns with Iceberg’s column case-insensitive lookup [4], which
likewise relies on Locale.ROOT. I’ll make sure to include these details in
the document.
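
A minimal sketch of the Locale.ROOT folding discussed above, using only the
standard java.util.Locale and String.toLowerCase(Locale) APIs. The class and
method names (LocaleFoldingDemo, fold) are hypothetical and exist only for
illustration; this is not code from Polaris or from the design doc.

```java
import java.util.Locale;

// Illustrative only: shows why the locale passed to toLowerCase matters.
final class LocaleFoldingDemo {

    // Hypothetical helper: locale-independent folding of an identifier.
    static String fold(String identifier) {
        return identifier.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Locale.ROOT maps 'I' to ASCII 'i' regardless of the JVM default locale.
        String rootFolded = fold("TITLE");

        // Turkish rules map 'I' to dotless 'ı' (U+0131), so the result differs.
        String turkishFolded = "TITLE".toLowerCase(turkish);

        System.out.println(rootFolded);
        System.out.println(turkishFolded);
        System.out.println(rootFolded.equals(turkishFolded));

        // Identifiers made of ASCII letters, digits, and underscores fold the
        // same way under Locale.ROOT no matter where the server runs.
        System.out.println(fold("Order_Items_2025"));
    }
}
```

Under these assumptions, the two folded strings are not equal, which is the
kind of mismatch a fixed Locale.ROOT avoids on the server side, at the cost of
not matching Turkish users' expectations for 'I'.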

> all these conversions wouldn't be better handled on the client
> side, the query engine/Iceberg?

Good question! The Iceberg SDK itself doesn’t handle table or namespace
identifiers — that responsibility lies with the catalog. In multi-engine
environments, engines like Spark, Trino, Flink, and DuckDB each apply their
own rules for identifier normalization: some are case-sensitive, others are
not. Even if every engine eventually provides a case-sensitivity switch,
users would still need to keep those configurations consistent across all
engines they use. This is error-prone and can easily lead to situations
where both Employee and EMPLOYEE exist in the same catalog, causing
ambiguity errors in case-insensitive engines. By letting the catalog, as
the source of truth, handle normalization, we can remove that burden from
users.
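
To make the catalog-side argument concrete, here is a rough sketch of an
in-memory stand-in for a catalog that normalizes identifiers itself. Everything
in it is an assumption made for illustration: the class name
CaseInsensitiveCatalogSketch, the createTable/loadTable methods, and the choice
to key entries by the lowercased name are hypothetical and do not reflect the
actual Polaris implementation or the design doc.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory stand-in for a catalog; not Polaris code.
final class CaseInsensitiveCatalogSketch {

    // Fixed at construction, mirroring the immutable catalog property.
    private final boolean caseInsensitive;
    private final Map<String, String> tables = new HashMap<>();

    CaseInsensitiveCatalogSketch(boolean caseInsensitive) {
        this.caseInsensitive = caseInsensitive;
    }

    // Assumption: normalize by lowercasing with Locale.ROOT when enabled.
    private String key(String name) {
        return caseInsensitive ? name.toLowerCase(Locale.ROOT) : name;
    }

    // Returns false if an equivalent identifier already exists.
    boolean createTable(String name, String metadataLocation) {
        return tables.putIfAbsent(key(name), metadataLocation) == null;
    }

    // Returns the stored location, or null if no such table exists.
    String loadTable(String name) {
        return tables.get(key(name));
    }

    public static void main(String[] args) {
        CaseInsensitiveCatalogSketch catalog = new CaseInsensitiveCatalogSketch(true);

        // The first create succeeds; the second collides after normalization,
        // so Employee and EMPLOYEE cannot coexist in this catalog.
        System.out.println(catalog.createTable("Employee", "location-1"));
        System.out.println(catalog.createTable("EMPLOYEE", "location-2"));

        // Any casing resolves to the same entry, whatever the engine sends.
        System.out.println(catalog.loadTable("employee"));
    }
}
```

With the flag set to false, the same sketch keeps today's case-preserving
behavior, where Employee and EMPLOYEE are distinct entries.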

I will update the design doc soon! Please let me know if you have any other
questions/concerns/suggestions!

[1] Flink/Calcite Identifier rules:
https://calcite.apache.org/docs/reference.html#identifiers
[2] Trino Identifier rules:
https://trino.io/docs/current/language/reserved.html#language-identifiers
[3] Spark Regular Identifier:
https://spark.apache.org/docs/latest/sql-ref-identifier.html#regular-identifier
[4] Iceberg column case insensitive lookup:
https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/Schema.java#L401

Best regards,
Jonas

On Tue, Oct 21, 2025 at 4:06 AM Robert Stupp <[email protected]> wrote:

> Hi all,
>
> IIRC there were a lot of different behaviors between upper/lower case
> and comparison rules in different languages. NB:
> java.text.Collator gives an impression of the possible options
> for string comparisons.
>
> NB: SQL databases also use collators (there are small but important
> differences even between all Postgres and compatible databases).
> Those collators influence how databases handle string values and indexes.
> One "annoyance" is that some collators "fold" adjacent whitespaces
> (e.g. "a    1" > "a 2" is true in Java's String.compareTo() but
> "false" with some collators), some collators treat actually different
> characters as equal (for example the German a-umlaut 'ä' equal to the
> us-ascii 'a').
>
> As Alex mentioned, the clients (may) have the user's locale. I wonder
> whether all these conversions wouldn't be better handled on the client
> side, the query engine/Iceberg?
>
> On Mon, Oct 20, 2025 at 6:32 PM Alexandre Dutra <[email protected]> wrote:
> >
> > Hi Jonas,
> >
> > Thanks for the proposal. I left a comment in the doc but since Dmitri
> > also brought up the issue with i18n support, let me expand on my
> > comment here:
> >
> > Indeed case transformation is a complex operation in some languages,
> > so we should always use the appropriate locale.
> >
> > But this information won't be available at the moment when the
> > conversion is done, so the safest choice is to go with Locale.ROOT.
> >
> > And indeed, as Dmitri pointed out, such a locale is known to create
> > problems in many languages. When going from upper to lower case, we'd
> > see issues e.g. in Turkish (dotted i), German (SS -> ß) and Greek (Σ
> > -> σ/ς).
> >
> > That said, the conversion wouldn't throw any errors – in Java,
> > String.toUpperCase(Locale) never throws. It would just yield an
> > awkward result.
> >
> > If we are OK with this limitation, then I don't see any major blockers
> > in the proposal wrt to i18n handling.
> >
> > Thanks,
> > Alex
> >
> > On Mon, Oct 20, 2025 at 5:00 PM Dmitri Bourlatchkov <[email protected]>
> wrote:
> > >
> > > Hi Jonas,
> > >
> > > Thanks for the proposal! I added some comments in the docs, but I'd
> like to
> > > emphasize my biggest concern here as well.
> > >
> > > When we talk about upper/lower-casing we have to know the locale, in
> which
> > > that operation is to be performed.
> > >
> > > Using a specific locale, we have to declare a particular natural
> language
> > > context. Now, the question is how do we deal with identifiers that can
> > > contain Unicode characters from different languages?
> > >
> > > Tip of the "iceberg" :) :
> https://github.com/apache/iceberg/issues/9276
> > >
> > > Thanks,
> > > Dmitri.
> > >
> > > On Fri, Oct 17, 2025 at 7:26 PM Honah J. <[email protected]> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I would like to start a discussion around supporting an option to
> make
> > > > catalog case insensitive.
> > > >
> > > > In multi-engine data lake environments, different engines (Spark,
> Trino,
> > > > Flink, etc.) apply different casing and normalization rules when
> reading or
> > > > writing identifiers. As a result, the same logical table may be
> interpreted
> > > > differently across engines. For example, Polaris currently preserves
> > > > identifier casing, so a table created by Spark with mixed-case names
> may
> > > > not be discoverable from Trino, which lowercases identifiers. This
> > > > inconsistency burdens users and undermines script portability.
> > > >
> > > > I drafted a proposal[1] with more details and a solution:
> introducing an
> > > > immutable catalog property to store and look up namespaces, tables,
> and
> > > > other objects case‑insensitively.
> > > >
> > > > I’d love to hear your feedback and suggestions!
> > > >
> > > > [1]
> > > >
> > > >
> https://docs.google.com/document/d/1-3ywobpRvgdHPhe0J4w7l6t4NX79iqaeFOohCXG_12U/edit?usp=sharing
> > > > <
> > > >
> https://docs.google.com/document/d/1-3ywobpRvgdHPhe0J4w7l6t4NX79iqaeFOohCXG_12U/edit?usp=sharing
> > > > >
> > > >
> > > > Best regards,
> > > > Jonas
> > > >
>
