Re: [DISCUSS][Proposal] Case-Insensitive Mode for Polaris Catalogs

Honah J. Thu, 06 Nov 2025 13:36:06 -0800

Sorry for the late update.

Thanks everyone for your valuable input during the Oct 30 community sync!
We discussed internationalization (i18n) and agreed that the Locale.ROOT
limitation is acceptable for now, since it covers the most common case of
ASCII-only identifiers.
That said, we’ll need clear documentation and tests to catch any
incompatibilities. I’ll update the proposal with more details and continue
working on the PoC.
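
As a rough illustration of the kind of test meant here (a sketch only, not
taken from the proposal or the PoC; the class and method names are
hypothetical), one could check that ASCII identifiers collapse to a single
key under Locale.ROOT while flagging identifiers outside the
letters/digits/underscore set:

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Polaris or Iceberg.
final class IdentifierCaseCheck {

    private IdentifierCaseCheck() {}

    /** Lowercases with Locale.ROOT, the locale discussed in this thread. */
    static String normalize(String identifier) {
        return identifier.toLowerCase(Locale.ROOT);
    }

    /** True if the identifier is non-empty and uses only ASCII letters, digits, and underscores. */
    static boolean isPlainAscii(String identifier) {
        return !identifier.isEmpty()
            && identifier.chars().allMatch(c ->
                (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '_');
    }

    public static void main(String[] args) {
        // ASCII identifiers that differ only by case collapse to one key.
        System.out.println(normalize("Employee").equals(normalize("EMPLOYEE"))); // true

        // U+0130 (Turkish capital dotted I) lowercases to "i" + U+0307 under
        // Locale.ROOT, so the result is not the ASCII string "view".
        String turkish = "V\u0130EW";
        System.out.println(normalize(turkish).equals("view")); // false
        System.out.println(isPlainAscii(turkish));             // false
        System.out.println(isPlainAscii("my_table_1"));        // true
    }
}
```

A check like isPlainAscii could back the documentation by marking which
identifiers fall outside the range where the behavior is expected to be
unsurprising. Whether such identifiers would be rejected or merely
documented is left open here.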


We also talked about possible long-term directions, such as introducing a
new config or quoted identifier mechanism in the IRC spec—let’s track that
separately.

Best regards,
Jonas

On Thu, Oct 23, 2025 at 4:57 PM Dmitri Bourlatchkov <[email protected]>
wrote:

> Hi All,
>
> Just an example for consideration (even using Locale.ROOT is still not
> completely straightforward).
>
> jshell> "view".toUpperCase(new Locale("TR")).toLowerCase(Locale.ROOT).equals("view");
> $7 ==> false
>
> jshell> "view".toUpperCase(new Locale("TR")).toLowerCase(Locale.ROOT);
> $8 ==> "vi̇ew"
>
> Note: .toUpperCase(new Locale("TR")) is there only to obtain a Turkish char
> since sending it by email may be problematic.
>
> Note: the value of $8 looks exactly like "view" in some terminals / viewers
> / fonts.
>
> Cheers,
> Dmitri.
>
> On Thu, Oct 23, 2025 at 5:22 PM Honah J. <[email protected]> wrote:
>
> > Hi all,
> >
> > Thanks for all the valuable feedback and suggestions!
> >
> > > What would be the default? The same as today, preserving "case
> > > sensitivity"? I guess the catalog property about case sensitivity is
> > > set at catalog creation time and can't be changed later (immutable),
> > > right?
> >
> >
> > The default behavior will still be "preserving case". The proposal just
> > aims to provide an option to turn it to "case-insensitive". And yes, it
> > will be immutable after catalog creation.
> >
> > i18n, locale related concern
> >
> > That’s a really good point. As mentioned, the safest choice here is to use
> > Locale.ROOT, though it can still cause issues for certain languages such as
> > Turkish or German. IMHO, this is an acceptable limitation since common
> > identifiers produced by engines [1][2][3] consist only of letters, digits,
> > and underscores. This approach also aligns with Iceberg’s column
> > case-insensitive lookup [4], which likewise relies on Locale.ROOT. I’ll
> > make sure to include these details in the document.
> >
> > > all these conversions wouldn't be better handled on the client
> > > side, the query engine/Iceberg?
> >
> > Good question! The Iceberg SDK itself doesn’t handle table or namespace
> > identifiers; that responsibility lies with the catalog. In multi-engine
> > environments, engines like Spark, Trino, Flink, and DuckDB each apply their
> > own rules for identifier normalization: some are case-sensitive, others are
> > not. Even if every engine eventually provides a case-sensitivity switch,
> > users would still need to keep those configurations consistent across all
> > engines they use. This is error-prone and can easily lead to situations
> > where both Employee and EMPLOYEE exist in the same catalog, causing
> > ambiguity errors in case-insensitive engines. By letting the catalog, as
> > the source of truth, handle normalization, we can remove that burden from
> > users.
> >
> > I will update the design doc soon! Please let me know if you have any other
> > questions/concerns/suggestions!
> >
> > [1] Flink/Calcite identifier rules:
> > https://calcite.apache.org/docs/reference.html#identifiers
> > [2] Trino identifier rules:
> > https://trino.io/docs/current/language/reserved.html#language-identifiers
> > [3] Spark regular identifier:
> > https://spark.apache.org/docs/latest/sql-ref-identifier.html#regular-identifier
> > [4] Iceberg column case-insensitive lookup:
> > https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/Schema.java#L401
> >
> > Best regards,
> > Jonas
> >
> > On Tue, Oct 21, 2025 at 4:06 AM Robert Stupp <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > IIRC there were a lot of different behaviors between upper/lower case
> > > and comparison rules in different languages. NB:
> > > java.text.Collator gives an impression of the possible options
> > > for string comparisons.
> > >
> > > NB: SQL databases also use collators (there are small but important
> > > differences even between all Postgres and compatible databases).
> > > Those collators influence how databases handle string values and
> > > indexes.
> > > One "annoyance" is that some collators "fold" adjacent whitespaces
> > > (e.g. "a    1" > "a 2" is true in Java's String.compareTo() but
> > > "false" with some collators), some collators treat actually different
> > > characters as equal (for example the German a-umlaut 'ä' equal to the
> > > us-ascii 'a').
> > >
> > > As Alex mentioned, the clients (may) have the user's locale. I wonder
> > > whether all these conversions wouldn't be better handled on the client
> > > side, the query engine/Iceberg?
> > >
> > > > On Mon, Oct 20, 2025 at 6:32 PM Alexandre Dutra <[email protected]> wrote:
> > > >
> > > > Hi Jonas,
> > > >
> > > > Thanks for the proposal. I left a comment in the doc but since Dmitri
> > > > also brought up the issue with i18n support, let me expand on my
> > > > comment here:
> > > >
> > > > Indeed case transformation is a complex operation in some languages,
> > > > so we should always use the appropriate locale.
> > > >
> > > > But this information won't be available at the moment when the
> > > > conversion is done, so the safest choice is to go with Locale.ROOT.
> > > >
> > > > And indeed, as Dmitri pointed out, such a locale is known to create
> > > > problems in many languages. When going from upper to lower case, we'd
> > > > see issues e.g. in Turkish (dotted i), German (SS -> ß) and Greek (Σ
> > > > -> σ/ς).
> > > >
> > > > That said, the conversion wouldn't throw any errors – in Java,
> > > > String.toUpperCase(Locale) never throws. It would just yield an
> > > > awkward result.
> > > >
> > > > > If we are OK with this limitation, then I don't see any major blockers
> > > > > in the proposal wrt i18n handling.
> > > >
> > > > Thanks,
> > > > Alex
> > > >
> > > > > On Mon, Oct 20, 2025 at 5:00 PM Dmitri Bourlatchkov <[email protected]> wrote:
> > > > >
> > > > > Hi Jonas,
> > > > >
> > > > > > Thanks for the proposal! I added some comments in the docs, but I'd like
> > > > > > to emphasize my biggest concern here as well.
> > > > >
> > > > > > When we talk about upper/lower-casing we have to know the locale in
> > > > > > which that operation is to be performed.
> > > > > >
> > > > > > Using a specific locale, we have to declare a particular natural
> > > > > > language context. Now, the question is how do we deal with identifiers
> > > > > > that can contain Unicode characters from different languages?
> > > > >
> > > > > > Tip of the "iceberg" :) : https://github.com/apache/iceberg/issues/9276
> > > > >
> > > > > Thanks,
> > > > > Dmitri.
> > > > >
> > > > > > On Fri, Oct 17, 2025 at 7:26 PM Honah J. <[email protected]> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > > I would like to start a discussion around supporting an option to
> > > > > > > make a catalog case-insensitive.
> > > > > >
> > > > > > > In multi-engine data lake environments, different engines (Spark,
> > > > > > > Trino, Flink, etc.) apply different casing and normalization rules
> > > > > > > when reading or writing identifiers. As a result, the same logical
> > > > > > > table may be interpreted differently across engines. For example,
> > > > > > > Polaris currently preserves identifier casing, so a table created by
> > > > > > > Spark with mixed-case names may not be discoverable from Trino, which
> > > > > > > lowercases identifiers. This inconsistency burdens users and
> > > > > > > undermines script portability.
> > > > > >
> > > > > > > I drafted a proposal[1] with more details and a solution: introducing
> > > > > > > an immutable catalog property to store and look up namespaces, tables,
> > > > > > > and other objects case-insensitively.
> > > > > >
> > > > > > I’d love to hear your feedback and suggestions!
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > >
> >
> https://docs.google.com/document/d/1-3ywobpRvgdHPhe0J4w7l6t4NX79iqaeFOohCXG_12U/edit?usp=sharing
> > > > > > <
> > > > > >
> > >
> >
> https://docs.google.com/document/d/1-3ywobpRvgdHPhe0J4w7l6t4NX79iqaeFOohCXG_12U/edit?usp=sharing
> > > > > > >
> > > > > >
> > > > > > Best regards,
> > > > > > Jonas
> > > > > >
