Hi All,

Just an example for consideration (even using Locale.ROOT is still not
completely straightforward).

jshell> "view".toUpperCase(new
Locale("TR")).toLowerCase(Locale.ROOT).equals("view");
$7 ==> false

jshell> "view".toUpperCase(new Locale("TR")).toLowerCase(Locale.ROOT);
$8 ==> "vi̇ew"

Note: .toUpperCase(new Locale("TR")) is there only to obtain a Turkish char
since sending it by email may be problematic.

Note: the value of $8 looks exactly like "view" in some terminals / viewers
/ fonts.
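
For anyone curious what is actually inside $8, here are the same operations
as a small standalone snippet that prints the code points (the class name is
just for illustration):

import java.util.Locale;

public class DottedI {
    public static void main(String[] args) {
        // Turkish uppercasing maps 'i' to U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE);
        // lowercasing that with Locale.ROOT yields 'i' followed by U+0307 (COMBINING DOT ABOVE).
        String s = "view".toUpperCase(new Locale("tr")).toLowerCase(Locale.ROOT);
        s.codePoints().forEach(cp ->
                System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));
        System.out.println(s.equals("view")); // false, even though many fonts render it as "view"
    }
}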

Cheers,
Dmitri.

On Thu, Oct 23, 2025 at 5:22 PM Honah J. <[email protected]> wrote:

> Hi all,
>
> Thanks for all the valuable feedback and suggestions!
>
> What would be the default ? The
> > same as today preserving "case sensitivity" ? I guess the catalog
> > property about case sensitivity is set at catalog creation time and
> > can't be changed later (immutable), right ?
>
>
> The default behavior will still be "preserving case". The proposal just
> aims to provide an option to turn it to "case-insensitive". And yes, it
> will be immutable after catalog creation.
>
> i18n, locale related concern
>
> That’s a really good point. As mentioned, the safest choice here is to use
> Locale.ROOT, though it can still cause issues for certain languages such as
> Turkish or German. IMHO, this is an acceptable limitation, since common
> identifiers produced by engines [1][2][3] consist only of letters, digits,
> and underscores. This approach
> also aligns with Iceberg’s column case-insensitive lookup [4], which
> likewise relies on Locale.ROOT. I’ll make sure to include these details in
> the document.
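>
> To make the intended behavior concrete, normalization on the catalog side
> would essentially be a single-locale case fold, roughly like the sketch
> below (names are illustrative, not the actual Polaris API):
>
> import java.util.Locale;
>
> final class IdentifierNormalizer {
>     // Fold an identifier for case-insensitive storage and lookup.
>     // Locale.ROOT keeps the result independent of the server's default
>     // locale, at the cost of the Turkish/German edge cases discussed above.
>     static String normalize(String identifier) {
>         return identifier.toLowerCase(Locale.ROOT);
>     }
> }
>
> With this, "Employee", "EMPLOYEE" and "employee" all map to the same stored
> key; the Turkish dotted-I example earlier in the thread is the kind of
> input where such a fold gets awkward.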
>
> all these conversions wouldn't be better handled on the client
> > side, the query engine/Iceberg?
>
> Good question! The Iceberg SDK itself doesn’t handle table or namespace
> identifiers — that responsibility lies with the catalog. In multi-engine
> environments, engines like Spark, Trino, Flink, and DuckDB each apply their
> own rules for identifier normalization: some are case-sensitive, others are
> not. Even if every engine eventually provides a case-sensitivity switch,
> users would still need to keep those configurations consistent across all
> engines they use. This is error-prone and can easily lead to situations
> where both Employee and EMPLOYEE exist in the same catalog, causing
> ambiguity errors in case-insensitive engines. By letting the catalog, as
> the source of truth, handle normalization, we can remove that burden from
> users.
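>
> As a sketch of the conflict detection this enables (again, hypothetical
> names, not the real implementation), the catalog can keep the original
> spelling for display while comparing on the normalized form:
>
> import java.util.Locale;
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> final class CaseInsensitiveTableRegistry {
>     // normalized name -> name as originally created
>     private final Map<String, String> tables = new ConcurrentHashMap<>();
>
>     // Returns false if a table differing only by case already exists.
>     boolean register(String name) {
>         return tables.putIfAbsent(name.toLowerCase(Locale.ROOT), name) == null;
>     }
> }
>
> so creating EMPLOYEE after Employee is rejected up front instead of
> producing two entries that a case-insensitive engine cannot disambiguate.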
>
> I will update the design doc soon! Please let me know if you have any other
> questions/concerns/suggestions!
>
> [1] Flink/Calcite Identifier rules:
> https://calcite.apache.org/docs/reference.html#identifiers
> [2] Trino Identifier rules:
> https://trino.io/docs/current/language/reserved.html#language-identifiers
> [3] Spark Regular Identifier:
> https://spark.apache.org/docs/latest/sql-ref-identifier.html#regular-identifier
> [4] Iceberg column case insensitive lookup:
> https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/Schema.java#L401
>
> Best regards,
> Jonas
>
> On Tue, Oct 21, 2025 at 4:06 AM Robert Stupp <[email protected]> wrote:
>
> > Hi all,
> >
> > IIRC there are a lot of differences between languages in upper/lower-casing
> > and comparison rules. NB: java.text.Collator gives an impression of the
> > possible options for string comparisons.
> >
> > NB: SQL databases also use collators (there are small but important
> > differences even between Postgres and Postgres-compatible databases).
> > Those collators influence how databases handle string values and indexes.
> > One "annoyance" is that some collators "fold" adjacent whitespace
> > (e.g. "a    1" and "a 1" are different strings for Java's
> > String.compareTo() but compare as equal under such collators), and some
> > collators treat actually different characters as equal (for example the
> > German a-umlaut 'ä' equal to the us-ascii 'a').
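> >
> > For a quick feel of that, java.text.Collator's strength setting makes
> > those equivalences explicit. A small illustration (exact results depend
> > on the JDK's collation tables):
> >
> > import java.text.Collator;
> > import java.util.Locale;
> >
> > public class CollatorDemo {
> >     public static void main(String[] args) {
> >         Collator de = Collator.getInstance(Locale.GERMAN);
> >         de.setStrength(Collator.PRIMARY); // ignore case and accent differences
> >         System.out.println(de.compare("ä", "a"));       // typically 0 at PRIMARY strength
> >         System.out.println(de.compare("VIEW", "view")); // typically 0: case is a tertiary difference
> >     }
> > }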
> >
> > As Alex mentioned, the clients (may) have the user's locale. I wonder
> > whether all these conversions wouldn't be better handled on the client
> > side, the query engine/Iceberg?
> >
> > On Mon, Oct 20, 2025 at 6:32 PM Alexandre Dutra <[email protected]>
> wrote:
> > >
> > > Hi Jonas,
> > >
> > > Thanks for the proposal. I left a comment in the doc but since Dmitri
> > > also brought up the issue with i18n support, let me expand on my
> > > comment here:
> > >
> > > Indeed case transformation is a complex operation in some languages,
> > > so we should always use the appropriate locale.
> > >
> > > But this information won't be available at the moment when the
> > > conversion is done, so the safest choice is to go with Locale.ROOT.
> > >
> > > And indeed, as Dmitri pointed out, such a locale is known to create
> > > problems in many languages. When going from upper to lower case, we'd
> > > see issues e.g. in Turkish (dotted i), German (SS -> ß) and Greek (Σ
> > > -> σ/ς).
> > >
> > > That said, the conversion wouldn't throw any errors – in Java,
> > > String.toUpperCase(Locale) never throws. It would just yield an
> > > awkward result.
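> > >
> > > A couple of concrete "awkward but non-throwing" results, easy to check
> > > in jshell (expected output shown in the comments):
> > >
> > > System.out.println("ß".toUpperCase(Locale.ROOT));         // "SS"
> > > System.out.println("SS".toLowerCase(Locale.ROOT));        // "ss": the ß is not recoverable
> > > System.out.println("quiz".toUpperCase(new Locale("tr"))); // "QUİZ": dotted capital I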
> > >
> > > If we are OK with this limitation, then I don't see any major blockers
> > > in the proposal wrt to i18n handling.
> > >
> > > Thanks,
> > > Alex
> > >
> > > On Mon, Oct 20, 2025 at 5:00 PM Dmitri Bourlatchkov <[email protected]>
> > wrote:
> > > >
> > > > Hi Jonas,
> > > >
> > > > Thanks for the proposal! I added some comments in the docs, but I'd
> > like to
> > > > emphasize my biggest concern here as well.
> > > >
> > > > When we talk about upper/lower-casing we have to know the locale, in
> > which
> > > > that operation is to be performed.
> > > >
> > > > Using a specific locale, we have to declare a particular natural
> > language
> > > > context. Now, the question is how do we deal with identifiers that
> > > > can contain Unicode characters from different languages?
> > > >
> > > > Tip of the "iceberg" :) :
> > https://github.com/apache/iceberg/issues/9276
> > > >
> > > > Thanks,
> > > > Dmitri.
> > > >
> > > > On Fri, Oct 17, 2025 at 7:26 PM Honah J. <[email protected]> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I would like to start a discussion around supporting an option to
> > make
> > > > > catalog case insensitive.
> > > > >
> > > > > In multi-engine data lake environments, different engines (Spark,
> > Trino,
> > > > > Flink, etc.) apply different casing and normalization rules when
> > reading or
> > > > > writing identifiers. As a result, the same logical table may be
> > interpreted
> > > > > differently across engines. For example, Polaris currently
> preserves
> > > > > identifier casing, so a table created by Spark with mixed-case
> names
> > may
> > > > > not be discoverable from Trino, which lowercases identifiers. This
> > > > > inconsistency burdens users and undermines script portability.
> > > > >
> > > > > I drafted a proposal[1] with more details and a solution:
> > introducing an
> > > > > immutable catalog property to store and look up namespaces, tables,
> > and
> > > > > other objects case-insensitively.
> > > > >
> > > > > I’d love to hear your feedback and suggestions!
> > > > >
> > > > > [1]
> > > > > https://docs.google.com/document/d/1-3ywobpRvgdHPhe0J4w7l6t4NX79iqaeFOohCXG_12U/edit?usp=sharing
> > > > >
> > > > > Best regards,
> > > > > Jonas
> > > > >
> >
>
