Hi all,

IIRC there are a lot of differences in upper/lower-casing and
comparison rules between languages. java.text.Collator gives an
impression of the possible options for string comparisons.
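
To illustrate the case-mapping side, here is a minimal Java sketch (the
class and method names are made up for this mail, they are not taken from
Polaris or the proposal). It shows that the result of lower-casing depends
on the locale and that upper-casing is not always reversible:

```java
import java.util.Locale;

public class CaseMappingDemo {
    // Locale-neutral lower-casing, i.e. what a server without user context could do.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Lower-casing as a client running with a Turkish locale would do it.
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    // Locale-neutral upper-casing.
    static String upperRoot(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(lowerRoot("TITLE"));     // title
        System.out.println(lowerTurkish("TITLE"));  // tıtle (dotless ı, U+0131)

        String street = "stra\u00dfe";              // "straße"
        String upper = upperRoot(street);
        System.out.println(upper);                  // STRASSE
        System.out.println(lowerRoot(upper));       // strasse, not the original
    }
}
```

So a name normalized with Locale.ROOT may not match what a client with a
Turkish locale produces for the same input, and an upper/lower round trip
can change the identifier.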

NB: SQL databases also use collators (there are small but important
differences even among Postgres and Postgres-compatible databases).
Those collators influence how databases compare and index string values.
One "annoyance" is that some collators "fold" adjacent whitespace
(e.g. "a    2" < "a 1" is true with Java's String.compareTo() but
false with some collators), and some collators treat actually different
characters as equal (for example the German a-umlaut 'ä' being equal to
the US-ASCII 'a').
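
A rough Java sketch of the comparison side, assuming the JDK's built-in
java.text.Collator rules for German (the class and method names are
hypothetical, and other collator implementations, e.g. the ones used by
databases, may well behave differently):

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorDemo {
    // German collator that only looks at base letters (PRIMARY strength),
    // so accent and case differences are ignored.
    static Collator primaryGerman() {
        Collator c = Collator.getInstance(Locale.GERMAN);
        c.setStrength(Collator.PRIMARY);
        return c;
    }

    // True if the PRIMARY-strength collator considers both strings equal.
    static boolean sameForCollator(String a, String b) {
        return primaryGerman().compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00e4"; // 'ä'

        // Plain String comparison is by UTF-16 code unit: the strings differ.
        System.out.println(aUmlaut.equals("a"));           // false

        // The collator is asked whether it sees the two as the same base letter.
        System.out.println(sameForCollator(aUmlaut, "a"));

        // compareTo() sees the third character: ' ' (0x20) sorts before '1' (0x31).
        System.out.println("a    2".compareTo("a 1") < 0); // true
    }
}
```

The point is only that "equal" and "less than" depend on which comparison
is used, so two systems can disagree about whether two names collide.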

As Alex mentioned, the clients (may) have the user's locale. I wonder
whether all these conversions wouldn't be better handled on the client
side, i.e. in the query engine/Iceberg?

On Mon, Oct 20, 2025 at 6:32 PM Alexandre Dutra <[email protected]> wrote:
>
> Hi Jonas,
>
> Thanks for the proposal. I left a comment in the doc but since Dmitri
> also brought up the issue with i18n support, let me expand on my
> comment here:
>
> Indeed case transformation is a complex operation in some languages,
> so we should always use the appropriate locale.
>
> But this information won't be available at the moment when the
> conversion is done, so the safest choice is to go with Locale.ROOT.
>
> And indeed, as Dmitri pointed out, such a locale is known to create
> problems in many languages. When going from upper to lower case, we'd
> see issues e.g. in Turkish (dotted i), German (SS -> ß) and Greek (Σ
> -> σ/ς).
>
> That said, the conversion wouldn't throw any errors – in Java,
> String.toUpperCase(Locale) never throws. It would just yield an
> awkward result.
>
> If we are OK with this limitation, then I don't see any major blockers
> in the proposal wrt i18n handling.
>
> Thanks,
> Alex
>
> On Mon, Oct 20, 2025 at 5:00 PM Dmitri Bourlatchkov <[email protected]> wrote:
> >
> > Hi Jonas,
> >
> > Thanks for the proposal! I added some comments in the docs, but I'd like to
> > emphasize my biggest concern here as well.
> >
> > When we talk about upper/lower-casing we have to know the locale, in which
> > that operation is to be performed.
> >
> > Using a specific locale, we have to declare a particular natural language
> > context. Now, the question is how do we deal with identifiers that can
> > contain Unicode characters from different languages?
> >
> > Tip of the "iceberg" :) : https://github.com/apache/iceberg/issues/9276
> >
> > Thanks,
> > Dmitri.
> >
> > On Fri, Oct 17, 2025 at 7:26 PM Honah J. <[email protected]> wrote:
> >
> > > Hi everyone,
> > >
> > > I would like to start a discussion around supporting an option to make
> > > a catalog case-insensitive.
> > >
> > > In multi-engine data lake environments, different engines (Spark, Trino,
> > > Flink, etc.) apply different casing and normalization rules when reading
> > > or writing identifiers. As a result, the same logical table may be
> > > interpreted differently across engines. For example, Polaris currently
> > > preserves identifier casing, so a table created by Spark with mixed-case
> > > names may not be discoverable from Trino, which lowercases identifiers.
> > > This inconsistency burdens users and undermines script portability.
> > >
> > > I drafted a proposal[1] with more details and a solution: introducing an
> > > immutable catalog property to store and look up namespaces, tables, and
> > > other objects case-insensitively.
> > >
> > > I’d love to hear your feedback and suggestions!
> > >
> > > [1] https://docs.google.com/document/d/1-3ywobpRvgdHPhe0J4w7l6t4NX79iqaeFOohCXG_12U/edit?usp=sharing
> > >
> > > Best regards,
> > > Jonas
> > >
