Re: [Discussion] Collation Support

Andrei Tserakhau via dev Sat, 28 Mar 2026 17:27:33 -0700

Hi Alexander,

This looks really interesting. We've been working on collation support in
Delta and have shipped it in production for some time, so this is an area
we care about a lot. If this proposal moves forward we'd be happy to
collaborate on the design and implementation.

The pseudo-field approach for collation metrics is clean and composes well
with existing Iceberg infrastructure. The specifier coverage is
comprehensive.

A few areas worth discussing as this evolves:

1 - Sort key stability and versioning

ICU sort keys are not stable across versions, so bumping the pinned ICU
version in a future Iceberg release would invalidate all existing collation
metrics. In multi-engine environments, requiring all engines to converge on
one ICU version is unrealistic.

We store original string values instead of sort keys and allow per-file
version annotations -- worth discussing whether something similar could
work here.
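
To make the idea concrete, here is a minimal Python sketch of how a reader
might use such bounds. It is an illustration under stated assumptions, not
our Delta implementation: CollatedBounds, may_contain and the "v1"/"v2"
version labels are hypothetical names, and str.casefold stands in for a
real collator's sort-key function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CollatedBounds:
    """Hypothetical per-file metrics: original strings, not sort keys."""
    lower: str
    upper: str
    collation: str
    provider_version: str  # version of the collation provider used by the writer


def may_contain(bounds: CollatedBounds, value: str, reader_version: str,
                sort_key: Callable[[str], str]) -> bool:
    """Return False only when the file can safely be skipped."""
    if bounds.provider_version != reader_version:
        # Ordering may differ between provider versions, so the stored
        # bounds cannot be trusted for pruning. Keep the file.
        return True
    key = sort_key(value)
    # Sort keys are recomputed by the reader from the original strings.
    return sort_key(bounds.lower) <= key <= sort_key(bounds.upper)


# str.casefold is only a stand-in for a case-insensitive collator.
bounds = CollatedBounds("apple", "Mango", "ci-standin", "v1")
```

With matching versions the reader prunes normally; with a mismatch it
falls back to scanning the file instead of returning wrong results.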

2 - Provider abstraction

The proposal assumes ICU as the sole provider, but Spark ships non-ICU
collations like UTF8_LCASE that are widely used. A provider or namespace
layer would prevent name collisions and support engine-specific collations
without future spec changes.
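
As a rough sketch of what a provider layer could look like (the
"provider:name" syntax, the engine_a/engine_b providers and the resolve
helper are all made up for illustration and are not part of the proposal
or of any engine), the same short name can carry different semantics under
different providers without colliding:

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry keyed by (provider, collation name).
# str.lower and str.casefold are stand-ins for two engines' comparison rules.
REGISTRY: Dict[Tuple[str, str], Callable[[str], str]] = {
    ("engine_a", "ci"): str.lower,
    ("engine_b", "ci"): str.casefold,
}


def parse_collation(spec: str, default_provider: str = "engine_a") -> Tuple[str, str]:
    """Split 'provider:name'; a bare name falls back to the default provider."""
    provider, sep, name = spec.partition(":")
    if not sep:
        return default_provider, spec
    return provider, name


def resolve(spec: str) -> Callable[[str], str]:
    """Look up the key function; raises KeyError for unknown collations."""
    return REGISTRY[parse_collation(spec)]


def collation_equal(spec: str, a: str, b: str) -> bool:
    key = resolve(spec)
    return key(a) == key(b)
```

Here "Straße" and "STRASSE" compare equal under engine_b:ci but not under
engine_a:ci, which is exactly the kind of divergence an unqualified name
would hide.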

3 - Operational surface

A few things that turned out correctness-critical in our implementation:
partition transforms on collated columns (collation-equal but byte-distinct
values in different directories), sort order semantics, equality deletes
under collation, and Parquet filter pushdown (must be disabled since
Parquet has no collation concept).

These don't all need to be solved in v1, but it would help to scope them.
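
Two of the hazards above can be reproduced in a few lines of Python. This
is a toy sketch: partition_by and stats_may_match are hypothetical helpers,
and str.casefold is a stand-in for a case-insensitive collation, not a
real collator.

```python
from collections import defaultdict


def partition_by(values, key):
    """Group values by a partition key, like an identity partition transform."""
    parts = defaultdict(list)
    for v in values:
        parts[key(v)].append(v)
    return dict(parts)


def stats_may_match(lo, hi, literal, key=lambda s: s.encode("utf-8")):
    """Min/max pruning check: True when [lo, hi] could contain the literal."""
    return key(lo) <= key(literal) <= key(hi)


cities = ["Berlin", "berlin", "BERLIN", "Paris"]

# Byte-wise partitioning scatters collation-equal values across partitions.
byte_parts = partition_by(cities, lambda s: s.encode("utf-8"))
# Partitioning on the collation key keeps them together.
collated_parts = partition_by(cities, str.casefold)

# A file holding "Apple" and "Banana" has byte-wise min/max ("Apple", "Banana").
# A case-insensitive predicate city = 'apple' matches "Apple", yet byte-wise
# statistics would let the reader skip the file.
skipped_by_bytes = not stats_may_match("Apple", "Banana", "apple")
kept_by_collation = stats_may_match("Apple", "Banana", "apple", key=str.casefold)
```

The first half shows why an equality filter under collation may have to
visit several byte-distinct partitions; the second shows why byte-based
min/max pushdown is unsafe for collated predicates.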

4 - Smaller items (nits)

UTF-8 bounds for the original field id should be "must write" not "should"
-- otherwise backward compat breaks for non-aware engines. Engine fallback
behavior (case-sensitive vs older ICU vs fail) could use a recommended
preference order to avoid divergent results across engines. The collation
specifier syntax would benefit from a formal grammar.

---

Happy to share our Delta design doc and implementation learnings in more
detail. Looking forward to the discussion.

Best,
Andrei

On Sat, Mar 28, 2026 at 11:49 PM Alexander Löser <[email protected]>
wrote:

> Hi everyone,
>
> this is my first interaction with the Iceberg community, so here are a few
> words about myself:
> - I'm Alex, a Berlin-based software engineer
> - I've been working at Snowflake for 4 years now
> - I spend most of my time on data types, particularly binary, strings and
> collations.
>
> I'd like to start a discussion about adding collations to the Iceberg spec.
>
> Conceptually, collations are an annotation on the string data type. By
> default, most engines perform string operations case-sensitively.
> Collations allow specifying alternative comparison rules. This is useful
> for achieving, e.g., case- or accent-insensitive string operations, or
> language-specific string sorting.
> Collations are supported by many engines: Databricks
> <https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation>,
> Spark
> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.collate.html>,
> Snowflake <https://docs.snowflake.com/en/sql-reference/collation>, Oracle
> <https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/COLLATION.html>
> - to name just a few; this list is not complete.
>
> In Snowflake, we see heavy use of the collation feature. Several users
> have approached us, mentioning they want to migrate to Iceberg tables, but
> are currently blocked by Iceberg's lack of collation support.
>
> Given the widespread support for collations across different engines, I
> believe introducing collations to Iceberg will increase interoperability
> and boost its adoption.
> I'd be curious about your thoughts.
>
> *Goal of the proposal*
> - Support collation specifications for columns
> - Define how collation bounds should be stored - UTF-8 based bounds are
> not useful for collated columns
>
> *Required Changes*
> - Extend the schema to let (string) fields be annotated with a collation
>
> More details can be found in this doc
> <https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0#heading=h.y1ant4w2163k>
> .
>
> I'm also hoping to present the idea in the next community sync.
>
> Best, Alex
>
>
>
