Hi Alexander,

This looks really interesting. We've been working on collation support in Delta and have shipped it in production for some time, so this is an area we care about a lot. If this proposal moves forward we'd be happy to collaborate on the design and implementation.

The pseudo-field approach for collation metrics is clean and composes well with existing Iceberg infrastructure. The specifier coverage is comprehensive. A few areas worth discussing as this evolves:

1 - Sort key stability and versioning

ICU sort keys are not stable across versions, so a pinned ICU version bump in a future Iceberg release would invalidate all existing collation metrics. In multi-engine environments, requiring all engines to converge on one ICU version is unrealistic. We store original string values instead of sort keys and allow per-file version annotations -- worth discussing whether something similar could work here.

2 - Provider abstraction

The proposal assumes ICU as the sole provider, but Spark ships non-ICU collations like UTF8_LCASE that are widely used. A provider or namespace layer would prevent name collisions and support engine-specific collations without future spec changes.

3 - Operational surface

A few things that turned out to be correctness-critical in our implementation: partition transforms on collated columns (collation-equal but byte-distinct values end up in different directories), sort order semantics, equality deletes under collation, and Parquet filter pushdown (must be disabled since Parquet has no collation concept). These don't all need to be solved in v1, but it would help to scope them.

4 - Smaller items (nits)

- UTF-8 bounds for the original field id should be "must write", not "should" -- otherwise backward compatibility breaks for non-aware engines.
- Engine fallback behavior (case-sensitive vs. older ICU vs. fail) could use a recommended preference order to avoid divergent results across engines.
- The collation specifier syntax would benefit from a formal grammar.

---

Happy to share our Delta design doc and implementation learnings in more detail. Looking forward to the discussion.

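To make the partitioning concern in point 3 concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `str.casefold()` stands in for a case-insensitive collation key (it is not ICU, and not how Delta or the Iceberg proposal computes keys), and the helper names `identity_partition`, `collation_key`, and `collated_partition` are hypothetical.

```python
from collections import defaultdict


def identity_partition(values):
    """Group values by their exact UTF-8 bytes, the way a byte-based
    identity partition transform would."""
    parts = defaultdict(list)
    for v in values:
        parts[v.encode("utf-8")].append(v)
    return dict(parts)


def collation_key(value):
    # Assumption: casefold() as a stand-in for a case-insensitive
    # collation key. A real engine would derive this from its provider.
    return value.casefold()


def collated_partition(values):
    """Group values by collation key, so collation-equal values
    land in the same partition."""
    parts = defaultdict(list)
    for v in values:
        parts[collation_key(v)].append(v)
    return dict(parts)


values = ["Berlin", "berlin", "BERLIN", "Paris"]
byte_parts = identity_partition(values)
coll_parts = collated_partition(values)

# Byte-based partitioning splits collation-equal values apart.
print(len(byte_parts), "byte-based partitions")
print(len(coll_parts), "collation-aware partitions")
```

With byte-based partitioning, a filter such as `city = 'berlin'` under a case-insensitive collation has to consider several partitions instead of one, which is why partition pruning on collated columns needs explicit treatment.
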
Best,
Andrei

On Sat, Mar 28, 2026 at 11:49 PM Alexander Löser <[email protected]> wrote:

> Hi everyone,
>
> this is my first interaction with the Iceberg community, so here a few
> words about myself:
> - I'm Alex, a Berlin-based software engineer
> - I've been working at Snowflake for 4 years now
> - I spend most of my time on data types, particularly binary, strings and
> collations.
>
> I'd like to start a discussion about adding collations to the Iceberg spec.
>
> Conceptually, collations are an annotation on the string data type. By
> default, most engines perform string operations case-sensitively.
> Collations allow specifying alternative comparison rules. This is useful
> for achieving, e.g., case- or accent-insensitive string operations, or
> language-specific string sorting.
> Collations are supported by many engines: Databricks
> <https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation>,
> Spark
> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.collate.html>,
> Snowflake <https://docs.snowflake.com/en/sql-reference/collation>, Oracle
> <https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/COLLATION.html>
> - to name just a few - this list is not complete.
>
> In Snowflake, we see heavy use of the collation feature. Several users
> have approached us, mentioning they want to migrate to Iceberg tables, but
> are currently blocked by Iceberg's lack of collation support.
>
> Given the widespread support for collations across different engines, I
> believe introducing collations to Iceberg will increase interoperability
> and boost its adoption.
> I'd be curious about your thoughts.
>
> *Goal of the proposal*
> - Support collation specifications for columns
> - Define how collation bounds should be stored - UTF-8 based bounds are
> not useful for collated columns
>
> *Required Changes*
> - Extend the schema to let (string) fields be annotated with a collation
>
> More details can be found in this doc
> <https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0#heading=h.y1ant4w2163k>.
>
> I'm also hoping to present the idea in the next community sync.
>
> Best,
> Alex
