Very nice direction. I left some comments on the spec proposal. Thanks to you folks for working on it!

Szehon

On Fri, Jun 26, 2026 at 3:29 AM Andrei Tserakhau via dev <[email protected]> wrote:

> Hi all,
>
> I've spent some cycles on the collation discussion and made something more
> concrete to react to: a spec-change PR plus reference implementations (Go
> and Java).
>
> - Spec change (apache/iceberg#16972): a "collation" annotation on string
>   fields, and a data_file.collation_bounds field so collated columns stay
>   prunable.
> - Reference implementation in iceberg-go (apache/iceberg-go#1318): the
>   full path end to end: schema annotation, collation-aware comparison
>   (CLDR/UCA), collation bounds in the manifest, and version-gated data-file
>   pruning, with an Avro round-trip and pruning tests.
> - A lightweight Java POC (link below): the schema annotation plus a
>   Collator-backed comparator, to match where the discussion is. I
>   deliberately left the manifest/bounds side out of Java for now.
>
> The design follows the original proposal but takes a few different turns,
> mostly to adopt what we learned in Delta. The ones I'd most like input on:
>
> 1 - Bounds store original values, not sort keys, tagged with a per-file
> collation version. ICU/CLDR sort keys aren't stable across versions, so
> storing keys ties every reader to one exact version; original values plus a
> per-file version (readers prune only on an exact match) degrade gracefully
> instead of breaking. The schema keeps the collation name unversioned so
> anyone can read.
>
> 2 - A provider-qualified identifier (icu.en_US-ci), leaving room for
> non-ICU collations like Spark's UTF8_LCASE, rather than assuming ICU as the
> sole provider.
>
> 3 - One structural question I don't have a strong opinion on yet: I put
> collation_bounds on data_file as a standalone v3 field, but field id 146 is
> already the v4 content_stats struct, and collation bounds might belong
> inside that typed-stats framework instead. Worth settling before we fix
> field ids.
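
Points 1 and 2 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the iceberg-go or Java code: the names `CollationBounds`, `ReaderCollation`, `split_identifier`, and `can_skip_file` are hypothetical, the `demo.lcase` collation is a stand-in that lowercases with `str.lower` instead of calling ICU, and the version strings are made up.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class CollationBounds:
    """Hypothetical per-file bounds: original values plus the writer's version."""
    collation: str  # provider-qualified name, e.g. "icu.en_US-ci"
    version: str    # collation version the writer used for this file
    lower: str      # original (not sort-key) lower bound
    upper: str      # original (not sort-key) upper bound


@dataclass(frozen=True)
class ReaderCollation:
    """Hypothetical view of the collation a reader has available."""
    name: str
    version: str
    sort_key: Callable[[str], str]  # stand-in for a real collator


def split_identifier(identifier: str) -> Tuple[str, str]:
    """Split "provider.name" at the first dot, e.g. "icu.en_US-ci"."""
    provider, sep, name = identifier.partition(".")
    if not sep or not provider or not name:
        raise ValueError(f"not a provider-qualified identifier: {identifier!r}")
    return provider, name


def can_skip_file(bounds: Optional[CollationBounds],
                  reader: ReaderCollation,
                  literal: str) -> bool:
    """Return True only when skipping the file is safe for `col = literal`."""
    if bounds is None:
        return False  # no collation bounds: the file must be read
    if bounds.collation != reader.name or bounds.version != reader.version:
        return False  # version or name mismatch: do not prune, just read
    key = reader.sort_key
    # Bounds hold original values, so the reader derives keys locally.
    return key(literal) < key(bounds.lower) or key(literal) > key(bounds.upper)
```

With matching versions the reader skips a file whose bounds exclude the literal. On any mismatch it keeps the file, which costs a read but never drops rows.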
>
> The full set of differences and the reader/writer rules are in the PR
> description and the write-up. Comments very welcome, both on the calls
> above and on whether the standalone-field vs content_stats direction is the
> right one.
>
> Best, Andrei
>
> - original proposal:
>   https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0
> - spec change: https://github.com/apache/iceberg/pull/16972
> - POC in go: https://github.com/apache/iceberg-go/pull/1318
> - java POC:
>   https://github.com/laskoviymishka/iceberg/tree/prototype/collation-support
>
> On Mon, Mar 30, 2026 at 10:54 PM Alexander Löser <[email protected]>
> wrote:
>
>> Hi Andrei,
>>
>> I'm glad you're interested. Looking forward to collaborating with you!
>> Thanks for all the feedback here and in the doc. I only had a quick
>> glance, but I think you raised some good points. I'll address/respond to
>> your comments as soon as I get the chance, hopefully tomorrow.
>> I think you also left some comments in this mail that are not yet in the
>> doc - I'll move those to a dedicated section at the end of the doc, so we
>> can use the doc as a single source of truth/discussion.
>>
>> > Happy to share our Delta design doc and implementation learnings in
>> > more detail.
>>
>> Sure, sounds good :)
>>
>> Best,
>> Alex
>> On 3/29/26 01:25, Andrei Tserakhau via dev wrote:
>>
>> Hi Alexander,
>>
>> This looks really interesting. We've been working on collation support in
>> Delta and have shipped it in production for some time, so this is an area
>> we care about a lot. If this proposal moves forward we'd be happy to
>> collaborate on the design and implementation.
>>
>> The pseudo-field approach for collation metrics is clean and composes
>> well with existing Iceberg infrastructure. The specifier coverage is
>> comprehensive.
>>
>> A few areas worth discussing as this evolves:
>>
>> 1 - Sort key stability and versioning
>>
>> ICU sort keys are not stable across versions, so bumping the pinned ICU
>> version in a future Iceberg release would invalidate all existing collation
>> metrics. In multi-engine environments, requiring all engines to converge on
>> one ICU version is unrealistic.
>>
>> We store original string values instead of sort keys and allow per-file
>> version annotations -- worth discussing whether something similar could
>> work here.
>>
>> 2 - Provider abstraction
>>
>> The proposal assumes ICU as the sole provider, but Spark ships non-ICU
>> collations like UTF8_LCASE that are widely used. A provider or namespace
>> layer would prevent name collisions and support engine-specific collations
>> without future spec changes.
>>
>> 3 - Operational surface
>>
>> A few things that turned out correctness-critical in our implementation:
>> partition transforms on collated columns (collation-equal but byte-distinct
>> values in different directories), sort order semantics, equality deletes
>> under collation, and Parquet filter pushdown (must be disabled since
>> Parquet has no collation concept).
>>
>> These don't all need to be solved in v1, but it would help to scope them.
>>
>> 4 - Smaller items (nits)
>>
>> UTF-8 bounds for the original field id should be "must write" not
>> "should" -- otherwise backward compat breaks for non-aware engines. Engine
>> fallback behavior (case-sensitive vs older ICU vs fail) could use a
>> recommended preference order to avoid divergent results across engines. The
>> collation specifier syntax would benefit from a formal grammar.
>>
>> ---
>>
>> Happy to share our Delta design doc and implementation learnings in more
>> detail. Looking forward to the discussion.
>>
>> Best,
>> Andrei
>>
>> On Sat, Mar 28, 2026 at 11:49 PM Alexander Löser <[email protected]>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> this is my first interaction with the Iceberg community, so here are a
>>> few words about myself:
>>> - I'm Alex, a Berlin-based software engineer
>>> - I've been working at Snowflake for 4 years now
>>> - I spend most of my time on data types, particularly binary, strings
>>>   and collations.
>>>
>>> I'd like to start a discussion about adding collations to the Iceberg
>>> spec.
>>>
>>> Conceptually, collations are an annotation on the string data type. By
>>> default, most engines perform string operations case-sensitively.
>>> Collations allow specifying alternative comparison rules. This is useful
>>> for achieving, e.g., case- or accent-insensitive string operations, or
>>> language-specific string sorting.
>>> Collations are supported by many engines: Databricks
>>> <https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation>,
>>> Spark
>>> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.collate.html>,
>>> Snowflake <https://docs.snowflake.com/en/sql-reference/collation>,
>>> Oracle
>>> <https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/COLLATION.html>,
>>> to name just a few; this list is not complete.
>>>
>>> In Snowflake, we see heavy use of the collation feature. Several users
>>> have approached us, mentioning they want to migrate to Iceberg tables, but
>>> are currently blocked by Iceberg's lack of collation support.
>>>
>>> Given the widespread support for collations across different engines, I
>>> believe introducing collations to Iceberg will increase interoperability
>>> and boost its adoption.
>>> I'd be curious about your thoughts.
>>>
>>> *Goal of the proposal*
>>> - Support collation specifications for columns
>>> - Define how collation bounds should be stored - UTF-8 based bounds are
>>>   not useful for collated columns
>>>
>>> *Required Changes*
>>> - Extend the schema to let (string) fields be annotated with a collation
>>>
>>> More details can be found in this doc
>>> <https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0#heading=h.y1ant4w2163k>.
>>>
>>> I'm also hoping to present the idea in the next community sync.
>>>
>>> Best, Alex
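
The goal stated in the original proposal, that UTF-8 based bounds are not useful for collated columns, can be illustrated with a small Python sketch. It is only a toy under stated assumptions: all function names are hypothetical, and a case-insensitive collation is approximated with `str.lower` rather than a real ICU collator.

```python
from typing import Callable, List, Tuple


def utf8_bounds(values: List[str]) -> Tuple[bytes, bytes]:
    """Byte-wise min/max, the way plain UTF-8 column bounds are ordered."""
    encoded = [v.encode("utf-8") for v in values]
    return min(encoded), max(encoded)


def utf8_would_skip(bounds: Tuple[bytes, bytes], literal: str) -> bool:
    """Byte-wise pruning for `col = literal`: skip if outside [lower, upper]."""
    lower, upper = bounds
    raw = literal.encode("utf-8")
    return raw < lower or raw > upper


def collated_bounds(values: List[str],
                    key: Callable[[str], str]) -> Tuple[str, str]:
    """Min/max under the collation, kept as original values."""
    return min(values, key=key), max(values, key=key)


def collated_would_skip(bounds: Tuple[str, str], literal: str,
                        key: Callable[[str], str]) -> bool:
    """Collation-aware pruning: compare keys derived from original values."""
    lower, upper = bounds
    return key(literal) < key(lower) or key(literal) > key(upper)


def file_matches(values: List[str], literal: str,
                 key: Callable[[str], str]) -> bool:
    """Ground truth: does any row equal the literal under the collation?"""
    return any(key(v) == key(literal) for v in values)


# A file holding two lowercase values, queried with an uppercase literal.
rows = ["apple", "banana"]
query = "APPLE"
ci = str.lower  # assumed stand-in for a case-insensitive collation

wrongly_skipped = utf8_would_skip(utf8_bounds(rows), query)
correctly_kept = not collated_would_skip(collated_bounds(rows, ci), query, ci)
```

Here `file_matches(rows, query, ci)` is true because "APPLE" equals "apple" under the collation, yet the byte-wise check would skip the file since "APPLE" sorts before "apple" in UTF-8. The collation-aware bounds keep the file.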
