Very nice direction, left some comments on the spec proposal.

Thanks to you folks for working on it!
Szehon

On Fri, Jun 26, 2026 at 3:29 AM Andrei Tserakhau via dev <
[email protected]> wrote:

> Hi all,
>
> I've spent some cycles on the collation discussion and made something more
> concrete to react to: a spec-change PR plus reference implementations (Go
> and Java).
>
> - Spec change (apache/iceberg#16972): a "collation" annotation on string
> fields, and a data_file.collation_bounds field so collated columns stay
> prunable.
> - Reference implementation in iceberg-go (apache/iceberg-go#1318): the
> full path end to end - schema annotation, collation-aware comparison
> (CLDR/UCA), collation bounds in the manifest, and version-gated data-file
> pruning, with an Avro round-trip and pruning tests.
> - A lightweight Java POC (link below): the schema annotation plus a
> Collator-backed comparator, to match where the discussion is. I
> deliberately left the manifest/bounds side out of Java for now.
>
> The design follows the original proposal but takes a few different turns,
> mostly to adopt what we learned in Delta. The ones I'd most like input on:
>
> 1 - Bounds store original values, not sort keys, tagged with a per-file
> collation version. ICU/CLDR sort keys aren't stable across versions, so
> storing keys ties every reader to one exact version; original values plus a
> per-file version (readers prune only on an exact match) degrade gracefully
> instead of breaking. The schema keeps the collation name unversioned so
> anyone can read.
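
A minimal sketch of the version-gated pruning rule in point 1, assuming a hypothetical `CollationBounds` record and a `may_contain` helper; neither name comes from the spec PR or the reference implementations. `str.casefold` stands in for a real collation sort key, purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CollationBounds:
    """Hypothetical per-file bounds: original strings, not sort keys."""
    lower: str
    upper: str
    version: str  # collation version the writer used for this file


def may_contain(
    bounds: Optional[CollationBounds],
    value: str,
    reader_version: str,
    sort_key: Callable[[str], str],
) -> bool:
    """Return False only when the file can be skipped safely."""
    # No bounds, or a version mismatch: the reader cannot trust the ordering,
    # so it falls back to scanning the file instead of pruning.
    if bounds is None or bounds.version != reader_version:
        return True
    # Exact version match: compare under the collation, recomputing the keys
    # from the stored original values with the reader's own library.
    key = sort_key(value)
    return sort_key(bounds.lower) <= key <= sort_key(bounds.upper)
```

Because the keys are recomputed at read time, a version mismatch costs only the pruning opportunity, not correctness.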
>
> 2 - A provider-qualified identifier (icu.en_US-ci), leaving room for
> non-ICU collations like Spark's UTF8_LCASE, rather than assuming ICU as the
> sole provider.
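
A rough sketch of how a provider-qualified identifier such as icu.en_US-ci might be split. The grammar here (everything before the first dot is the provider) is an assumption for illustration, not something the thread or the spec PR defines, and `CollationId` and `parse_collation` are hypothetical names.

```python
from typing import NamedTuple


class CollationId(NamedTuple):
    provider: str  # e.g. "icu"
    name: str      # provider-specific collation name, e.g. "en_US-ci"


def parse_collation(identifier: str) -> CollationId:
    """Split 'provider.name' on the first dot (assumed grammar)."""
    provider, sep, name = identifier.partition(".")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>.<name>', got {identifier!r}")
    # Assumption: provider names are case-insensitive, collation names are not.
    return CollationId(provider.lower(), name)
```

Under this assumed grammar, `parse_collation("icu.en_US-ci")` yields provider `icu` and name `en_US-ci`, and an unqualified name is rejected rather than silently defaulting to ICU.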
>
> 3 - One structural question I don't have a strong opinion on yet: I put
> collation_bounds on data_file as a standalone v3 field, but field id 146 is
> already the v4 content_stats struct, and collation bounds might belong
> inside that typed-stats framework instead. Worth settling before we fix
> field ids.
>
> The full set of differences and the reader/writer rules are in the PR
> description and the write-up. Comments are very welcome, both on the calls
> above and on whether the standalone-field vs content_stats direction is the
> right one.
>
> Best, Andrei
>
> - original proposal:
> https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0
> - spec change: https://github.com/apache/iceberg/pull/16972
> - POC in go: https://github.com/apache/iceberg-go/pull/1318
> - java POC:
> https://github.com/laskoviymishka/iceberg/tree/prototype/collation-support
>
> On Mon, Mar 30, 2026 at 10:54 PM Alexander Löser <[email protected]>
> wrote:
>
>> Hi Andrei,
>>
>> I'm glad you're interested. Looking forward to collaborating with you!
>> Thanks for all the feedback here and in the doc. I only had a quick
>> glance, but I think you raised some good points. I'll address/respond to
>> your comments as soon as I get the chance, hopefully tomorrow.
>> I think you also left some comments in this mail that are not yet in the
>> doc - I'll move those to a dedicated section at the end of the doc, so we
>> can use the doc as a single source of truth/discussion.
>>
>> > Happy to share our Delta design doc and implementation learnings in
>> > more detail.
>>
>> Sure, sounds good :)
>>
>> Best,
>> Alex
>> On 3/29/26 01:25, Andrei Tserakhau via dev wrote:
>>
>> Hi Alexander,
>>
>> This looks really interesting. We've been working on collation support in
>> Delta and have shipped it in production for some time, so this is an area
>> we care about a lot. If this proposal moves forward we'd be happy to
>> collaborate on the design and implementation.
>>
>> The pseudo-field approach for collation metrics is clean and composes
>> well with existing Iceberg infrastructure. The specifier coverage is
>> comprehensive.
>>
>> A few areas worth discussing as this evolves:
>>
>> 1 - Sort key stability and versioning
>>
>> ICU sort keys are not stable across versions, so a pinned ICU version
>> bump in a future Iceberg release would invalidate all existing collation
>> metrics. In multi-engine environments, requiring all engines to converge on
>> one ICU version is unrealistic.
>>
>> We store original string values instead of sort keys and allow per-file
>> version annotations -- worth discussing whether something similar could
>> work here.
>>
>> 2 - Provider abstraction
>>
>> The proposal assumes ICU as the sole provider, but Spark ships non-ICU
>> collations like UTF8_LCASE that are widely used. A provider or namespace
>> layer would prevent name collisions and support engine-specific collations
>> without future spec changes.
>>
>> 3 - Operational surface
>>
>> A few things that turned out correctness-critical in our implementation:
>> partition transforms on collated columns (collation-equal but byte-distinct
>> values in different directories), sort order semantics, equality deletes
>> under collation, and Parquet filter pushdown (must be disabled since
>> Parquet has no collation concept).
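
To illustrate the partition-transform concern above, here is a small sketch, assuming an identity-style partition and using `str.casefold` as a stand-in for a case-insensitive collation. `partition_by` is a hypothetical helper, not an Iceberg API.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List


def partition_by(
    values: Iterable[str], key: Callable[[str], Hashable]
) -> Dict[Hashable, List[str]]:
    """Group values by a partition key, like directories in a table layout."""
    groups: Dict[Hashable, List[str]] = defaultdict(list)
    for v in values:
        groups[key(v)].append(v)
    return dict(groups)


cities = ["Berlin", "berlin", "BERLIN", "Paris"]

# Byte-based identity partitioning: every distinct byte string is its own
# partition, even when the values are equal under the collation.
by_bytes = partition_by(cities, lambda s: s.encode("utf-8"))

# Collation-aware partitioning: collation-equal values share one partition.
by_collation = partition_by(cities, str.casefold)
```

A reader that filters on a collation-equal value would have to visit all three byte-distinct "Berlin" partitions in the first layout but only one in the second.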
>>
>> These don't all need to be solved in v1, but it would help to scope them.
>>
>> 4 - Smaller items (nits)
>>
>> UTF-8 bounds for the original field id should be "must write" not
>> "should" -- otherwise backward compat breaks for non-aware engines. Engine
>> fallback behavior (case-sensitive vs older ICU vs fail) could use a
>> recommended preference order to avoid divergent results across engines. The
>> collation specifier syntax would benefit from a formal grammar.
>>
>> ---
>>
>> Happy to share our Delta design doc and implementation learnings in more
>> detail. Looking forward to the discussion.
>>
>> Best,
>> Andrei
>>
>> On Sat, Mar 28, 2026 at 11:49 PM Alexander Löser <[email protected]>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> this is my first interaction with the Iceberg community, so here are a few
>>> words about myself:
>>> - I'm Alex, a Berlin-based software engineer
>>> - I've been working at Snowflake for 4 years now
>>> - I spend most of my time on data types, particularly binary, strings
>>> and collations.
>>>
>>> I'd like to start a discussion about adding collations to the Iceberg
>>> spec.
>>>
>>> Conceptually, collations are an annotation on the string data type. By
>>> default, most engines perform string operations case-sensitively.
>>> Collations allow specifying alternative comparison rules. This is useful
>>> for achieving, e.g., case- or accent-insensitive string operations, or
>>> language-specific string sorting.
>>> Collations are supported by many engines: Databricks
>>> <https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation>,
>>> Spark
>>> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.collate.html>,
>>> Snowflake <https://docs.snowflake.com/en/sql-reference/collation>,
>>> Oracle
>>> <https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/COLLATION.html>
>>> (to name just a few; this list is not complete).
>>>
>>> In Snowflake, we see heavy use of the collation feature. Several users
>>> have approached us, mentioning they want to migrate to Iceberg tables, but
>>> are currently blocked by Iceberg's lack of collation support.
>>>
>>> Given the widespread support for collations across different engines, I
>>> believe introducing collations to Iceberg will increase interoperability
>>> and boost its adoption.
>>> I'd be curious about your thoughts.
>>>
>>> *Goal of the proposal*
>>> - Support collation specifications for columns
>>> - Define how collation bounds should be stored - UTF-8 based bounds are
>>> not useful for collated columns
>>>
>>> *Required Changes*
>>> - Extend the schema to let (string) fields be annotated with a collation
>>>
>>> More details can be found in this doc
>>> <https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0#heading=h.y1ant4w2163k>
>>> .
>>>
>>> I'm also hoping to present the idea in the next community sync.
>>>
>>> Best, Alex
>>>
>>>
>>>
