Re: [Discussion] Collation Support

2026-03-30 Thread Alexander Löser

Hi Andrei,

I'm glad you're interested. Looking forward to collaborating with you!
Thanks for all the feedback here and in the doc. I only had a quick
glance, but I think you raised some good points. I'll address/respond
to your comments as soon as I get the chance, hopefully tomorrow.
I think you also left some comments in this mail that are not yet in the
doc - I'll move those to a dedicated section at the end of the doc, so
we can use the doc as a single source of truth/discussion.


> Happy to share our Delta design doc and implementation learnings in
> more detail.


Sure, sounds good :)

Best,
Alex


Re: [Discussion] Collation Support

2026-03-28 Thread Andrei Tserakhau via dev

Hi Alexander,

This looks really interesting. We've been working on collation support in
Delta and have shipped it in production for some time, so this is an area
we care about a lot. If this proposal moves forward we'd be happy to
collaborate on the design and implementation.

The pseudo-field approach for collation metrics is clean and composes well
with existing Iceberg infrastructure. The specifier coverage is
comprehensive.

A few areas worth discussing as this evolves:

1 - Sort key stability and versioning

ICU sort keys are not stable across versions, so a pinned ICU version bump
in a future Iceberg release would invalidate all existing collation
metrics. In multi-engine environments, requiring all engines to converge on
one ICU version is unrealistic.

We store original string values instead of sort keys and allow per-file
version annotations -- worth discussing whether something similar could
work here.
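
A minimal Python sketch of that idea, not the Delta implementation. It
assumes a toy case-insensitive collation built on str.casefold as a
stand-in for ICU; FileBounds, collation_version and may_contain are
hypothetical names used only for illustration.

```python
from dataclasses import dataclass


def sort_key(value: str) -> str:
    # Toy stand-in for a collator: case-insensitive ordering via casefold.
    # A real engine would call its collation provider (e.g. ICU) here.
    return value.casefold()


@dataclass(frozen=True)
class FileBounds:
    # Hypothetical per-file metrics: the ORIGINAL strings, not sort keys,
    # plus the collation version the writer used to pick them.
    lower: str
    upper: str
    collation_version: str


def may_contain(bounds: FileBounds, value: str, reader_version: str) -> bool:
    # Bounds picked under a different collation version may no longer be
    # the true min/max, so the safe answer is "cannot prune this file".
    if bounds.collation_version != reader_version:
        return True
    # Same version: derive sort keys at read time and compare those.
    return sort_key(bounds.lower) <= sort_key(value) <= sort_key(bounds.upper)


bounds = FileBounds(lower="apple", upper="Mango", collation_version="v1")
print(may_contain(bounds, "BANANA", "v1"))  # inside the collated range
print(may_contain(bounds, "zebra", "v1"))   # outside, file can be skipped
print(may_contain(bounds, "zebra", "v2"))   # version mismatch, keep the file
```

Because the stored bounds are plain strings, a version bump only costs
pruning on older files; nothing has to be rewritten.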

2 - Provider abstraction

The proposal assumes ICU as the sole provider, but Spark ships non-ICU
collations like UTF8_LCASE that are widely used. A provider or namespace
layer would prevent name collisions and support engine-specific collations
without future spec changes.
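
One possible shape for such a layer, sketched in Python. The
"<provider>.<name>" syntax, the "icu" default and the names CollationId
and parse_collation are assumptions for illustration; none of them come
from the proposal or from any engine.

```python
from typing import NamedTuple


class CollationId(NamedTuple):
    provider: str
    name: str


def parse_collation(spec: str, default_provider: str = "icu") -> CollationId:
    # Hypothetical syntax: "<provider>.<name>". A bare name falls back to
    # the default provider.
    provider, sep, name = spec.partition(".")
    if not sep:
        provider, name = default_provider, spec
    if not provider or not name:
        raise ValueError(f"malformed collation specifier: {spec!r}")
    return CollationId(provider.lower(), name)


print(parse_collation("spark.UTF8_LCASE"))
print(parse_collation("en_US"))
# The same name under two providers stays distinct, so it cannot collide.
print(parse_collation("icu.UNICODE") == parse_collation("spark.UNICODE"))
```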

3 - Operational surface

A few things that turned out correctness-critical in our implementation:
partition transforms on collated columns (collation-equal but byte-distinct
values in different directories), sort order semantics, equality deletes
under collation, and Parquet filter pushdown (must be disabled since
Parquet has no collation concept).
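
The partition issue can be reproduced in a few lines of Python. This is
a sketch under assumptions: str.casefold stands in for a case-insensitive
collation, and partition_rows is a hypothetical helper that models an
identity partition transform as "one directory per distinct key".

```python
from collections import defaultdict


def partition_rows(values, key=lambda v: v):
    # Identity partition transform: one "directory" per distinct key.
    partitions = defaultdict(list)
    for value in values:
        partitions[key(value)].append(value)
    return dict(partitions)


rows = ["Berlin", "BERLIN", "berlin", "Minsk"]

# Partitioning on the raw strings: collation-equal values are split up.
by_bytes = partition_rows(rows)

# Partitioning on a collation key (toy stand-in: casefold).
by_collation = partition_rows(rows, key=str.casefold)

print(len(by_bytes), len(by_collation))  # prints: 4 2
```

With the raw-string layout, a filter that matches 'berlin' under the
collation has to visit three directories rather than one.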

These don't all need to be solved in v1, but it would help to scope them.

4 - Smaller items (nits)

UTF-8 bounds for the original field id should be "must write" not "should"
-- otherwise backward compat breaks for non-aware engines. Engine fallback
behavior (case-sensitive vs older ICU vs fail) could use a recommended
preference order to avoid divergent results across engines. The collation
specifier syntax would benefit from a formal grammar.
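
To illustrate what a preference order would buy, here is a small Python
sketch. The order shown is an arbitrary placeholder, not a
recommendation, and PREFERENCE and pick_behavior are hypothetical names.

```python
from typing import Optional, Sequence

# Illustrative order only; which order to recommend is the open question.
PREFERENCE = ("exact_version", "older_icu", "case_sensitive")


def pick_behavior(available: Sequence[str],
                  preference: Sequence[str] = PREFERENCE) -> Optional[str]:
    # Return the first preferred behavior the engine supports.
    # None means no acceptable fallback exists and the engine should fail.
    for behavior in preference:
        if behavior in available:
            return behavior
    return None


print(pick_behavior(["case_sensitive", "older_icu"]))
print(pick_behavior([]))
```

Two engines with the same capabilities then resolve to the same
behavior instead of each picking its own.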

---

Happy to share our Delta design doc and implementation learnings in more
detail. Looking forward to the discussion.

Best,
Andrei



[Discussion] Collation Support

2026-03-28 Thread Alexander Löser

Hi everyone,

this is my first interaction with the Iceberg community, so here are a
few words about myself:

- I'm Alex, a Berlin-based software engineer
- I've been working at Snowflake for 4 years now
- I spend most of my time on data types, particularly binary, strings 
and collations.


I'd like to start a discussion about adding collations to the Iceberg spec.

Conceptually, collations are an annotation on the string data type. By 
default, most engines perform string operations case-sensitively.
Collations allow specifying alternative comparison rules. This is useful 
for achieving, e.g., case- or accent-insensitive string operations, or 
language-specific string sorting.
Collations are supported by many engines: Databricks, Spark, Snowflake,
Oracle - to name just a few - this list is not complete.


In Snowflake, we see heavy use of the collation feature. Several users 
have approached us, mentioning they want to migrate to Iceberg tables, 
but are currently blocked by Iceberg's lack of collation support.


Given the widespread support for collations across different engines, I 
believe introducing collations to Iceberg will increase interoperability 
and boost its adoption.

I'd be curious about your thoughts.

*Goal of the proposal*
- Support collation specifications for columns
- Define how collation bounds should be stored - UTF-8 based bounds are 
not useful for collated columns


*Required Changes*
- Extend the schema to let (string) fields be annotated with a collation

More details can be found in this doc.


I'm also hoping to present the idea in the next community sync.

Best, Alex