Hi Dan, It’s the latter. Nothing concrete so far. Appreciate the response.

________________________________
From: Daniel Weeks <[email protected]>
Sent: Thursday, February 19, 2026 12:18:50 PM
To: [email protected] <[email protected]>
Cc: Jean-Baptiste Onofré <[email protected]>; Anand Kumar Sankaran <[email protected]>
Subject: Re: Re: Column-Level Key-Value Properties (Tags) in Iceberg

Hey Anand,

This is an interesting topic that I think the community is open to discussing, as security is an increasingly important part of data access and interoperability. If you already have a proposal, I'd be happy to work with you on this. If not, I know others are looking into similar functionality and we could collaborate to bring something to the community that's a little more concrete.

-Dan

On Thu, Feb 19, 2026 at 10:52 AM Anand Kumar Sankaran via dev <[email protected]> wrote:

JB, Dan, Walaa, Jack and others,

I am working on rolling out an Iceberg catalog (using Apache Polaris). We share a lot of PII data and, eventually, sensitive financial data. Having column-level key-value properties will really help with the governance aspect. Ideally, we want to be fully compliant with the fine-grained access control proposals for Iceberg and want all our partners and customers to adopt them. The column-level key-value properties will help us as well. I am interested in bringing a proposal forward for this. Any guidance would be greatly appreciated. Thank you.

— Anand
Workday Data Cloud

On 1/9/24, 6:59 AM, "Jean-Baptiste Onofré" <[email protected]> wrote:

It makes sense. Agree.
Regards
JB

On Mon, Jan 8, 2024 at 11:48 PM Daniel Weeks <[email protected]> wrote:
>
> JB,
>
> I would draw a distinction between the catalog and this proposed feature in that the catalog is actually not part of the spec, so it is entirely up to the engine and is optional.
>
> When it comes to the table spec, "optional" does not mean that it does not have to be implemented/supported. Any engine/library that produces metadata would need to support column-level properties so that it does not drop or improperly handle the metadata elements, even if it does not expose a way to view/manipulate them. This is why scrutiny of spec changes is critical.
>
> +1 to what you said about documentation and support.
>
> -Dan
>
> On Mon, Jan 8, 2024 at 1:38 AM Jean-Baptiste Onofré <[email protected]> wrote:
>>
>> Hi Dan,
>>
>> I agree: it will depend on the engine capabilities. That said, it's similar to catalogs: each catalog might have different approaches/features/capabilities, so engines might have different capabilities as well. If it's an optional feature in the spec, and each engine might or might not implement it, that's OK. But it's certainly not a requirement. That said, we would need to clearly document the capabilities of each engine (and catalog). (I don't say this documentation should be in Iceberg, but engine "providers" would need to clearly state the supported features.)
>>
>> Regards
>> JB
>>
>> On Mon, Jan 8, 2024 at 6:33 AM Daniel Weeks <[email protected]> wrote:
>>>
>>> The main risk I see is that this adds complexity and there may be limited use of the feature, which makes me question the value. Spark seems like the most likely/obvious to add native support for column-level properties, but there are a wide range of engines that may never really adopt this (e.g. Trino, Dremio, Doris, Starrocks, Redshift), as there isn't a SQL specification for table/column properties to my knowledge.
>>>
>>> I do think it would be nice for engines that have similar concepts if it really can be natively integrated, and I'm sure there are other use cases for column properties, but it still feels somewhat niche.
>>>
>>> That being said, I'm not opposed, and if there's interest in getting a proposal put together for the spec changes, we'll get a much better idea of any challenges.
>>>
>>> Thanks,
>>> -Dan
>>>
>>> On Thu, Jan 4, 2024 at 11:55 AM Walaa Eldin Moustafa <[email protected]> wrote:
>>>>
>>>> Agree that it should not be use case specific. There could be other applications beyond governance. As John mentioned, ML is another domain, and it is actually the case at LinkedIn as well.
>>>>
>>>> I would approach this with the understanding that the key requirement is to add key/value properties at the column level, not necessarily solving for compliance. Compliance is just one of the applications and can leverage this feature in many ways. But one of the key requirements in compliance, ML, and other applications is enriching column-level metadata. Other systems (Avro, BigQuery, Snowflake) do that too, as pointed out in the original message. Since Iceberg is the source of truth for schema/column/field data, it sounds reasonable that the column-level metadata should co-exist in the same place, hence the Iceberg-level proposal. Other external solutions are possible of course (for column-level metadata, not necessarily "compliance"), but with the compromise of possible schema drift and inconsistency.
>>>> For example, at LinkedIn, we use Datahub for compliance annotations/tags (this is an example of an external system, even outside the catalog) and use Avro schema literals for ML column-level metadata (this is an example of a table-level property). In both situations, it would have been better if the tags co-existed with the column definitions. So the tradeoff is really between: (1) enhancing the Iceberg spec to minimize inconsistency in this domain, or (2) letting Iceberg users come up with custom, disparate, and potentially inconsistent solutions. What do you all think?
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>> On Thu, Jan 4, 2024 at 11:14 AM Daniel Weeks <[email protected]> wrote:
>>>>>
>>>>> I'm not opposed to the idea of adding column-level properties, with a few considerations:
>>>>>
>>>>> - We shouldn't explicitly tie it to a particular use case like data governance. You may be able to leverage this for those capabilities, but adding anything use-case specific gets into some really opinionated areas and makes the feature less generalizable.
>>>>> - We need to be really explicit about the behaviors around evolution, tags, and branches, as it could have implications for how features built around this behave.
>>>>> - Iceberg would need to be the source of truth for this information, to keep external tags from misrepresenting the underlying schema definition.
>>>>>
>>>>> I would agree with Jack that there may be other ways to approach policy information, so we should explore those and see if they would render this functionality less useful overall (I'm sure there are ways we can use column-level properties, but if the main driver is policy, this may not be worth the investment at the moment).
>>>>> -Dan
>>>>>
>>>>> On Wed, Jan 3, 2024 at 5:40 PM Renjie Liu <[email protected]> wrote:
>>>>>>
>>>>>> This proposal sounds good to me.
>>>>>>
>>>>>>> If we talk specifically about governance features, I am not sure if column property is the best way though. Consider the case of having a column which was not PII, but becomes PII because a certain law has passed. The operation a user would perform in this case is something like "ALTER TABLE MODIFY COLUMN col SET PROPERTIES ('pii'='true')". However, the Iceberg schema is versioned; that means if you time travel to some time before the MODIFY COLUMN operation, the PII column is still accessible.
>>>>>>
>>>>>> This sounds like reasonable behavior to me. It is just like a DDL such as "ALTER TABLE ADD COLUMNS (new_column string)": if we time travel to an older version, we should also not see the new_column.
>>>>>>
>>>>>> On Thu, Jan 4, 2024 at 6:26 AM John Zhuge <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Walaa,
>>>>>>>
>>>>>>> Netflix internal Spark and Iceberg have supported column metadata in Iceberg tables since Spark 2.4. The Spark data type is `org.apache.spark.sql.types.Metadata` in StructType. The feature is used by ML teams.
>>>>>>>
>>>>>>> It'd be great for the feature to be adopted.
>>>>>>>
>>>>>>> On Wed, Jan 3, 2024 at 1:18 PM Walaa Eldin Moustafa <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Thanks Jack!
>>>>>>>>
>>>>>>>> I think generic key-value pairs are still valuable, even for data governance.
>>>>>>>>
>>>>>>>> Regarding schema versions and PII evolution over time, I actually think it is a good feature to keep PII and schema in sync across versions for data reproducibility.
>>>>>>>> Consistency is key in time travel scenarios - the objective should be to replicate data states accurately, regardless of subsequent changes in column tags. On the other hand, organizations typically make special arrangements when it comes to addressing compliance in the context of time travel. For example, in the data deletion use case, special accommodation should take place to address the fact that time travel can facilitate restoring the data. Finally, I am not very concerned about the case where a field evolves to PII=true while it is still set to PII=false in the time travel window. Typically, the time travel window is on the order of days but the regulation enforcement window is on the order of months. Most often, the data versions with PII=false would have cycled out of the system before the regulatory enforcement is in effect.
>>>>>>>>
>>>>>>>> I also think that the catalog-level example in AWS Glue still needs to consistently ensure schema compatibility? How does it ensure that the columns referenced in the policies are in sync with the Iceberg table schema, especially when the Iceberg table schema is evolved while the policies and referenced columns are not?
>>>>>>>>
>>>>>>>> Regarding bringing policy and compliance semantics into Iceberg as a top-level construct, I agree this is taking it a bit too far and might be out of scope. Further, compliance policies can be quite complicated, and a predefined set of permissions/access controls can be too restrictive and not flexible enough to capture various compliance needs, like dynamic data masking.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa.
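The time-travel behavior debated in the messages above can be made concrete with a toy model. This is a minimal editorial sketch, assuming a hypothetical per-column `properties` map attached to each schema version; neither the `properties` map nor these class names are part of the Iceberg spec — they only illustrate why a tag set on the latest schema is invisible when reading an older snapshot:

```python
# Toy model of versioned schemas carrying per-column properties.
# The "properties" map is a HYPOTHETICAL spec extension, for illustration only.
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    type: str
    properties: dict = field(default_factory=dict)  # hypothetical column-level tags


@dataclass
class TableMetadata:
    schemas: dict    # schema_id -> list of Columns
    snapshots: list  # (snapshot_id, schema_id) pairs, oldest first

    def schema_at(self, snapshot_id: int) -> list:
        """Time travel: resolve the schema that was current for a snapshot."""
        schema_id = dict(self.snapshots)[snapshot_id]
        return self.schemas[schema_id]


# Schema v1: 'email' not yet tagged as PII; v2: tagged after a policy change
# (the "ALTER TABLE ... SET PROPERTIES ('pii'='true')" scenario in the thread).
table = TableMetadata(
    schemas={
        1: [Column("email", "string")],
        2: [Column("email", "string", {"pii": "true"})],
    },
    snapshots=[(100, 1), (200, 2)],
)

# Reading the latest snapshot sees the tag ...
assert table.schema_at(200)[0].properties.get("pii") == "true"
# ... but time traveling to snapshot 100 sees the column untagged,
# which is the consistency question Jack and Walaa discuss above.
assert table.schema_at(100)[0].properties.get("pii") is None
```

This matches Renjie's framing: versioned column tags behave like any other schema change under time travel, for better (reproducibility) or worse (a retroactively-PII column stays readable in old snapshots until they cycle out).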
>>>>>>>>
>>>>>>>> On Wed, Jan 3, 2024 at 10:09 AM Jack Ye <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Thanks for bringing this topic up! I can provide some perspective on AWS Glue's related features.
>>>>>>>>>
>>>>>>>>> The AWS Glue table definition also has a column parameters feature (ref). This does not serve any governance purpose at this moment, but it is a pretty convenient feature that allows users to add arbitrary tags to columns. As you said, it is technically just a fancier and more structured doc field for a column, which I don't have a strong opinion about adding or not in Iceberg.
>>>>>>>>>
>>>>>>>>> If we talk specifically about governance features, I am not sure if column property is the best way though. Consider the case of having a column which was not PII, but becomes PII because a certain law has passed. The operation a user would perform in this case is something like "ALTER TABLE MODIFY COLUMN col SET PROPERTIES ('pii'='true')". However, the Iceberg schema is versioned; that means if you time travel to some time before the MODIFY COLUMN operation, the PII column is still accessible. So what you really want is to globally set the column to be PII, instead of just the latest column, but that becomes a bit incompatible with Iceberg's versioned schema model.
>>>>>>>>>
>>>>>>>>> In AWS Glue, such governance features are provided at the policy and table level. Information like PII and sensitivity level is essentially persisted as LakeFormation policies that are attached to the table but separated from the table.
>>>>>>>>> After users configure column/row-level access to a table through LakeFormation, the table response received by services like EMR Spark, Athena, and Glue ETL will contain additional fields of authorized columns and cell filters (ref), which allows these engines to apply the authorization to any schema of the table that will be used for the query. In this approach, the user's policy setting is decoupled from the table's schema evolution over time, which avoids problems like the one above in time travel, and many other types of unintended user configuration mistakes.
>>>>>>>>>
>>>>>>>>> So I think a full governance story would mean adding something similar in Iceberg's table model. For example, we could add a "policy" field that contains sub-fields like the table's basic access permission (READ/WRITE/ADMIN), authorized columns, data filters, etc. I am not sure if Iceberg needs its own policy spec though; that might go a bit too far.
>>>>>>>>>
>>>>>>>>> Any thoughts?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jack Ye
>>>>>>>>>
>>>>>>>>> On Wed, Jan 3, 2024 at 1:10 AM Walaa Eldin Moustafa <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Iceberg Developers,
>>>>>>>>>>
>>>>>>>>>> I would like to start a discussion on a potential enhancement to Iceberg around the implementation of key-value style properties (tags) for individual columns or fields. I believe this feature could have significant applications, especially in the domain of data governance.
>>>>>>>>>> Here are some examples of how this feature can potentially be used:
>>>>>>>>>>
>>>>>>>>>> * PII Classification: Indicating whether a field contains Personally Identifiable Information (e.g., PII -> {true, false}).
>>>>>>>>>> * Ontology Mapping: Associating fields with specific ontology terms (e.g., Type -> {USER_ID, USER_NAME, LOCATION}).
>>>>>>>>>> * Sensitivity Level Setting: Defining the sensitivity level of a field (e.g., Sensitive -> {High, Medium, Low}).
>>>>>>>>>>
>>>>>>>>>> While current workarounds like table-level properties or column-level comments/docs exist, they lack the structured approach needed for these use cases. Table-level properties often require constant schema validation and can be error-prone, especially when not in sync with the table schema. Additionally, column-level comments, while useful, do not enforce a standardized format.
>>>>>>>>>>
>>>>>>>>>> I am also interested in hearing thoughts or experiences around whether this problem is addressed at the catalog level in any of the implementations (e.g., AWS Glue). My impression is that even with catalog-level implementations, there's still a need for continual validation against the table schema. Further, catalog-specific implementations will lack a standardized specification. A spec could be beneficial for areas requiring consistent and structured metadata management.
>>>>>>>>>>
>>>>>>>>>> I realize that introducing this feature may necessitate the development of APIs in various engines to set these properties or tags, such as extensions in Spark or Trino SQL.
>>>>>>>>>> However, I believe it’s a worthwhile discussion to have, separate from whether Iceberg should include these features in its APIs. For the sake of this thread we can focus on the Iceberg APIs aspect.
>>>>>>>>>>
>>>>>>>>>> Here are some references to similar concepts in other systems:
>>>>>>>>>>
>>>>>>>>>> * Avro attributes: Avro 1.10.2 Specification - Schemas (see "Attributes not defined in this document are permitted as metadata").
>>>>>>>>>> * BigQuery policy tags: BigQuery Column-level Security.
>>>>>>>>>> * Snowflake object tagging: Snowflake Object Tagging Documentation (see references to "MODIFY COLUMN").
>>>>>>>>>>
>>>>>>>>>> Looking forward to your insights on whether addressing this issue at the Iceberg specification and API level is a reasonable direction.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Walaa.
>>>>>>>
>>>>>>> --
>>>>>>> John Zhuge
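To make the proposal in the original message concrete, here is an editorial sketch of how a per-field properties map might be serialized in an Iceberg schema JSON fragment. The `properties` key on a field is a hypothetical extension, shown only for illustration; it does not exist in any published Iceberg spec version, while the surrounding keys (`id`, `name`, `required`, `type`) follow the spec's existing field layout:

```python
# Sketch of a HYPOTHETICAL "properties" map on Iceberg schema fields.
# Only "id", "name", "required", and "type" exist in the real table spec.
import json

schema = {
    "type": "struct",
    "schema-id": 0,
    "fields": [
        {
            "id": 1,
            "name": "user_name",
            "required": True,
            "type": "string",
            # Hypothetical column-level key-value properties, covering the
            # three example uses from the original message:
            "properties": {
                "pii": "true",
                "ontology-type": "USER_NAME",
                "sensitivity": "High",
            },
        },
        {
            # A field with no properties map stays exactly as the spec
            # defines it today, which keeps the extension backward compatible.
            "id": 2,
            "name": "event_ts",
            "required": False,
            "type": "timestamptz",
        },
    ],
}

# The structure round-trips through JSON like any other metadata field.
roundtrip = json.loads(json.dumps(schema))
assert roundtrip["fields"][0]["properties"]["pii"] == "true"
```

One point from the thread that such a layout would have to resolve: as Dan notes, any writer producing metadata must preserve this map even if it does not expose it, since "optional" in the table spec does not mean safe to drop.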
