Hey Anand,

This is an interesting topic that I think the community is open to
discussing as security is an increasingly important part of data access and
interoperability.  If you already have a proposal, I'd be happy to work
with you on this.  If not, I know others are looking into
similar functionality and we could collaborate to bring something to the
community that's a little more concrete.

-Dan

On Thu, Feb 19, 2026 at 10:52 AM Anand Kumar Sankaran via dev <
[email protected]> wrote:

> JB, Dan, Walaa, Jack and others,
>
> I am working on rolling out an Iceberg catalog (using Apache Polaris). We
> share a lot of PII data and eventually sensitive financial data.  Having
> column-level key-value properties will really help with the governance
> aspect. Ideally, we want to be fully compliant with the fine-grained access
> control proposals of Iceberg and want all our partners and customers to
> adopt them.  The column-level key-value properties will help us as well.
>
> I am interested in putting together a proposal for this. Any guidance
> would be greatly appreciated.
>
> Thank you.
>
> —
> Anand
> Workday Data Cloud
>
> On 1/9/24, 6:59 AM, "Jean-Baptiste Onofré" <[email protected]> wrote:
>
> It makes sense. Agree.
>
> Regards
> JB
>
> On Mon, Jan 8, 2024 at 11:48 PM Daniel Weeks <[email protected]>
> wrote:
> >
> > JB,
> >
> > I would draw a distinction between catalog and this proposed feature in
> that the catalog is actually not part of the spec, so it is entirely up to
> the engine and is optional.
> >
> > When it comes to the table spec, "optional" does not mean that it does
> not have to be implemented/supported.  Any engine/library that produces
> metadata would need to support column-level properties so that it does not
> drop or improperly handle the metadata elements, even if it does not expose
> a way to view/manipulate them.  This is why scrutiny of spec changes is
> critical.
> >
> > +1 to what you said about documentation and support.
> >
> > -Dan
> >
> >
> >
> > On Mon, Jan 8, 2024 at 1:38 AM Jean-Baptiste Onofré <[email protected]>
> wrote:
> >>
> >> Hi Dan,
> >>
> >> I agree: it will depend on the engine capabilities. That said, it's
> >> similar to catalog: each catalog might have different
> >> approaches/features/capabilities, so engines might have different
> >> capabilities as well.
> >> If it's an optional feature in the spec, and each engine might or
> >> might not implement it, that's ok. But it's certainly not a
> >> requirement.
> >> That said, we would need to clearly document the capabilities of each
> >> engine (and catalog) (I don't say this documentation should be in
> >> Iceberg, but engine "providers" would need to clearly state the
> >> supported features).
> >>
> >> Regards
> >> JB
> >>
> >> On Mon, Jan 8, 2024 at 6:33 AM Daniel Weeks <[email protected]> wrote:
> >> >
> >> > The main risk I see is that this adds complexity and there may be
> limited use of the feature, which makes me question the value.  Spark seems
> like the most likely/obvious to add native support for column-level
> properties, but there are a wide range of engines that may never really
> adopt this (e.g. Trino, Dremio, Doris, StarRocks, Redshift) as there isn't a
> SQL specification for table/column properties to my knowledge.
> >> >
> >> > I do think it would be nice for engines that have similar concepts if
> it really can be natively integrated and I'm sure there are other use cases
> for column properties, but it still feels somewhat niche.
> >> >
> >> > That being said, I'm not opposed and if there's interest in getting a
> proposal put together for the spec changes, we'll get a much better idea of
> any challenges.
> >> >
> >> > Thanks,
> >> > -Dan
> >> >
> >> > On Thu, Jan 4, 2024 at 11:55 AM Walaa Eldin Moustafa <
> [email protected]> wrote:
> >> >>
> >> >> Agree that it should not be use case specific. There could be other
> applications beyond governance. As John mentioned, ML is another domain,
> and it is actually the case at LinkedIn as well.
> >> >>
> >> >> I would approach this with the understanding that the key
> requirement is to add key/value properties at the column level, not
> necessarily solving for compliance. Compliance is just one of the
> applications and can leverage this feature in many ways. But one of the key
> requirements in compliance, ML, and other applications is enriching
> column-level metadata. Other systems (Avro, BigQuery, Snowflake) do that
> too as pointed out in the original message. Since Iceberg is the source of
> truth for schema/column/field data, it sounds reasonable that the
> column-level metadata should co-exist in the same place, hence the
> Iceberg-level proposal. Other external solutions are possible of course
> (for column level metadata, not necessarily "compliance"), but with the
> compromise of possible schema drift and inconsistency. For example, at
> LinkedIn, we use Datahub for compliance annotations/tags (this is an
> example of an external system, even outside the catalog) and use Avro
> schema literals for ML column-level metadata (this is an example of a
> table-level property). In both situations, it would have been better if the tags
> co-existed with the column definitions. So the tradeoff is really between:
> (1) Enhancing Iceberg spec to minimize inconsistency in this domain, or (2)
> Letting Iceberg users come up with custom, disparate, and potentially
> inconsistent solutions. What do you all think?
> >> >>
> >> >> Thanks,
> >> >> Walaa.
> >> >>
> >> >> On Thu, Jan 4, 2024 at 11:14 AM Daniel Weeks <[email protected]>
> wrote:
> >> >>>
> >> >>> I am not opposed to the idea of adding column-level properties with a
> few considerations:
> >> >>>
> >> >>> * We shouldn't explicitly tie it to a particular use case like data
> governance.  You may be able to leverage this for those capabilities, but
> adding anything use case specific gets into some really opinionated areas
> and makes the feature less generalizable.
> >> >>> * We need to be really explicit about the behaviors around evolution,
> tags and branches as it could have implications for how features built around
> this behave.
> >> >>> * Iceberg would need to be the source of truth for this information
> to keep external tags from misrepresenting the underlying schema definition.
> >> >>> I would agree with Jack that there may be other ways to approach
> policy information so we should explore those and see if those would render
> this functionality less useful overall (I'm sure there are ways we can use
> column-level properties, but if the main driver is policy, this may not be
> worth the investment at the moment).
> >> >>>
> >> >>> -Dan
> >> >>>
> >> >>> On Wed, Jan 3, 2024 at 5:40 PM Renjie Liu <[email protected]>
> wrote:
> >> >>>>
> >> >>>> This proposal sounds good to me.
> >> >>>>
> >> >>>>> If we talk specifically about governance features, I am not sure
> if column property is the best way though. Consider the case of having a
> column which was not PII, but becomes PII because a certain law has passed.
> The operation a user would perform in this case is something like "ALTER
> TABLE MODIFY COLUMN col SET PROPERTIES ('pii'='true')". However, Iceberg
> schema is versioned, that means if you time travel to some time before the
> MODIFY COLUMN operation, the PII column is still accessible.
> >> >>>>
> >> >>>>
> >> >>>> This sounds like reasonable behavior to me. This is just like when we
> run a DDL statement like "ALTER TABLE ADD COLUMNS (new_column string)":
> if we time travel to an older version, we should also not see
> new_column.
> >> >>>>
> >> >>>> On Thu, Jan 4, 2024 at 6:26 AM John Zhuge <[email protected]>
> wrote:
> >> >>>>>
> >> >>>>> Hi Walaa,
> >> >>>>>
> >> >>>>> Netflix internal Spark and Iceberg have supported column metadata
> in Iceberg tables since Spark 2.4. The Spark data type is
> `org.apache.spark.sql.types.Metadata` in StructType. The feature is used by
> ML teams.
> >> >>>>>
> >> >>>>> It'd be great for the feature to be adopted.
> >> >>>>>
> >> >>>>>
> >> >>>>> On Wed, Jan 3, 2024 at 1:18 PM Walaa Eldin Moustafa <
> [email protected]> wrote:
> >> >>>>>>
> >> >>>>>> Thanks Jack!
> >> >>>>>>
> >> >>>>>> I think generic key value pairs are still valuable, even for
> data governance.
> >> >>>>>>
> >> >>>>>> Regarding schema versions and PII evolution over time, I
> actually think it is a good feature to keep PII and schema in sync across
> versions for data reproducibility. Consistency is key in time travel
> scenarios - the objective should be to replicate data states accurately,
> regardless of subsequent changes in column tags. On the other hand,
> organizations typically make special arrangements when it comes to
> addressing compliance in the context of time travel. For example, in the
> data deletion use case, special accommodation should be made to address
> the fact that time travel can facilitate restoring the data. Finally, I am
> not very concerned about the case when a field evolves to PII=true while it
> is still set to PII=false in the time travel window. Typically, the time
> travel window is on the order of days but the regulation enforcement window is
> on the order of months. Most often, the data versions with PII=false would
> have cycled out of the system before the regulatory enforcement is in
> effect.
> >> >>>>>>
> >> >>>>>> I also think that the catalog level example in AWS Glue still
> needs to consistently ensure schema compatibility. How does it ensure that
> the columns referenced in the policies are in sync with the Iceberg table
> schema, especially when the Iceberg table schema is evolved but the
> policies and referenced columns are not?
> >> >>>>>>
> >> >>>>>> Regarding bringing policy and compliance semantics aspects into
> Iceberg as a top level construct, I agree this is taking it a bit too far
> and might be out of scope. Further, compliance policies can be quite
> complicated, and a predefined set of permissions/access controls can be too
> restrictive and not flexible enough to capture various compliance needs,
> like dynamic data masking.
> >> >>>>>>
> >> >>>>>> Thanks,
> >> >>>>>> Walaa.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Wed, Jan 3, 2024 at 10:09 AM Jack Ye <[email protected]>
> wrote:
> >> >>>>>>>
> >> >>>>>>> Thanks for bringing this topic up! I can provide some
> perspective about AWS Glue's related features.
> >> >>>>>>>
> >> >>>>>>> AWS Glue table definition also has a column parameters feature
> (ref). This does not serve any governance purpose at this moment, but it is
> a pretty convenient feature that allows users to add arbitrary tags to
> columns. As you said, it is technically just a fancier and more
> structured doc field for a column, and I don't have a strong opinion
> about whether to add it to Iceberg.
> >> >>>>>>>
> >> >>>>>>> If we talk specifically about governance features, I am not
> sure if column property is the best way though. Consider the case of having
> a column which was not PII, but becomes PII because a certain law has passed.
> The operation a user would perform in this case is something like "ALTER
> TABLE MODIFY COLUMN col SET PROPERTIES ('pii'='true')". However, Iceberg
> schema is versioned, that means if you time travel to some time before the
> MODIFY COLUMN operation, the PII column is still accessible. So what
> you really want is to globally set the column to be PII, instead of just
> the latest column, but that becomes a bit incompatible with Iceberg's
> versioned schema model.
> >> >>>>>>>
> >> >>>>>>> In AWS Glue, such governance features are provided at the policy
> and table level. Information like PII and sensitivity level is
> essentially persisted as LakeFormation policies that are attached to the
> table but separated from the table. After users configure column/row-level
> access to a table through LakeFormation, the
> table response received by services like EMR Spark, Athena, and Glue ETL will
> contain additional fields for authorized columns and cell filters (ref),
> which allow these engines to apply the authorization to any schema of the
> table that will be used for the query. In this approach, the user's policy
> setting is decoupled from the table's schema evolution over time, which
> avoids problems like the one above in time travel, and many other types of
> unintended user configuration mistakes.
> >> >>>>>>>
> >> >>>>>>> So I think a full governance story would mean adding something
> similar in Iceberg's table model. For example, we could add a "policy"
> field that contains sub-fields like the table's basic access permission
> (READ/WRITE/ADMIN), authorized columns, data filters, etc. I am not sure if
> Iceberg needs its own policy spec though, that might go a bit too far.
> >> >>>>>>>
> >> >>>>>>> Any thoughts?
> >> >>>>>>>
> >> >>>>>>> Best,
> >> >>>>>>> Jack Ye
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> On Wed, Jan 3, 2024 at 1:10 AM Walaa Eldin Moustafa <
> [email protected]> wrote:
> >> >>>>>>>>
> >> >>>>>>>> Hi Iceberg Developers,
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> I would like to start a discussion on a potential enhancement
> to Iceberg around the implementation of key-value style properties (tags)
> for individual columns or fields. I believe this feature could have
> significant applications, especially in the domain of data governance.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Here are some examples of how this feature can be potentially
> used:
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> * PII Classification: Indicating whether a field contains
> Personally Identifiable Information (e.g., PII -> {true, false}).
> >> >>>>>>>>
> >> >>>>>>>> * Ontology Mapping: Associating fields with specific ontology
> terms (e.g., Type -> {USER_ID, USER_NAME, LOCATION}).
> >> >>>>>>>>
> >> >>>>>>>> * Sensitivity Level Setting: Defining the sensitivity level of
> a field (e.g., Sensitive -> {High, Medium, Low}).
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> While current workarounds like table-level properties or
> column-level comments/docs exist, they lack the structured approach needed
> for these use cases. Table-level properties often require constant schema
> validation and can be error-prone, especially when not in sync with the
> table schema. Additionally, column-level comments, while useful, do not
> enforce a standardized format.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> I am also interested in hearing thoughts or experiences around
> whether this problem is addressed at the catalog level in any of the
> implementations (e.g., AWS Glue). My impression is that even with
> catalog-level implementations, there's still a need for continual
> validation against the table schema. Further, catalog-specific
> implementations will lack a standardized specification. A spec could be
> beneficial for areas requiring consistent and structured metadata
> management.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> I realize that introducing this feature may necessitate the
> development of APIs in various engines to set these properties or tags,
> such as extensions in Spark or Trino SQL. However, I believe it’s a
> worthwhile discussion to have, separate from whether Iceberg should include
> these features in its APIs. For the sake of this thread we can focus on the
> Iceberg APIs aspect.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Here are some references to similar concepts in other systems:
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> * Avro attributes: Avro 1.10.2 Specification - Schemas (see
> "Attributes not defined in this document are permitted as metadata").
> >> >>>>>>>>
> >> >>>>>>>> * BigQuery policy tags: BigQuery Column-level Security.
> >> >>>>>>>>
> >> >>>>>>>> * Snowflake object tagging: Snowflake Object Tagging
> Documentation (see references to "MODIFY COLUMN").
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Looking forward to your insights on whether addressing this
> issue at the Iceberg specification and API level is a reasonable direction.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Thanks,
> >> >>>>>>>> Walaa.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> --
> >> >>>>> John Zhuge
>
>
>
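
The time-travel behavior debated in this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Column, Table, and is_pii names are hypothetical and are not Iceberg APIs, and the "properties" map stands in for the column-level key-value properties being proposed, which the thread treats as not yet part of the spec. Each snapshot simply remembers which schema version was current when it was committed.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    type: str
    # Hypothetical column-level key-value properties (the feature under discussion).
    properties: dict = field(default_factory=dict)


@dataclass
class Table:
    schemas: list = field(default_factory=list)    # list index is the schema id
    snapshots: list = field(default_factory=list)  # each entry is the schema id of that snapshot

    def commit_schema(self, columns):
        """Store a new schema version and return the id of the snapshot that uses it."""
        self.schemas.append(list(columns))
        self.snapshots.append(len(self.schemas) - 1)
        return len(self.snapshots) - 1

    def schema_as_of(self, snapshot_id):
        """Time travel: return the schema that was current at the given snapshot."""
        return self.schemas[self.snapshots[snapshot_id]]


def is_pii(table, snapshot_id, column_name):
    """Read the 'pii' property of a column as recorded at the given snapshot."""
    for col in table.schema_as_of(snapshot_id):
        if col.name == column_name:
            return col.properties.get("pii") == "true"
    raise KeyError(column_name)


table = Table()
# Snapshot 0: the column exists but carries no PII tag yet.
s0 = table.commit_schema([Column("email", "string")])
# Snapshot 1: equivalent of SET PROPERTIES ('pii'='true') on the column.
s1 = table.commit_schema([Column("email", "string", {"pii": "true"})])

print(is_pii(table, s1, "email"))  # tag is visible at the latest snapshot
print(is_pii(table, s0, "email"))  # tag is absent when time traveling before the change
```

In this toy model the tag follows the schema version, which is the behavior Renjie and Walaa describe as reasonable, and also the reason Jack suggests that a policy meant to apply globally may need to live outside the versioned schema.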
