Hey Anand,

This is an interesting topic that I think the community is open to discussing, as security is an increasingly important part of data access and interoperability. If you already have a proposal, I'd be happy to work with you on this. If not, I know others are looking into similar functionality, and we could collaborate to bring something to the community that's a little more concrete.
-Dan

On Thu, Feb 19, 2026 at 10:52 AM Anand Kumar Sankaran via dev <[email protected]> wrote:

> JB, Dan, Walaa, Jack, and others,
>
> I am working on rolling out an Iceberg catalog (using Apache Polaris). We share a lot of PII data and, eventually, sensitive financial data. Having column-level key-value properties would really help with the governance aspect. Ideally, we want to be fully compliant with the fine-grained access control proposals of Iceberg and want all our partners and customers to adopt them. Column-level key-value properties would help us there as well.
>
> I am interested in putting together a proposal for this. Any guidance would be greatly appreciated.
>
> Thank you.
>
> —
> Anand
> Workday Data Cloud
>
> On 1/9/24, 6:59 AM, "Jean-Baptiste Onofré" <[email protected]> wrote:
>
> It makes sense. Agree.
>
> Regards
> JB
>
> On Mon, Jan 8, 2024 at 11:48 PM Daniel Weeks <[email protected]> wrote:
> >
> > JB,
> >
> > I would draw a distinction between the catalog and this proposed feature, in that the catalog is actually not part of the spec, so it is entirely up to the engine and is optional.
> >
> > When it comes to the table spec, "optional" does not mean that it does not have to be implemented/supported. Any engine/library that produces metadata would need to support column-level properties so that it does not drop or improperly handle the metadata elements, even if it does not expose a way to view/manipulate them. This is why scrutiny of spec changes is critical.
> >
> > +1 to what you said about documentation and support.
> >
> > -Dan
> >
> > On Mon, Jan 8, 2024 at 1:38 AM Jean-Baptiste Onofré <[email protected]> wrote:
> >>
> >> Hi Dan,
> >>
> >> I agree: it will depend on the engine capabilities. That said, it's similar to catalogs: each catalog might have different approaches/features/capabilities, so engines might have different capabilities as well. If it's an optional feature in the spec, and each engine might or might not implement it, that's OK. But it's certainly not a requirement. That said, we would need to clearly document the capabilities of each engine (and catalog). (I'm not saying this documentation should be in Iceberg, but engine "providers" would need to clearly state the supported features.)
> >>
> >> Regards
> >> JB
> >>
> >> On Mon, Jan 8, 2024 at 6:33 AM Daniel Weeks <[email protected]> wrote:
> >> >
> >> > The main risk I see is that this adds complexity and there may be limited use of the feature, which makes me question the value. Spark seems like the most likely/obvious engine to add native support for column-level properties, but there is a wide range of engines that may never really adopt this (e.g., Trino, Dremio, Doris, StarRocks, Redshift), as there isn't a SQL specification for table/column properties to my knowledge.
> >> >
> >> > I do think it would be nice for engines that have similar concepts, if it really can be natively integrated, and I'm sure there are other use cases for column properties, but it still feels somewhat niche.
> >> >
> >> > That being said, I'm not opposed, and if there's interest in getting a proposal put together for the spec changes, we'll get a much better idea of any challenges.
> >> >
> >> > Thanks,
> >> > -Dan
> >> >
> >> > On Thu, Jan 4, 2024 at 11:55 AM Walaa Eldin Moustafa <[email protected]> wrote:
> >> >>
> >> >> Agree that it should not be use-case specific. There could be other applications beyond governance. As John mentioned, ML is another domain, and that is actually the case at LinkedIn as well.
> >> >>
> >> >> I would approach this with the understanding that the key requirement is to add key/value properties at the column level, not necessarily solving for compliance. Compliance is just one of the applications and can leverage this feature in many ways. But one of the key requirements in compliance, ML, and other applications is enriching column-level metadata. Other systems (Avro, BigQuery, Snowflake) do that too, as pointed out in the original message. Since Iceberg is the source of truth for schema/column/field data, it sounds reasonable that the column-level metadata should co-exist in the same place, hence the Iceberg-level proposal. Other external solutions are possible, of course (for column-level metadata, not necessarily "compliance"), but with the compromise of possible schema drift and inconsistency. For example, at LinkedIn, we use DataHub for compliance annotations/tags (an example of an external system, even outside the catalog) and use Avro schema literals for ML column-level metadata (an example of a table-level property). In both situations, it would have been better if the tags co-existed with the column definitions. So the tradeoff is really between (1) enhancing the Iceberg spec to minimize inconsistency in this domain, or (2) letting Iceberg users come up with custom, disparate, and potentially inconsistent solutions. What do you all think?
> >> >>
> >> >> Thanks,
> >> >> Walaa.
> >> >>
> >> >> On Thu, Jan 4, 2024 at 11:14 AM Daniel Weeks <[email protected]> wrote:
> >> >>>
> >> >>> I'm not opposed to the idea of adding column-level properties, with a few considerations:
> >> >>>
> >> >>> We shouldn't explicitly tie it to a particular use case like data governance. You may be able to leverage this for those capabilities, but adding anything use-case specific gets into some really opinionated areas and makes the feature less generalizable.
> >> >>> We need to be really explicit about the behaviors around evolution, tags, and branches, as they could have implications for how features built around this behave.
> >> >>> Iceberg would need to be the source of truth for this information to keep external tags from misrepresenting the underlying schema definition.
> >> >>>
> >> >>> I would agree with Jack that there may be other ways to approach policy information, so we should explore those and see if they would render this functionality less useful overall (I'm sure there are ways we can use column-level properties, but if the main driver is policy, this may not be worth the investment at the moment).
> >> >>>
> >> >>> -Dan
> >> >>>
> >> >>> On Wed, Jan 3, 2024 at 5:40 PM Renjie Liu <[email protected]> wrote:
> >> >>>>
> >> >>>> This proposal sounds good to me.
> >> >>>>
> >> >>>>> If we talk specifically about governance features, I am not sure if column property is the best way though. Consider the case of having a column which was not PII, but becomes PII because a certain law has passed. The operation a user would perform in this case is something like "ALTER TABLE MODIFY COLUMN col SET PROPERTIES ('pii'='true')". However, Iceberg schema is versioned, which means that if you time travel to some time before the MODIFY COLUMN operation, the PII column is still accessible.
> >> >>>>
> >> >>>> This sounds like reasonable behavior to me. It is just like running a DDL such as "ALTER TABLE ADD COLUMNS (new_column string)": if we time travel to an older version, we should also not see the new_column.
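For illustration, a minimal PySpark sketch of the time-travel behavior Renjie and Jack are discussing, assuming a recent Spark + Iceberg runtime with an Iceberg catalog already configured; the table name and snapshot id below are placeholders, not taken from this thread:

# Minimal time-travel sketch; assumes the SparkSession is configured with an
# Iceberg catalog. Table name and snapshot id are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-time-travel-example").getOrCreate()

# Current table state (would reflect any schema/property change made so far).
spark.sql("SELECT * FROM prod.db.customers").show()

# Time travel to an older snapshot: the read is resolved against the metadata
# recorded for that snapshot, so anything set by a later ALTER would not be
# visible in this historical view.
spark.sql("SELECT * FROM prod.db.customers VERSION AS OF 10963874102873").show()
spark.sql("SELECT * FROM prod.db.customers TIMESTAMP AS OF '2024-01-01 00:00:00'").show()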
> >> >>>>
> >> >>>> On Thu, Jan 4, 2024 at 6:26 AM John Zhuge <[email protected]> wrote:
> >> >>>>>
> >> >>>>> Hi Walaa,
> >> >>>>>
> >> >>>>> Netflix's internal Spark and Iceberg have supported column metadata in Iceberg tables since Spark 2.4. The Spark data type is `org.apache.spark.sql.types.Metadata` in StructType. The feature is used by ML teams.
> >> >>>>>
> >> >>>>> It'd be great for the feature to be adopted.
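For reference, the Spark-side column metadata John mentions is already expressible through StructField metadata in open-source Spark; a small sketch where the column names and tag keys ("pii", "sensitivity", "type") are illustrative only, not an Iceberg convention:

# Sketch of Spark's built-in per-column metadata (StructField.metadata).
# Column names and metadata keys are illustrative, not a standard.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("column-metadata-example").getOrCreate()

schema = StructType([
    StructField("user_id", StringType(), nullable=False,
                metadata={"pii": False, "type": "USER_ID"}),
    StructField("email", StringType(), nullable=True,
                metadata={"pii": True, "sensitivity": "High"}),
])

df = spark.createDataFrame([("u1", "a@example.com")], schema)

# The metadata travels with each StructField and can be inspected by code
# that understands these keys.
for field in df.schema.fields:
    print(field.name, field.metadata)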
> >> >>>>>
> >> >>>>> On Wed, Jan 3, 2024 at 1:18 PM Walaa Eldin Moustafa <[email protected]> wrote:
> >> >>>>>>
> >> >>>>>> Thanks Jack!
> >> >>>>>>
> >> >>>>>> I think generic key-value pairs are still valuable, even for data governance.
> >> >>>>>>
> >> >>>>>> Regarding schema versions and PII evolution over time, I actually think it is a good feature to keep PII and schema in sync across versions for data reproducibility. Consistency is key in time travel scenarios: the objective should be to replicate data states accurately, regardless of subsequent changes in column tags. On the other hand, organizations typically make special arrangements when it comes to addressing compliance in the context of time travel. For example, in the data deletion use case, special accommodation should take place to address the fact that time travel can facilitate restoring the data. Finally, I am not very concerned about the case where a field evolves to PII=true while it is still set to PII=false in the time travel window. Typically, the time travel window is on the order of days, while the regulation enforcement window is on the order of months. Most often, the data versions with PII=false would have cycled out of the system before the regulatory enforcement is in effect.
> >> >>>>>>
> >> >>>>>> I also think that the catalog-level example in AWS Glue still needs to consistently ensure schema compatibility. How does it ensure that the columns referenced in the policies are in sync with the Iceberg table schema, especially when the Iceberg table schema is evolved while the policies and referenced columns are not?
> >> >>>>>>
> >> >>>>>> Regarding bringing policy and compliance semantics into Iceberg as a top-level construct, I agree this is taking it a bit too far and might be out of scope. Further, compliance policies can be quite complicated, and a predefined set of permissions/access controls can be too restrictive and not flexible enough to capture various compliance needs, like dynamic data masking.
> >> >>>>>>
> >> >>>>>> Thanks,
> >> >>>>>> Walaa.
> >> >>>>>>
> >> >>>>>> On Wed, Jan 3, 2024 at 10:09 AM Jack Ye <[email protected]> wrote:
> >> >>>>>>>
> >> >>>>>>> Thanks for bringing this topic up! I can provide some perspective on AWS Glue's related features.
> >> >>>>>>>
> >> >>>>>>> The AWS Glue table definition also has a column parameters feature (ref). This does not serve any governance purpose at this moment, but it is a pretty convenient feature that allows users to add arbitrary tags to columns. As you said, it is technically just a fancier, more structured doc field for a column, which I don't have a strong opinion about adding or not in Iceberg.
> >> >>>>>>>
> >> >>>>>>> If we talk specifically about governance features, I am not sure if column property is the best way though. Consider the case of having a column which was not PII, but becomes PII because a certain law has passed. The operation a user would perform in this case is something like "ALTER TABLE MODIFY COLUMN col SET PROPERTIES ('pii'='true')". However, Iceberg schema is versioned, which means that if you time travel to some time before the MODIFY COLUMN operation, the PII column is still accessible. So what you really want is to globally set the column to be PII, instead of just in the latest schema, but that becomes a bit incompatible with Iceberg's versioned schema model.
> >> >>>>>>>
> >> >>>>>>> In AWS Glue, such governance features are provided at the policy and table level. Information like PII and sensitivity level is essentially persisted as Lake Formation policies that are attached to the table but separated from the table. After users configure column/row-level access to a table through Lake Formation, the table response received by services like EMR Spark, Athena, and Glue ETL will contain additional fields of authorized columns and cell filters (ref), which allows these engines to apply the authorization to any schema of the table that will be used for the query. In this approach, the user's policy settings are decoupled from the table's schema evolution over time, which avoids problems like the time-travel one above and many other types of unintended user configuration mistakes.
> >> >>>>>>>
> >> >>>>>>> So I think a full governance story would mean adding something similar to Iceberg's table model. For example, we could add a "policy" field that contains sub-fields like the table's basic access permission (READ/WRITE/ADMIN), authorized columns, data filters, etc. I am not sure if Iceberg needs its own policy spec though; that might go a bit too far.
> >> >>>>>>>
> >> >>>>>>> Any thoughts?
> >> >>>>>>>
> >> >>>>>>> Best,
> >> >>>>>>> Jack Ye
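For those unfamiliar with the Glue column parameters Jack refers to, a minimal boto3 sketch of reading them back; the region, database, and table names are placeholders, and for Iceberg tables the authoritative schema still lives in Iceberg metadata, which is part of the consistency concern raised above:

# Sketch of reading per-column Parameters from an AWS Glue table definition.
# Region, database, and table names are hypothetical placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

resp = glue.get_table(DatabaseName="analytics", Name="customers")
for col in resp["Table"]["StorageDescriptor"]["Columns"]:
    # Parameters is a free-form string-to-string map attached to each column.
    print(col["Name"], col.get("Parameters", {}))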
> >> >>>>>>>
> >> >>>>>>> On Wed, Jan 3, 2024 at 1:10 AM Walaa Eldin Moustafa <[email protected]> wrote:
> >> >>>>>>>>
> >> >>>>>>>> Hi Iceberg Developers,
> >> >>>>>>>>
> >> >>>>>>>> I would like to start a discussion on a potential enhancement to Iceberg: key-value style properties (tags) for individual columns or fields. I believe this feature could have significant applications, especially in the domain of data governance.
> >> >>>>>>>>
> >> >>>>>>>> Here are some examples of how this feature could potentially be used:
> >> >>>>>>>>
> >> >>>>>>>> * PII Classification: Indicating whether a field contains Personally Identifiable Information (e.g., PII -> {true, false}).
> >> >>>>>>>> * Ontology Mapping: Associating fields with specific ontology terms (e.g., Type -> {USER_ID, USER_NAME, LOCATION}).
> >> >>>>>>>> * Sensitivity Level Setting: Defining the sensitivity level of a field (e.g., Sensitive -> {High, Medium, Low}).
> >> >>>>>>>>
> >> >>>>>>>> While current workarounds like table-level properties or column-level comments/docs exist, they lack the structured approach needed for these use cases. Table-level properties often require constant schema validation and can be error-prone, especially when not in sync with the table schema. Additionally, column-level comments, while useful, do not enforce a standardized format.
> >> >>>>>>>>
> >> >>>>>>>> I am also interested in hearing thoughts or experiences on whether this problem is addressed at the catalog level in any of the implementations (e.g., AWS Glue). My impression is that even with catalog-level implementations, there is still a need for continual validation against the table schema. Further, catalog-specific implementations will lack a standardized specification. A spec could be beneficial for areas requiring consistent and structured metadata management.
> >> >>>>>>>>
> >> >>>>>>>> I realize that introducing this feature may necessitate the development of APIs in various engines to set these properties or tags, such as extensions to Spark or Trino SQL. However, I believe it's a worthwhile discussion to have, separate from whether Iceberg should include these features in its APIs. For the sake of this thread, we can focus on the Iceberg APIs aspect.
> >> >>>>>>>>
> >> >>>>>>>> Here are some references to similar concepts in other systems:
> >> >>>>>>>>
> >> >>>>>>>> * Avro attributes: Avro 1.10.2 Specification - Schemas (see "Attributes not defined in this document are permitted as metadata").
> >> >>>>>>>> * BigQuery policy tags: BigQuery Column-level Security.
> >> >>>>>>>> * Snowflake object tagging: Snowflake Object Tagging Documentation (see references to "MODIFY COLUMN").
> >> >>>>>>>>
> >> >>>>>>>> Looking forward to your insights on whether addressing this issue at the Iceberg specification and API level is a reasonable direction.
> >> >>>>>>>>
> >> >>>>>>>> Thanks,
> >> >>>>>>>> Walaa.
> >> >>>>>
> >> >>>>> --
> >> >>>>> John Zhuge
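To make the original proposal concrete, a purely hypothetical sketch of what a per-field property map could look like on an Iceberg schema field, next to the Avro custom-attribute mechanism Walaa cites; the "properties" key is not part of the current Iceberg table spec, and all field names and tag values here are illustrative:

# Purely hypothetical: a "properties" map on an Iceberg schema field if the
# proposal were adopted. NOT part of the current Iceberg table spec.
hypothetical_iceberg_field = {
    "id": 2,
    "name": "email",
    "required": False,
    "type": "string",
    "doc": "User email address",
    "properties": {          # proposed addition, does not exist today
        "pii": "true",
        "sensitivity": "High",
    },
}

# For comparison, the Avro spec already permits attributes it does not define;
# they are carried as metadata on the field.
avro_field_with_custom_attribute = {
    "name": "email",
    "type": ["null", "string"],
    "default": None,
    "pii": True,             # custom attribute, treated as metadata by Avro
}

print(hypothetical_iceberg_field["properties"])
print(avro_field_with_custom_attribute["pii"])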
