Hi Walaa,

Netflix's internal Spark and Iceberg have supported column metadata in Iceberg tables since Spark 2.4. The Spark data type is `org.apache.spark.sql.types.Metadata`, carried on each `StructField` of a `StructType`. The feature is used by ML teams.
It'd be great for the feature to be adopted.

On Wed, Jan 3, 2024 at 1:18 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
> Thanks Jack!
>
> I think generic key value pairs are still valuable, even for data
> governance.
>
> Regarding schema versions and PII evolution over time, I actually think it
> is a good feature to keep PII and schema in sync across versions for data
> reproducibility. Consistency is key in time travel scenarios - the
> objective should be to replicate data states accurately, regardless of
> subsequent changes in column tags. On the other hand, organizations
> typically make special arrangements when it comes to addressing compliance
> in the context of time travel. For example, in the data deletion use case,
> special accommodation should take place to address the fact that time
> travel can facilitate restoring the data. Finally, I am not very concerned
> about the case when a field evolves to PII=true while it is still set to
> PII=false in the time travel window. Typically, the time travel window is
> on the order of days but the regulation enforcement window is on the order
> of months. Most often, the data versions with PII=false would have cycled
> out of the system before the regulatory enforcement is in effect.
>
> I also think that the catalog-level example in AWS Glue still needs to
> consistently ensure schema compatibility. How does it ensure that the
> columns referenced in the policies are in sync with the Iceberg table
> schema, especially when the Iceberg table schema evolves but the policies
> and referenced columns do not?
>
> Regarding bringing policy and compliance semantics into Iceberg as a
> top-level construct, I agree this is taking it a bit too far and might be
> out of scope. Further, compliance policies can be quite complicated, and a
> predefined set of permissions/access controls can be too restrictive and
> not flexible enough to capture various compliance needs, like dynamic data
> masking.
>
> Thanks,
> Walaa.
>
>
> On Wed, Jan 3, 2024 at 10:09 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Thanks for bringing this topic up! I can provide some perspective about
>> AWS Glue's related features.
>>
>> AWS Glue table definition also has a column parameters feature (ref
>> <https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-Column>).
>> This does not serve any governance purpose at this moment, but it is a
>> pretty convenient feature that allows users to add arbitrary tags to
>> columns. As you said, it is technically just a fancier and more
>> structured doc field for a column, and I don't have a strong opinion
>> on whether to add it to Iceberg.
>>
>> If we talk specifically about governance features, I am not sure
>> column properties are the best way, though. Consider the case of having a
>> column which was not PII, but becomes PII because a certain law has passed.
>> The operation a user would perform in this case is something like "ALTER
>> TABLE MODIFY COLUMN col SET PROPERTIES ('pii'='true')". However, Iceberg
>> schema is versioned, which means that if you time travel to some time
>> before the MODIFY COLUMN operation, the PII column is still accessible.
>> So what you really want is to globally set the column to be PII, instead
>> of just the latest version of the column, but that becomes a bit
>> incompatible with Iceberg's versioned schema model.
>>
>> In AWS Glue, such governance features are provided at the policy and table
>> level. Information such as PII and sensitivity level is essentially
>> persisted as LakeFormation policies that are attached to the table but
>> separate from the table.
>> After users configure column/row-level access to
>> a table through LakeFormation, what would happen is that the table response
>> received by services like EMR Spark, Athena, Glue ETL will contain
>> additional fields for authorized columns and cell filters (ref
>> <https://docs.aws.amazon.com/glue/latest/webapi/API_GetUnfilteredTableMetadata.html#API_GetUnfilteredTableMetadata_ResponseElements>),
>> which allows these engines to apply the authorization to any schema of the
>> table that will be used for the query. In this approach, the user's policy
>> setting is decoupled from the table's schema evolution over time, which
>> avoids problems like the one above in time travel, and many other types of
>> unintended user configuration mistakes.
>>
>> So I think a full governance story would mean adding something similar to
>> Iceberg's table model. For example, we could add a "policy" field that
>> contains sub-fields like the table's basic access permission
>> (READ/WRITE/ADMIN), authorized columns, data filters, etc. I am not sure
>> Iceberg needs its own policy spec, though; that might go a bit too far.
>>
>> Any thoughts?
>>
>> Best,
>> Jack Ye
>>
>>
>> On Wed, Jan 3, 2024 at 1:10 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Hi Iceberg Developers,
>>>
>>>
>>> I would like to start a discussion on a potential enhancement to Iceberg
>>> around the implementation of key-value style properties (tags) for
>>> individual columns or fields. I believe this feature could have significant
>>> applications, especially in the domain of data governance.
>>>
>>>
>>> Here are some examples of how this feature can be potentially used:
>>>
>>>
>>> * PII Classification: Indicating whether a field contains Personally
>>> Identifiable Information (e.g., PII -> {true, false}).
>>>
>>> * Ontology Mapping: Associating fields with specific ontology terms
>>> (e.g., Type -> {USER_ID, USER_NAME, LOCATION}).
>>>
>>> * Sensitivity Level Setting: Defining the sensitivity level of a field
>>> (e.g., Sensitive -> {High, Medium, Low}).
>>>
>>>
>>> While current workarounds like table-level properties or column-level
>>> comments/docs exist, they lack the structured approach needed for these use
>>> cases. Table-level properties often require constant schema validation and
>>> can be error-prone, especially when not in sync with the table schema.
>>> Additionally, column-level comments, while useful, do not enforce a
>>> standardized format.
>>>
>>>
>>> I am also interested in hearing thoughts or experiences around whether
>>> this problem is addressed at the catalog level in any of the
>>> implementations (e.g., AWS Glue). My impression is that even with
>>> catalog-level implementations, there's still a need for continual
>>> validation against the table schema. Further, catalog-specific
>>> implementations will lack a standardized specification. A spec could be
>>> beneficial for areas requiring consistent and structured metadata
>>> management.
>>>
>>>
>>> I realize that introducing this feature may necessitate the development
>>> of APIs in various engines to set these properties or tags, such as
>>> extensions in Spark or Trino SQL. However, I believe it’s a worthwhile
>>> discussion to have, separate from whether Iceberg should include these
>>> features in its APIs. For the sake of this thread we can focus on the
>>> Iceberg APIs aspect.
>>>
>>>
>>> Here are some references to similar concepts in other systems:
>>>
>>>
>>> * Avro attributes: *Avro 1.10.2 Specification - Schemas*
>>> <https://avro.apache.org/docs/1.10.2/spec.html#schemas> (see
>>> "Attributes not defined in this document are permitted as metadata").
>>>
>>> * BigQuery policy tags: *BigQuery Column-level Security*
>>> <https://cloud.google.com/bigquery/docs/column-level-security#set_policy>.
>>>
>>> * Snowflake object tagging: *Snowflake Object Tagging Documentation*
>>> <https://docs.snowflake.com/en/user-guide/object-tagging#create-and-assign-tags>
>>> (see references to "MODIFY COLUMN").
>>>
>>>
>>> Looking forward to your insights on whether addressing this issue at the
>>> Iceberg specification and API level is a reasonable direction.
>>>
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>>

--
John Zhuge
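
For readers following the thread, the time-travel concern in Jack's message can be made concrete with a minimal sketch. It assumes a toy in-memory model only: `Table`, `Column`, `commit_schema`, `pii_from_properties`, and `pii_from_policy` are hypothetical names for illustration and are not part of the Iceberg, Spark, or AWS Glue APIs. It contrasts a PII tag stored as a column property in a versioned schema with a PII flag kept in a policy outside the schema.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    properties: dict = field(default_factory=dict)  # hypothetical per-column tags


@dataclass
class Table:
    # Hypothetical stand-in for a table with versioned schemas; not Iceberg's API.
    schemas: list = field(default_factory=list)   # one column list per schema version
    policy_pii: set = field(default_factory=set)  # columns flagged outside the schema

    def commit_schema(self, columns):
        """Record a new schema version and return its version id."""
        self.schemas.append(list(columns))
        return len(self.schemas) - 1

    def pii_from_properties(self, version):
        """PII columns according to the column properties of one schema version."""
        return {c.name for c in self.schemas[version]
                if c.properties.get("pii") == "true"}

    def pii_from_policy(self, version):
        """PII columns according to the table-level policy, for any schema version."""
        return {c.name for c in self.schemas[version] if c.name in self.policy_pii}


table = Table()
v0 = table.commit_schema([Column("id"), Column("email", {"pii": "false"})])
# Rough equivalent of: ALTER TABLE ... MODIFY COLUMN email SET PROPERTIES ('pii'='true')
v1 = table.commit_schema([Column("id"), Column("email", {"pii": "true"})])
# Policy-style alternative: flag the column once, outside the versioned schema.
table.policy_pii.add("email")

print(table.pii_from_properties(v1))  # latest schema: the tag is visible
print(table.pii_from_properties(v0))  # time travel to v0: the tag is absent
print(table.pii_from_policy(v0))      # policy still covers the old schema
```

In this model the property-based tag follows the schema version, which matches Walaa's reproducibility argument, while the policy-based flag applies to every version, which matches Jack's description of the decoupled approach.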