Hi Walaa,

Netflix's internal versions of Spark and Iceberg have supported column
metadata in Iceberg tables since Spark 2.4. On the Spark side, it is carried
as `org.apache.spark.sql.types.Metadata` on the fields of a StructType. The
feature is used by ML teams.

It'd be great for the feature to be adopted.


On Wed, Jan 3, 2024 at 1:18 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Thanks Jack!
>
> I think generic key value pairs are still valuable, even for data
> governance.
>
> Regarding schema versions and PII evolution over time, I actually think it
> is a good feature to keep PII and schema in sync across versions for data
> reproducibility. Consistency is key in time travel scenarios - the
> objective should be to replicate data states accurately, regardless of
> subsequent changes in column tags. On the other hand, organizations
> typically make special arrangements when it comes to addressing compliance
> in the context of time travel. For example, in the data deletion use case,
> special accommodation is needed to address the fact that time travel can
> facilitate restoring the data. Finally, I am not very concerned about
> the case when a field evolves to PII=true while it is still set to
> PII=false in the time travel window. Typically, the time travel window is
> on the order of days, but the regulatory enforcement window is on the order of
> months. Most often, the data versions with PII=false would have cycled out
> of the system before the regulatory enforcement is in effect.
>
> I also think that the catalog-level example in AWS Glue still needs to
> consistently ensure schema compatibility. How does it ensure that the
> columns referenced in the policies are in sync with the Iceberg table
> schema, especially when the Iceberg table schema evolves but the
> policies and referenced columns do not?
>
> Regarding bringing policy and compliance semantics aspects into Iceberg as
> a top level construct, I agree this is taking it a bit too far and might be
> out of scope. Further, compliance policies can be quite complicated, and a
> predefined set of permissions/access controls can be too restrictive and
> not flexible enough to capture various compliance needs, like dynamic data
> masking.
>
> Thanks,
> Walaa.
>
>
> On Wed, Jan 3, 2024 at 10:09 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Thanks for bringing this topic up! I can provide some perspective about
>> AWS Glue's related features.
>>
>> AWS Glue table definition also has a column parameters feature (ref
>> <https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-Column>).
>> This does not serve any governance purpose at this moment, but it is a
>> pretty convenient feature that allows users to add arbitrary tags to
>> columns. As you said, it is technically just a fancier, more
>> structured doc field for a column, and I don't have a strong opinion
>> on whether to add it to Iceberg.
>>
>> If we talk specifically about governance features, I am not sure that
>> column properties are the best way, though. Consider the case of a column
>> which was not PII but becomes PII because a certain law has passed. The
>> operation a user would perform in this case is something like "ALTER
>> TABLE MODIFY COLUMN col SET PROPERTIES ('pii'='true')". However, the Iceberg
>> schema is versioned, which means that if you time travel to some time
>> before the MODIFY COLUMN operation, the PII column is still accessible. So
>> what you really want is to globally set the column to be PII, instead of
>> just in the latest schema, but that becomes a bit incompatible with
>> Iceberg's versioned schema model.
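
A minimal Python sketch of the time-travel concern described above. The
`Column` and `Schema` classes, the per-column `properties` dict, and the
snapshot-to-schema mapping are hypothetical illustrations, not Iceberg APIs.
The point is only that a tag set on the latest schema version is not seen by
a reader that resolves an older schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Column:
    name: str
    # Hypothetical per-column key-value tags (not an existing Iceberg field).
    properties: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Schema:
    schema_id: int
    columns: tuple

    def pii_columns(self):
        """Names of columns tagged pii=true in this schema version."""
        return [c.name for c in self.columns if c.properties.get("pii") == "true"]


# Schema 0: "email" is not yet tagged as PII.
schema_v0 = Schema(0, (Column("id"), Column("email", {"pii": "false"})))
# Schema 1: the result of something like SET PROPERTIES ('pii'='true').
schema_v1 = Schema(1, (Column("id"), Column("email", {"pii": "true"})))

# Hypothetical mapping from snapshot id to the schema current at write time.
snapshot_schema = {100: schema_v0, 200: schema_v1}


def pii_columns_at(snapshot_id):
    """PII columns as seen by a reader that time travels to `snapshot_id`."""
    return snapshot_schema[snapshot_id].pii_columns()


latest_view = pii_columns_at(200)       # tag is visible on the latest schema
time_travel_view = pii_columns_at(100)  # older schema still says pii=false
```

Under these assumptions, a reader of snapshot 100 sees no PII columns even
though "email" has since been reclassified.
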
>>
>> In AWS Glue, such governance features are provided at the policy and table
>> level. Information such as PII and sensitivity level is essentially
>> persisted as LakeFormation policies that are attached to the table but
>> kept separate from it. After users configure column/row-level access to
>> a table through LakeFormation, the table response received by services
>> like EMR Spark, Athena, and Glue ETL will contain additional fields for
>> authorized columns and cell filters (ref
>> <https://docs.aws.amazon.com/glue/latest/webapi/API_GetUnfilteredTableMetadata.html#API_GetUnfilteredTableMetadata_ResponseElements>),
>> which allow these engines to apply the authorization to any schema of the
>> table that will be used for the query. In this approach, the user's policy
>> setting is decoupled from the table's schema evolution over time, which
>> avoids problems like the one above in time travel, and many other types of
>> unintended user configuration mistakes.
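
The decoupling described above can be sketched roughly as follows. This is
an illustrative model only: `TablePolicy`, `authorized_columns`, and
`project` are hypothetical names and do not reflect the actual Glue or
LakeFormation API. The policy sits next to the table rather than inside any
schema version, so the same filter applies to whichever schema a query
resolves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TablePolicy:
    # Hypothetical: column names the caller is allowed to read.
    authorized_columns: frozenset

    def project(self, schema_columns):
        """Keep only the authorized columns, preserving schema order."""
        return [c for c in schema_columns if c in self.authorized_columns]


# Two schema versions of the same table (column names only, for brevity).
schemas = {
    0: ["id", "email", "country"],
    1: ["id", "email", "country", "signup_ts"],
}

# One policy attached to the table, independent of schema version.
policy = TablePolicy(frozenset({"id", "country", "signup_ts"}))

current_view = policy.project(schemas[1])
historical_view = policy.project(schemas[0])
```

In this sketch "email" is filtered out of both the current and the
historical schema, because the policy is not stored in either of them.
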
>>
>> So I think a full governance story would mean adding something similar to
>> Iceberg's table model. For example, we could add a "policy" field that
>> contains sub-fields like the table's basic access permission
>> (READ/WRITE/ADMIN), authorized columns, data filters, etc. I am not sure if
>> Iceberg needs its own policy spec, though; that might go a bit too far.
>>
>> Any thoughts?
>>
>> Best,
>> Jack Ye
>>
>>
>> On Wed, Jan 3, 2024 at 1:10 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Hi Iceberg Developers,
>>>
>>>
>>> I would like to start a discussion on a potential enhancement to Iceberg
>>> around the implementation of key-value style properties (tags) for
>>> individual columns or fields. I believe this feature could have significant
>>> applications, especially in the domain of data governance.
>>>
>>>
>>> Here are some examples of how this feature can be potentially used:
>>>
>>>
>>> * PII Classification: Indicating whether a field contains Personally
>>> Identifiable Information (e.g., PII -> {true, false}).
>>>
>>> * Ontology Mapping: Associating fields with specific ontology terms
>>> (e.g., Type -> {USER_ID, USER_NAME, LOCATION}).
>>>
>>> * Sensitivity Level Setting: Defining the sensitivity level of a field
>>> (e.g., Sensitive -> {High, Medium, Low}).
>>>
>>>
>>> While current workarounds like table-level properties or column-level
>>> comments/docs exist, they lack the structured approach needed for these use
>>> cases. Table-level properties often require constant schema validation and
>>> can be error-prone, especially when not in sync with the table schema.
>>> Additionally, column-level comments, while useful, do not enforce a
>>> standardized format.
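
A small sketch of how table-level properties can drift out of sync with the
schema. All names here are hypothetical: neither the `pii.columns` property
key nor the `Field` class is part of Iceberg. A table-level property refers
to columns by name, so a rename silently invalidates it, while a tag stored
on the field itself moves with the field.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Field:
    field_id: int
    name: str
    # Hypothetical column-level key-value tags.
    tags: dict = field(default_factory=dict)


# Workaround: a table-level property listing PII columns by name.
table_properties = {"pii.columns": "email"}
# Proposal: the tag lives on the field itself.
fields = [Field(1, "id"), Field(2, "email", {"pii": "true"})]


def rename(fields, old, new):
    """Return a new field list with `old` renamed to `new`."""
    return [replace(f, name=new) if f.name == old else f for f in fields]


fields = rename(fields, "email", "contact_email")
current_names = {f.name for f in fields}

# The table-level property still says "email", which no longer exists.
pii_from_table_props = [
    n for n in table_properties["pii.columns"].split(",") if n in current_names
]
# The field-level tag followed the rename.
pii_from_field_tags = [f.name for f in fields if f.tags.get("pii") == "true"]
```

Keeping the table-level property correct here would require validating it
against the schema after every schema change.
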
>>>
>>>
>>> I am also interested in hearing thoughts or experiences around whether
>>> this problem is addressed at the catalog level in any of the
>>> implementations (e.g., AWS Glue). My impression is that even with
>>> catalog-level implementations, there's still a need for continual
>>> validation against the table schema. Further, catalog-specific
>>> implementations will lack a standardized specification. A spec could be
>>> beneficial for areas requiring consistent and structured metadata
>>> management.
>>>
>>>
>>> I realize that introducing this feature may necessitate the development
>>> of APIs in various engines to set these properties or tags, such as
>>> extensions in Spark or Trino SQL. However, I believe it’s a worthwhile
>>> discussion to have, separate from whether Iceberg should include these
>>> features in its APIs. For the sake of this thread we can focus on the
>>> Iceberg APIs aspect.
>>>
>>>
>>> Here are some references to similar concepts in other systems:
>>>
>>>
>>> * Avro attributes: *Avro 1.10.2 Specification - Schemas*
>>> <https://avro.apache.org/docs/1.10.2/spec.html#schemas> (see
>>> "Attributes not defined in this document are permitted as metadata").
>>>
>>> * BigQuery policy tags: *BigQuery Column-level Security*
>>> <https://cloud.google.com/bigquery/docs/column-level-security#set_policy>.
>>>
>>> * Snowflake object tagging: *Snowflake Object Tagging Documentation*
>>> <https://docs.snowflake.com/en/user-guide/object-tagging#create-and-assign-tags>
>>> (see references to "MODIFY COLUMN").
>>>
>>>
>>> Looking forward to your insights on whether addressing this issue at the
>>> Iceberg specification and API level is a reasonable direction.
>>>
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>>
>>>

-- 
John Zhuge
