GitHub user fkostadinov created a discussion: Introduce enforced and versioned 
data product schemas when they are published for cross-domain governance

A key idea of a data mesh is that domains are superimposed onto a data lake - 
to avoid creation of a data swamp. In this way, ownership of datasets is always 
clearly assigned. However, many metadata and data lineage tools simply document 
schema of datasets, but do not enforce them. The consequence is that when a 
data product is published and nobody enforces a schema then the downstream 
consumers of a data product may experience upstream data product producers to 
break the schema contract at some point.

I was wondering whether we could introduce enforced schemas somehow. As long as 
a dataset resides within a domain its schema is allowed to change. But in the 
very moment a data owner publishes a dataset to be consumable beyond a domain 
then there should be an immutable contract downstream consumers can rely on. 
The publication of a data product would enforce version control of the schema 
contract (similar to what e.g. Nessie does). Every change of the schema of a 
data product would require a sign-off from the data owner, and would notify 
registered downstream consumers.

This approach would allow domain-specific data product ownership, but provide 
guarantees to downstream consumers that things do not simply break without 
being at least notified. Every breaking change could be clearly tracked and we 
would always know who the assigned owner of the dataset is.

I am not sure if this would be aligned with the idea of what Gravitino tries to 
be (e.g. only a metadata catalogue, or also a governance platform)? Maybe also 
my idea should be located at a level higher than Gravitino.

GitHub link: https://github.com/apache/gravitino/discussions/8914

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to