GitHub user FANNG1 edited a discussion: Discuss adding Gravitino support to the 
OpenLineage community

## Background

Gravitino already has an OpenLineage integration, but the plugin is not open 
sourced in the OpenLineage community. As a result, it is harder to maintain, 
review, and evolve, and harder to keep aligned with OpenLineage's dataset model.

I would like to discuss whether we should first contribute Gravitino support to 
the OpenLineage community.

## Context

OpenLineage dataset naming is datasource-oriented. For example, the Spark 
Iceberg integration can emit the physical storage dataset as the primary 
dataset and add a table identifier through the `symlinks` facet.
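
As a rough illustration of that shape, the sketch below builds a simplified 
dataset entry in which the storage location is the primary identifier and the 
table name is carried in `symlinks`. The bucket, metastore URI, and table name 
are made-up values, required facet metadata such as `_producer` and 
`_schemaURL` is omitted, and `table_identifiers` is a hypothetical helper, not 
part of any OpenLineage client.

```python
from typing import Any

# Simplified dataset entry as it might appear in an event's inputs/outputs.
# All values are illustrative assumptions, not taken from a real event.
dataset: dict[str, Any] = {
    "namespace": "s3://warehouse-bucket",      # physical storage location
    "name": "warehouse/sales/orders",
    "facets": {
        "symlinks": {
            "identifiers": [
                {
                    "namespace": "thrift://metastore-host:9083",
                    "name": "sales.orders",    # logical table identifier
                    "type": "TABLE",
                }
            ]
        }
    },
}


def table_identifiers(ds: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (namespace, name) pairs for symlinks of type TABLE."""
    symlinks = ds.get("facets", {}).get("symlinks", {})
    return [
        (ident["namespace"], ident["name"])
        for ident in symlinks.get("identifiers", [])
        if ident.get("type") == "TABLE"
    ]


print(table_identifiers(dataset))
```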

For Gravitino, the logical resource model is different. A table is normally 
identified through:

- metalake
- catalog
- schema
- table

This four-level hierarchy does not map directly to a single OpenLineage 
dataset name unless a naming convention is chosen.
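
To make the ambiguity concrete, here is a minimal sketch of two ways the four 
levels could be flattened into an OpenLineage `(namespace, name)` pair. Both 
conventions, the `gravitino://` scheme, and every name in the code are 
hypothetical and only show that a choice has to be made; neither is an agreed 
convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GravitinoTableId:
    """Four-level logical identifier of a Gravitino table."""
    metalake: str
    catalog: str
    schema: str
    table: str

    def qualified_name(self) -> str:
        return ".".join((self.metalake, self.catalog, self.schema, self.table))


def metalake_in_namespace(t: GravitinoTableId) -> tuple[str, str]:
    # Hypothetical convention A: the metalake becomes part of the namespace.
    return (f"gravitino://{t.metalake}", f"{t.catalog}.{t.schema}.{t.table}")


def metalake_in_name(t: GravitinoTableId, server: str) -> tuple[str, str]:
    # Hypothetical convention B: the namespace is the server address and the
    # metalake is the first segment of the dataset name.
    return (f"gravitino://{server}", t.qualified_name())


orders = GravitinoTableId("demo_metalake", "hive_catalog", "sales", "orders")
print(metalake_in_namespace(orders))
print(metalake_in_name(orders, "gravitino-host:8090"))
```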

## Proposal

Add Gravitino support in the OpenLineage community first, and keep emitted 
dataset identifiers consistent with OpenLineage conventions.

On the Gravitino side, the server can translate OpenLineage dataset identifiers 
into Gravitino's internal resource model. For example, Gravitino could resolve 
a dataset by using one or more of:

- dataset namespace
- dataset name
- symlinks facet
- catalog dataset facet or custom Gravitino facet
- Gravitino-specific configuration such as the target metalake

This keeps OpenLineage producers aligned with OpenLineage naming, while 
allowing Gravitino to map lineage events back to 
`metalake.catalog.schema.table`.
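
A minimal sketch of what such server-side resolution might look like, assuming 
the target metalake comes from configuration and that a configured mapping 
from OpenLineage namespaces to Gravitino catalogs exists. `resolve_table`, 
`catalog_by_namespace`, and all sample values are hypothetical; this is not 
Gravitino's actual implementation, and it only handles names of the form 
`schema.table`.

```python
from typing import Any, Optional


def resolve_table(
    dataset: dict[str, Any],
    metalake: str,
    catalog_by_namespace: dict[str, str],
) -> Optional[str]:
    """Map an OpenLineage dataset to 'metalake.catalog.schema.table', or None."""
    candidates: list[tuple[str, str]] = []

    # Prefer logical table identifiers from the symlinks facet, if present.
    identifiers = dataset.get("facets", {}).get("symlinks", {}).get("identifiers", [])
    for ident in identifiers:
        if ident.get("type") == "TABLE":
            candidates.append((ident["namespace"], ident["name"]))

    # Fall back to the primary dataset namespace and name.
    candidates.append((dataset["namespace"], dataset["name"]))

    for namespace, name in candidates:
        catalog = catalog_by_namespace.get(namespace)
        parts = name.split(".")
        # Only accept names that look like 'schema.table' in a known namespace.
        if catalog is not None and len(parts) == 2:
            schema, table = parts
            return f"{metalake}.{catalog}.{schema}.{table}"
    return None


# Illustrative input: storage path as primary identifier, table in symlinks.
example = {
    "namespace": "s3://warehouse-bucket",
    "name": "warehouse/sales/orders",
    "facets": {"symlinks": {"identifiers": [
        {"namespace": "thrift://metastore-host:9083",
         "name": "sales.orders", "type": "TABLE"},
    ]}},
}
mapping = {"thrift://metastore-host:9083": "hive_catalog"}
print(resolve_table(example, "demo_metalake", mapping))
```

Datasets that cannot be matched return `None` in this sketch, which leaves the 
caller free to skip them or to fall back to a catalog facet or custom 
Gravitino facet.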

References:

- OpenLineage naming conventions: https://openlineage.io/docs/spec/naming
- Spark Iceberg handler: 
https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/spark3/src/main/java/io/openlineage/spark3/agent/lifecycle/plan/catalog/iceberg/IcebergHandler.java
- Spark Iceberg example event with symlinks: 
https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/integrations/container/pysparkV2ReplaceTableAsSelectCompleteEvent.json


GitHub link: https://github.com/apache/gravitino/discussions/10850
