GitHub user FANNG1 added a comment to the discussion: Discuss adding Gravitino 
support to the OpenLineage community

I also looked at the current Polaris OpenLineage discussion and PoC branch:

- Polaris dev thread: 
https://www.mail-archive.com/[email protected]/msg04430.html
- Follow-up mentioning an earlier/alternative direction: 
https://www.mail-archive.com/[email protected]/msg04447.html
- PoC branch: https://github.com/iting0321/polaris/tree/data-lineage

Based on the public thread and PoC code, there seem to be two related 
directions being discussed in Polaris:

1. The current PoC emits OpenLineage events from Polaris-managed Iceberg table 
operations. It hooks into Polaris persistence events, generates OpenLineage 
`RunEvent` JSON for table create/update/drop operations, stores that JSON in 
the persisted Polaris event under an `openlineage` property, and adds an HTTP 
listener that posts the payload to a Marquez/OpenLineage endpoint.
2. The earlier/alternative proposal mentioned in the Polaris thread is to make 
Polaris an OpenLineage server implementation. In that model, compute engines 
would send OpenLineage events to Polaris through OpenLineage APIs, and Polaris 
could either persist lineage internally or forward it to a downstream 
OpenLineage backend such as Marquez.
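
To make direction 1 concrete, here is a minimal Python sketch of the forwarding 
step, not the PoC's actual (Java) code. The persisted-event shape (a 
`properties` dict) and the helper names are hypothetical. The URL assumes a 
local Marquez instance, whose OpenLineage endpoint is `/api/v1/lineage`. The 
sketch only builds the request; sending it is left to the caller.

```python
import json
import urllib.request
from typing import Optional

# Assumption: a local Marquez instance listening on its default API port.
LINEAGE_URL = "http://localhost:5000/api/v1/lineage"


def build_lineage_request(event: dict, url: str = LINEAGE_URL) -> urllib.request.Request:
    """Wrap an OpenLineage RunEvent (as a dict) in an HTTP POST request."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def forward_persisted_event(persisted_event: dict) -> Optional[urllib.request.Request]:
    """Hypothetical listener: read the JSON stored under the `openlineage` property."""
    payload = persisted_event.get("properties", {}).get("openlineage")
    if payload is None:
        return None  # this event carries no lineage payload
    return build_lineage_request(json.loads(payload))
```

Passing the returned request to `urllib.request.urlopen` would perform the 
POST.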

The PoC maps Polaris/Iceberg events roughly as follows:

- synthetic job namespace: `polaris.<realm>.<catalog>`
- synthetic job name: `<event_type>:<table_identifier>`
- dataset namespace: Iceberg table namespace
- dataset name: Iceberg table name
- datasource facet: derived from the Iceberg table location, with a Polaris URN 
fallback
- dataset facets: schema, snapshot/version, lifecycle state, and basic output 
statistics
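
The mapping above can be sketched roughly as follows. This is an illustrative 
Python approximation, not the PoC code: the function name, the event type 
string in the usage note, the placeholder producer URI, and the exact URN 
format are assumptions. It also omits the `schemaURL` fields and the 
schema/version/statistics facets that a complete OpenLineage event would carry.

```python
import uuid
from datetime import datetime, timezone

PRODUCER = "https://example.org/polaris-lineage-poc"  # placeholder producer URI


def to_run_event(realm, catalog, event_type, namespace, table, location=None):
    """Map one catalog table event to an OpenLineage-style RunEvent dict."""
    table_identifier = f"{namespace}.{table}"
    # Datasource facet: use the table location if known, else a URN fallback.
    # The URN layout here is hypothetical.
    uri = location or f"urn:polaris:{realm}:{catalog}:{table_identifier}"
    dataset = {
        "namespace": namespace,  # Iceberg table namespace
        "name": table,           # Iceberg table name
        "facets": {"dataSource": {"_producer": PRODUCER, "name": uri, "uri": uri}},
    }
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "run": {"runId": str(uuid.uuid4())},
        "job": {
            "namespace": f"polaris.{realm}.{catalog}",       # synthetic job namespace
            "name": f"{event_type}:{table_identifier}",      # synthetic job name
        },
        "inputs": [],
        "outputs": [dataset],
    }
```

For example, `to_run_event("prod", "sales", "AFTER_CREATE_TABLE", "db1", 
"orders")` yields a job named `AFTER_CREATE_TABLE:db1.orders` in the namespace 
`polaris.prod.sales`.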

One important limitation is that the PoC tries to infer CTAS/input lineage on 
the Polaris server side by tracking `AFTER_LOAD_TABLE` events and correlating 
them with later create/update events using `spark.app.id` or `app-id` from 
Iceberg snapshot summaries. That can work for a demo or for coarse table-level 
lineage, but it is a heuristic:

- the same Spark app can run multiple independent SQL statements;
- table load order does not always mean data dependency;
- cached data, temporary views, retries, and concurrent operations can make the 
inferred lineage inaccurate;
- non-Spark engines or engines that do not propagate an app id cannot be 
correlated;
- the Polaris/Gravitino server does not see the query plan, so it cannot 
provide operator-level or reliable column-level lineage.
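
The first two failure modes are easy to reproduce with a toy model of the 
heuristic. The sketch below is a simplified Python stand-in for app-id 
correlation, not the PoC's implementation; the class and method names are 
made up for illustration.

```python
from collections import defaultdict


class AppIdCorrelator:
    """Guess a write's inputs from the tables the same app id loaded earlier."""

    def __init__(self):
        self._loaded = defaultdict(list)  # app id -> tables loaded so far

    def on_load_table(self, app_id, table):
        if app_id and table not in self._loaded[app_id]:
            self._loaded[app_id].append(table)

    def on_write(self, app_id, output_table):
        if not app_id:
            return []  # no app id propagated: nothing to correlate
        # Every table loaded earlier by this app is assumed to be an input.
        return [t for t in self._loaded[app_id] if t != output_table]


c = AppIdCorrelator()
# Statement 1: CREATE TABLE db.b AS SELECT * FROM db.a
c.on_load_table("app-1", "db.a")
inferred_b = c.on_write("app-1", "db.b")
# Statement 2, independent but in the same Spark app:
# CREATE TABLE db.d AS SELECT * FROM db.c
c.on_load_table("app-1", "db.c")
inferred_d = c.on_write("app-1", "db.d")
```

Here `inferred_b` is `["db.a"]`, which is correct, but `inferred_d` is 
`["db.a", "db.c"]`: `db.a` is reported as an input of `db.d` only because the 
same app loaded it earlier.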

For Gravitino, I think this means we should be careful about relying only on 
server-side lineage inference. A better separation may be:

- engine-side OpenLineage integrations emit the authoritative query/table 
lineage, because engines can see the execution plan;
- the Gravitino server resolves OpenLineage dataset identifiers back to 
`metalake.catalog.schema.table` and enriches the lineage with catalog metadata;
- the Gravitino server may still emit supplemental lifecycle/metadata events, 
such as table create/drop/alter, schema changes, snapshot/version changes, and 
policy/ownership changes.
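
As a rough illustration of the resolution step, the Python sketch below maps 
an OpenLineage dataset `(namespace, name)` pair to a 
`metalake.catalog.schema.table` identifier. Everything in it is an assumption 
made for illustration: the registry, the example namespace and catalog names, 
and the rule that the dataset name is `schema.table`. It is not an existing 
Gravitino API.

```python
# Hypothetical registry: OpenLineage dataset namespace -> (metalake, catalog).
CATALOGS = {
    "hive://metastore.example.com:9083": ("demo_metalake", "hive_catalog"),
}


def resolve_dataset(namespace, name, catalogs=CATALOGS):
    """Return `metalake.catalog.schema.table`, or None if it cannot be resolved."""
    owner = catalogs.get(namespace)
    parts = name.split(".")
    # Assumption: the dataset name has exactly two parts, `schema.table`.
    if owner is None or len(parts) != 2:
        return None
    metalake, catalog = owner
    schema, table = parts
    return f"{metalake}.{catalog}.{schema}.{table}"
```

Returning `None` for unknown datasets keeps lineage from engines outside 
Gravitino's catalogs intact rather than forcing a guess.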

This keeps Gravitino aligned with OpenLineage conventions while avoiding 
inaccurate lineage inference from catalog-server observations alone.


GitHub link: 
https://github.com/apache/gravitino/discussions/10850#discussioncomment-16677640
