GitHub user FANNG1 added a comment to the discussion: Discuss adding Gravitino support to the OpenLineage community
I also looked at the current Polaris OpenLineage discussion and PoC branch:

- Polaris dev thread: https://www.mail-archive.com/[email protected]/msg04430.html
- Follow-up mentioning an earlier/alternative direction: https://www.mail-archive.com/[email protected]/msg04447.html
- PoC branch: https://github.com/iting0321/polaris/tree/data-lineage

Based on the public thread and PoC code, there seem to be two related directions being discussed in Polaris:

1. The current PoC emits OpenLineage events from Polaris-managed Iceberg table operations. It hooks into Polaris persistence events, generates OpenLineage `RunEvent` JSON for table create/update/drop operations, stores that JSON in the persisted Polaris event under an `openlineage` property, and adds an HTTP listener that posts the payload to a Marquez/OpenLineage endpoint.
2. The earlier/alternative proposal mentioned in the Polaris thread is to make Polaris an OpenLineage server implementation. In that model, compute engines would send OpenLineage events to Polaris through OpenLineage APIs, and Polaris could either persist lineage internally or forward it to a downstream OpenLineage backend such as Marquez.

The PoC maps Polaris/Iceberg events roughly as follows:

- synthetic job namespace: `polaris.<realm>.<catalog>`
- synthetic job name: `<event_type>:<table_identifier>`
- dataset namespace: Iceberg table namespace
- dataset name: Iceberg table name
- datasource facet: derived from the Iceberg table location, with a Polaris URN fallback
- dataset facets: schema, snapshot/version, lifecycle state, and basic output statistics

One important limitation is that the PoC tries to infer CTAS/input lineage at the Polaris server side by tracking `AFTER_LOAD_TABLE` events and correlating them with later create/update events using `spark.app.id` or `app-id` from Iceberg snapshot summaries.
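To make the job naming and the app-id correlation described above more concrete, here is a minimal Python sketch. It is not taken from the PoC branch: `AppIdCorrelator`, `on_load_table`, and `on_write` are hypothetical names, and it assumes that a load event can somehow be associated with an app id on the server side.

```python
from collections import defaultdict
from dataclasses import dataclass, field


def job_namespace(realm: str, catalog: str) -> str:
    # Follows the `polaris.<realm>.<catalog>` pattern listed above.
    return f"polaris.{realm}.{catalog}"


def job_name(event_type: str, table_identifier: str) -> str:
    # Follows the `<event_type>:<table_identifier>` pattern listed above.
    return f"{event_type}:{table_identifier}"


@dataclass
class AppIdCorrelator:
    """Hypothetical helper: remembers which tables each app id has loaded."""

    loaded: dict = field(default_factory=lambda: defaultdict(list))

    def on_load_table(self, app_id: str, table: str) -> None:
        # Assumption: the load event carries (or can be tied to) an app id.
        if app_id and table not in self.loaded[app_id]:
            self.loaded[app_id].append(table)

    def on_write(self, snapshot_summary: dict, output_table: str) -> list:
        # Look up the app id from the snapshot summary, trying both keys.
        app_id = snapshot_summary.get("spark.app.id") or snapshot_summary.get("app-id")
        if app_id is None:
            return []  # no app id, so nothing can be correlated
        # Every table loaded earlier by the same app is treated as an input.
        return [t for t in self.loaded.get(app_id, []) if t != output_table]


correlator = AppIdCorrelator()
correlator.on_load_table("app-1", "db.a")
correlator.on_load_table("app-1", "db.b")
inferred_inputs = correlator.on_write({"spark.app.id": "app-1"}, "db.c")
```

In this sketch `inferred_inputs` contains both `db.a` and `db.b`, regardless of whether the statement that wrote `db.c` actually read both tables. The server only sees load order and a shared app id.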
This correlation can work for a demo or coarse table-level lineage, but it is heuristic:

- the same Spark app can run multiple independent SQL statements;
- table load order does not always mean data dependency;
- cached data, temporary views, retries, and concurrent operations can make the inferred lineage inaccurate;
- non-Spark engines or engines that do not propagate an app id cannot be correlated;
- the Polaris/Gravitino server side does not see the query plan, so it cannot provide operator-level or reliable column-level lineage.

For Gravitino, I think this means we should be careful about relying only on server-side lineage inference. A better separation may be:

- engine-side OpenLineage integrations emit the authoritative query/table lineage, because engines can see the execution plan;
- the Gravitino server resolves OpenLineage dataset identifiers back to `metalake.catalog.schema.table` and enriches the lineage with catalog metadata;
- the Gravitino server may still emit supplemental lifecycle/metadata events, such as table create/drop/alter, schema changes, snapshot/version changes, and policy/ownership changes.

This keeps Gravitino aligned with OpenLineage conventions while avoiding inaccurate lineage inference from catalog-server observations alone.

GitHub link: https://github.com/apache/gravitino/discussions/10850#discussioncomment-16677640

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
