FANNG1 commented on code in PR #6946:
URL: https://github.com/apache/gravitino/pull/6946#discussion_r2050292739
##########
docs/lineage/gravitino-spark-lineage.md:
##########
@@ -0,0 +1,109 @@
+---
+title: "Gravitino Spark Lineage support"
+slug: /lineage/gravitino-spark-lineage
+keyword: Gravitino Spark OpenLineage
+license: "This software is licensed under the Apache License version 2."
+---
+
+## Overview
+
+By leveraging the OpenLineage Spark plugin, Gravitino provides a separate Spark
plugin that extracts data lineage and transforms the dataset identifier into a
Gravitino identifier.
+
+## Capabilities
+
+- Supports column lineage.
+- Supports lineage across different catalogs like fileset, Iceberg, Hudi,
Paimon, Hive, Model, etc.
+- Supports extracting Gravitino datasets from GVFS.
+- Supports both the Gravitino Spark connector and non-Gravitino Spark connectors.
+
+## Gravitino dataset
+
+The Gravitino OpenLineage Spark plugin transforms the Gravitino metalake name
into the dataset namespace. The dataset name varies by dataset type when
generating lineage information.
+
+When using the [Gravitino Spark
connector](/spark-connector/spark-connector.md) to access tables managed by
Gravitino, the dataset name follows this format:
+
+| Dataset Type    | Dataset name                                   | Example                    | Since Version |
+|-----------------|------------------------------------------------|----------------------------|---------------|
+| Hive catalog    | `$GravitinoCatalogName.$schemaName.$tableName` | `hive_catalog.db.student`  | 0.9.0         |
+| Iceberg catalog | `$GravitinoCatalogName.$schemaName.$tableName` | `iceberg_catalog.db.score` | 0.9.0         |
+| Paimon catalog  | `$GravitinoCatalogName.$schemaName.$tableName` | `paimon_catalog.db.detail` | 0.9.0         |
+| JDBC catalog    | `$GravitinoCatalogName.$schemaName.$tableName` | `jdbc_catalog.db.score`    | 0.9.0         |
+
+For datasets not managed by Gravitino, the dataset name is as follows:
+
+| Dataset Type | Dataset name                           | Example                               | Since Version |
+|--------------|----------------------------------------|---------------------------------------|---------------|
+| Hive         | `spark_catalog.$schemaName.$tableName` | `spark_catalog.db.table`              | 0.9.0         |
+| Iceberg      | `$catalogName.$schemaName.$tableName`  | `iceberg_catalog.db.table`            | 0.9.0         |
+| JDBC v2      | `$catalogName.$schemaName.$tableName`  | `jdbc_catalog.db.table`               | 0.9.0         |
+| JDBC v1      | `spark_catalog.$schemaName.$tableName` | `spark_catalog.postgres.public.table` | 0.9.0         |
+
+When accessing datasets by location (e.g., `SELECT * FROM
parquet.$dataset_path`), the name is derived from the physical path:
+
+| Location Type  | Dataset name                                     | Example                               | Since Version |
+|----------------|--------------------------------------------------|---------------------------------------|---------------|
+| GVFS location  | `$GravitinoCatalogName.$schemaName.$filesetName` | `fileset_catalog.schema.fileset_a`    | 0.9.0         |
+| Other location | location path                                    | `hdfs://127.0.0.1:9000/tmp/a/student` | 0.9.0         |
+
+For a GVFS location, the plugin adds a `fileset-location` facet that contains the
location path.
+
+```json
+"fileset-location" :
+{
+"location":"/path/xx",
Review Comment:
gvfs virtual location
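
To make the distinction concrete, here is a minimal sketch of how a GVFS virtual path could map to the dataset name shown in the location table. It assumes the virtual path has the form `gvfs://fileset/<catalog>/<schema>/<fileset>/<sub_path>`; the function name `dataset_name` is hypothetical and is not part of the plugin.

```python
from urllib.parse import urlparse


def dataset_name(location):
    """Return a lineage dataset name for a location (illustrative only).

    Assumption: GVFS virtual paths look like
    gvfs://fileset/<catalog>/<schema>/<fileset>/<sub_path>.
    Any other location is returned unchanged, matching the
    "Other location" row of the table.
    """
    parsed = urlparse(location)
    if parsed.scheme == "gvfs" and parsed.netloc == "fileset":
        # Drop empty segments produced by leading or doubled slashes.
        parts = [p for p in parsed.path.split("/") if p]
        if len(parts) >= 3:
            # catalog.schema.fileset; the sub-path is not part of the name.
            return ".".join(parts[:3])
    return location


print(dataset_name("gvfs://fileset/fileset_catalog/schema/fileset_a/student"))
print(dataset_name("hdfs://127.0.0.1:9000/tmp/a/student"))
```

The first call yields the fileset-style name and the second returns the physical path unchanged, which is why the facet should say explicitly whether `location` is the virtual or the physical one.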