Re: [PR] [#6946] docs(lineage): add lineage document [gravitino]

via GitHub Fri, 18 Apr 2025 00:49:50 -0700


FANNG1 commented on code in PR #6946:
URL: https://github.com/apache/gravitino/pull/6946#discussion_r2050292739



##########
docs/lineage/gravitino-spark-lineage.md:
##########
@@ -0,0 +1,109 @@
+---
+title: "Gravitino Spark Lineage support"
+slug: /lineage/gravitino-spark-lineage
+keyword: Gravitino Spark OpenLineage
+license: "This software is licensed under the Apache License version 2."
+---
+
+## Overview
+
+By leveraging the OpenLineage Spark plugin, Gravitino provides a separate Spark plugin that extracts data lineage and transforms dataset identifiers into Gravitino identifiers.
+
+## Capabilities
+
+- Supports column lineage.
+- Supports lineage across different catalogs such as fileset, Iceberg, Hudi, Paimon, Hive, and Model.
+- Supports extracting Gravitino datasets from GVFS.
+- Supports both the Gravitino Spark connector and non-Gravitino Spark connectors.
+
+## Gravitino dataset
+
+The Gravitino OpenLineage Spark plugin uses the Gravitino metalake name as the dataset namespace. The dataset name varies by dataset type when generating lineage information.
+
+When using the [Gravitino Spark connector](/spark-connector/spark-connector.md) to access tables managed by Gravitino, the dataset name follows this format:
+
+| Dataset Type    | Dataset name                                   | Example                    | Since Version |
+|-----------------|------------------------------------------------|----------------------------|---------------|
+| Hive catalog    | `$GravitinoCatalogName.$schemaName.$tableName` | `hive_catalog.db.student`  | 0.9.0         |
+| Iceberg catalog | `$GravitinoCatalogName.$schemaName.$tableName` | `iceberg_catalog.db.score` | 0.9.0         |
+| Paimon catalog  | `$GravitinoCatalogName.$schemaName.$tableName` | `paimon_catalog.db.detail` | 0.9.0         |
+| JDBC catalog    | `$GravitinoCatalogName.$schemaName.$tableName` | `jdbc_catalog.db.score`    | 0.9.0         |
+
+For datasets not managed by Gravitino, the dataset name is as follows:
+
+| Dataset Type | Dataset name                           | Example                               | Since Version |
+|--------------|----------------------------------------|---------------------------------------|---------------|
+| Hive         | `spark_catalog.$schemaName.$tableName` | `spark_catalog.db.table`              | 0.9.0         |
+| Iceberg      | `$catalogName.$schemaName.$tableName`  | `iceberg_catalog.db.table`            | 0.9.0         |
+| JDBC v2      | `$catalogName.$schemaName.$tableName`  | `jdbc_catalog.db.table`               | 0.9.0         |
+| JDBC v1      | `spark_catalog.$schemaName.$tableName` | `spark_catalog.postgres.public.table` | 0.9.0         |
+
+When accessing datasets by location (e.g., `SELECT * FROM parquet.$dataset_path`), the name is derived from the physical path:
+
+| Location Type  | Dataset name                                     | Example                               | Since Version |
+|----------------|--------------------------------------------------|---------------------------------------|---------------|
+| GVFS location  | `$GravitinoCatalogName.$schemaName.$filesetName` | `fileset_catalog.schema.fileset_a`    | 0.9.0         |
+| Other location | location path                                    | `hdfs://127.0.0.1:9000/tmp/a/student` | 0.9.0         |
+
+For a GVFS location, the plugin adds a `fileset-location` facet that contains the location path.
+
+```json
+"fileset-location" :
+{
+"location":"/path/xx",

Review Comment:
   gvfs virtual location
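
To illustrate how a GVFS virtual location could relate to the `$GravitinoCatalogName.$schemaName.$filesetName` dataset name in the table above, here is a minimal Python sketch. It assumes the virtual location has the form `gvfs://fileset/<catalog>/<schema>/<fileset>/<sub-path>`, which is not stated in this diff, and the helper name `gvfs_dataset_name` is hypothetical. It is not the plugin's implementation.

```python
from urllib.parse import urlparse


def gvfs_dataset_name(location: str) -> str:
    """Derive `catalog.schema.fileset` from a GVFS virtual location.

    Assumption (not from the PR): the location looks like
    gvfs://fileset/<catalog>/<schema>/<fileset>/<optional sub-path>.
    """
    parsed = urlparse(location)
    if parsed.scheme != "gvfs" or parsed.netloc != "fileset":
        raise ValueError(f"not a GVFS fileset location: {location!r}")

    # Drop empty segments produced by leading or trailing slashes.
    parts = [segment for segment in parsed.path.split("/") if segment]
    if len(parts) < 3:
        raise ValueError(f"expected catalog/schema/fileset in: {location!r}")

    # Only the first three segments identify the fileset; the rest is a sub-path.
    return ".".join(parts[:3])


if __name__ == "__main__":
    print(gvfs_dataset_name("gvfs://fileset/fileset_catalog/schema/fileset_a/sub/dir"))
```

Under that assumed layout, any sub-path below the fileset maps to the same dataset name, which matches the `fileset_catalog.schema.fileset_a` example in the table.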



