This is an automated email from the ASF dual-hosted git repository.
fanng pushed a commit to branch branch-0.9
in repository https://gitbox.apache.org/repos/asf/gravitino.git
The following commit(s) were added to refs/heads/branch-0.9 by this push:
new 75bfe0b362 [#6946] docs(lineage): add lineage document (#7043)
75bfe0b362 is described below
commit 75bfe0b362ea88607077e23d73ff50715856a4cc
Author: github-actions[bot]
<41898282+github-actions[bot]@users.noreply.github.com>
AuthorDate: Wed Apr 23 10:41:06 2025 +0800
[#6946] docs(lineage): add lineage document (#7043)
### What changes were proposed in this pull request?
add lineage document
### Why are the changes needed?
Fix: #6945
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
just document
Co-authored-by: FANNG <[email protected]>
---
docs/lineage/gravitino-server-lineage.md | 60 +++++++++++++++++
docs/lineage/gravitino-spark-lineage.md | 109 +++++++++++++++++++++++++++++++
docs/lineage/lineage.md | 16 +++++
3 files changed, 185 insertions(+)
diff --git a/docs/lineage/gravitino-server-lineage.md
b/docs/lineage/gravitino-server-lineage.md
new file mode 100644
index 0000000000..c5ccb468f1
--- /dev/null
+++ b/docs/lineage/gravitino-server-lineage.md
@@ -0,0 +1,60 @@
+---
+title: "Gravitino server Lineage support"
+slug: /lineage/gravitino-server-lineage
+keyword: Gravitino OpenLineage
+license: "This software is licensed under the Apache License version 2."
+---
+
+## Overview
+
+Gravitino server provides a pluggable lineage framework to receive, process, and sink OpenLineage events. By leveraging this, you can apply custom processing to lineage events and sink them to your dedicated systems.
+
+## Lineage Configuration
+
+| Configuration item                             | Description                                                                                                                                                                                                                                              | Default value                                          | Required | Since Version    |
+|------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------|----------|------------------|
+| `gravitino.lineage.source`                     | The name of the lineage event source.                                                                                                                                                                                                                    | http                                                   | No       | 0.9.0-incubating |
+| `gravitino.lineage.${sourceName}.sourceClass`  | The name of the lineage source class, which should implement the `org.apache.gravitino.lineage.source.LineageSource` interface.                                                                                                                          | (none)                                                 | No       | 0.9.0-incubating |
+| `gravitino.lineage.processorClass`             | The name of the lineage processor class, which should implement the `org.apache.gravitino.lineage.processor.LineageProcessor` interface. The default noop processor does nothing with the run event.                                                     | `org.apache.gravitino.lineage.processor.NoopProcessor` | No       | 0.9.0-incubating |
+| `gravitino.lineage.sinks`                      | The lineage event sink names (multiple sinks are supported, separated by commas).                                                                                                                                                                        | log                                                    | No       | 0.9.0-incubating |
+| `gravitino.lineage.${sinkName}.sinkClass`      | The name of the lineage sink class, which should implement the `org.apache.gravitino.lineage.sink.LineageSink` interface.                                                                                                                                | (none)                                                 | No       | 0.9.0-incubating |
+| `gravitino.lineage.queueCapacity`              | The total capacity of lineage event queues. When there are multiple lineage sinks, each sink utilizes an isolated event queue. The capacity of each queue is calculated by dividing the value of `gravitino.lineage.queueCapacity` by the number of sinks. | 10000                                                  | No       | 0.9.0-incubating |
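As an illustration of how these settings fit together, a hypothetical `gravitino.conf` fragment registering a custom sink alongside the default log sink might look like this (the sink name `custom` and the class `com.example.MyLineageSink` are invented for this sketch):

```properties
gravitino.lineage.source = http
# Two sinks: the built-in log sink plus a custom one named "custom".
gravitino.lineage.sinks = log,custom
# Hypothetical class implementing org.apache.gravitino.lineage.sink.LineageSink.
gravitino.lineage.custom.sinkClass = com.example.MyLineageSink
# Each of the two sinks gets an isolated queue of capacity 10000 / 2 = 5000.
gravitino.lineage.queueCapacity = 10000
```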
+
+## Lineage HTTP source
+
+The HTTP source provides an endpoint that follows the [OpenLineage API spec](https://openlineage.io/apidocs/openapi/) to receive OpenLineage run events. For example:
+
+```shell
+cat <<EOF >source.json
+{
+ "eventType": "START",
+ "eventTime": "2023-10-28T19:52:00.001+10:00",
+ "run": {
+ "runId": "0176a8c2-fe01-7439-87e6-56a1a1b4029f"
+ },
+ "job": {
+ "namespace": "gravitino-namespace",
+ "name": "gravitino-job1"
+ },
+ "inputs": [{
+ "namespace": "gravitino-namespace",
+ "name": "gravitino-table-identifier"
+ }],
+ "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
+ "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"
+}
+EOF
+
+curl -X POST \
+ -i -H 'Content-Type: application/json' \
+ -d '@source.json' \
+ http://localhost:8090/api/lineage
+```
+
+## Lineage log sink
+
+The log sink writes lineage events to a separate log file, `gravitino_lineage.log`. You can change the default behavior in `conf/log4j2.properties`.
+
+## High watermark status
+
+When the lineage sink operates slowly, lineage events accumulate in the async queue. Once the queue size exceeds 90% of its capacity (the high watermark threshold), the lineage system enters high watermark status. In this state, the lineage source must implement retry and logging mechanisms for rejected events to prevent system overload. The HTTP source returns the `429 Too Many Requests` status code to the client.
\ No newline at end of file
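An event producer can honor this back-pressure with a simple retry loop. The following is a hypothetical sketch (not part of Gravitino); the endpoint URL matches the HTTP source example above:

```python
import time
import urllib.error
import urllib.request


def backoff_delays(attempts, base=1.0, cap=30.0):
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]


def post_lineage_event(payload: bytes,
                       url="http://localhost:8090/api/lineage",
                       attempts=5):
    """POST an OpenLineage run event, retrying when the server answers 429."""
    for delay in backoff_delays(attempts):
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.status
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise          # only 429 signals high watermark; fail fast otherwise
            time.sleep(delay)  # server is at high watermark, back off and retry
    raise RuntimeError("lineage endpoint still rejecting events after retries")
```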
diff --git a/docs/lineage/gravitino-spark-lineage.md
b/docs/lineage/gravitino-spark-lineage.md
new file mode 100644
index 0000000000..01787bb745
--- /dev/null
+++ b/docs/lineage/gravitino-spark-lineage.md
@@ -0,0 +1,109 @@
+---
+title: "Gravitino Spark Lineage support"
+slug: /lineage/gravitino-spark-lineage
+keyword: Gravitino Spark OpenLineage
+license: "This software is licensed under the Apache License version 2."
+---
+
+## Overview
+
+By leveraging the OpenLineage Spark plugin, Gravitino provides a separate Spark plugin to extract data lineage and transform dataset identifiers into Gravitino identifiers.
+
+## Capabilities
+
+- Supports column lineage.
+- Supports lineage across different catalogs like fileset, Iceberg, Hudi, Paimon, Hive, Model, etc.
+- Supports extracting Gravitino datasets from GVFS.
+- Supports both the Gravitino Spark connector and non-Gravitino Spark connectors.
+
+## Gravitino dataset
+
+The Gravitino OpenLineage Spark plugin transforms the Gravitino metalake name into the dataset namespace. The dataset name varies by dataset type when generating lineage information.
+
+When using the [Gravitino Spark connector](/spark-connector/spark-connector.md) to access tables managed by Gravitino, the dataset name follows this format:
+
+| Dataset Type    | Dataset name                                         | Example                    | Since Version    |
+|-----------------|------------------------------------------------------|----------------------------|------------------|
+| Hive catalog    | `${GravitinoCatalogName}.${schemaName}.${tableName}` | `hive_catalog.db.student`  | 0.9.0-incubating |
+| Iceberg catalog | `${GravitinoCatalogName}.${schemaName}.${tableName}` | `iceberg_catalog.db.score` | 0.9.0-incubating |
+| Paimon catalog  | `${GravitinoCatalogName}.${schemaName}.${tableName}` | `paimon_catalog.db.detail` | 0.9.0-incubating |
+| JDBC catalog    | `${GravitinoCatalogName}.${schemaName}.${tableName}` | `jdbc_catalog.db.score`    | 0.9.0-incubating |
+
+For datasets not managed by Gravitino, the dataset name is as follows:
+
+| Dataset Type | Dataset name                                | Example                               | Since Version    |
+|--------------|---------------------------------------------|---------------------------------------|------------------|
+| Hive         | `spark_catalog.${schemaName}.${tableName}`  | `spark_catalog.db.table`              | 0.9.0-incubating |
+| Iceberg      | `${catalogName}.${schemaName}.${tableName}` | `iceberg_catalog.db.table`            | 0.9.0-incubating |
+| JDBC v2      | `${catalogName}.${schemaName}.${tableName}` | `jdbc_catalog.db.table`               | 0.9.0-incubating |
+| JDBC v1      | `spark_catalog.${schemaName}.${tableName}`  | `spark_catalog.postgres.public.table` | 0.9.0-incubating |
+
+When accessing datasets by location (e.g., `SELECT * FROM parquet.${dataset_path}`), the name is derived from the physical path:
+
+| Location Type  | Dataset name                                  | Example                               | Since Version    |
+|----------------|-----------------------------------------------|---------------------------------------|------------------|
+| GVFS location  | `${catalogName}.${schemaName}.${filesetName}` | `fileset_catalog.schema.fileset_a`    | 0.9.0-incubating |
+| Other location | location path                                 | `hdfs://127.0.0.1:9000/tmp/a/student` | 0.9.0-incubating |
+
+For GVFS locations, this plugin adds a `fileset-location` facet that contains the location path.
+
+```json
+"fileset-location": {
+  "location": "${gvfs-virtual-location}",
+  "_producer": "https://github.com/datastrato/...",
+  "_schemaURL": "https://raw.githubusercontent...."
+}
+```
+
+## How to use
+
+1. Download the [Gravitino OpenLineage plugin jar](https://github.com/datastrato/gravitino-openlineage-plugins/tree/main/spark-plugin/) and place it on the Spark classpath.
+2. Add configuration to Spark to enable lineage collection.
+
+Configuration example for the Spark shell:
+
+```shell
+./bin/spark-sql -v \
+--jars /${path}/openlineage-spark_2.12-${gravitino-specific-version}.jar,/${path}/gravitino-spark-connector-runtime-3.5_2.12-${version}.jar \
+--conf spark.plugins="org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin" \
+--conf spark.sql.gravitino.uri=http://localhost:8090 \
+--conf spark.sql.gravitino.metalake=${metalakeName} \
+--conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
+--conf spark.openlineage.transport.type=http \
+--conf spark.openlineage.transport.url=http://localhost:8090 \
+--conf spark.openlineage.transport.endpoint=/api/lineage \
+--conf spark.openlineage.namespace=${metalakeName} \
+--conf spark.openlineage.appName=${appName} \
+--conf spark.openlineage.columnLineage.datasetLineageEnabled=true
+```
+
+Please refer to the [OpenLineage Spark guides](https://openlineage.io/docs/guides/spark/) and the [Gravitino Spark connector](/spark-connector/spark-connector.md) for more details. Additionally, Gravitino provides the following configurations for lineage.
+
+<table>
+ <thead>
+ <tr>
+ <th>Configuration item</th>
+ <th>Description</th>
+ <th>Default value</th>
+ <th>Required</th>
+ <th>Since Version</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><code>spark.sql.gravitino.useGravitinoIdentifier</code></td>
+      <td>Whether to use the Gravitino identifier for datasets not managed by Gravitino. If set to false, the original OpenLineage dataset identifier is used, e.g. <code>hdfs://localhost:9000</code> as the namespace and <code>/path/xx</code> as the name for a Hive table.</td>
+ <td>True</td>
+ <td>No</td>
+ <td>0.9.0-incubating</td>
+ </tr>
+ <tr>
+ <td><code>spark.sql.gravitino.catalogMappings</code></td>
+      <td>Catalog name mapping rules for datasets not managed by Gravitino. For example, <code>spark_catalog:catalog1,iceberg_catalog:catalog2</code> maps <code>spark_catalog</code> to <code>catalog1</code> and <code>iceberg_catalog</code> to <code>catalog2</code>; other catalogs are not mapped.</td>
+ <td>None</td>
+ <td>No</td>
+ <td>0.9.0-incubating</td>
+ </tr>
+ </tbody>
+</table>
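As a rough illustration of the `spark.sql.gravitino.catalogMappings` format, the value reads as comma-separated `source:target` pairs. The following is a hypothetical sketch, not the plugin's actual parsing code:

```python
def parse_catalog_mappings(value: str) -> dict:
    """Parse 'src1:dst1,src2:dst2' into {'src1': 'dst1', 'src2': 'dst2'}."""
    mappings = {}
    for pair in filter(None, (p.strip() for p in value.split(","))):
        source, target = pair.split(":", 1)  # split on the first colon only
        mappings[source.strip()] = target.strip()
    return mappings


# Catalogs absent from the mapping keep their original names.
print(parse_catalog_mappings("spark_catalog:catalog1,iceberg_catalog:catalog2"))
```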
diff --git a/docs/lineage/lineage.md b/docs/lineage/lineage.md
new file mode 100644
index 0000000000..9c39fcc831
--- /dev/null
+++ b/docs/lineage/lineage.md
@@ -0,0 +1,16 @@
+---
+title: "Apache Gravitino Lineage support"
+slug: /lineage/lineage
+keyword: Gravitino Open Lineage
+license: "This software is licensed under the Apache License version 2."
+---
+
+## Overview
+
+Lineage information is critical for metadata systems. Gravitino supports data lineage by leveraging [OpenLineage](https://openlineage.io/). Gravitino provides a dedicated Spark jar to collect lineage information with Gravitino identifiers; please refer to the [Gravitino Spark lineage page](./gravitino-spark-lineage.md). Additionally, the Gravitino server provides a lineage processing framework to receive, process, and sink OpenLineage events to other systems.
+
+## Capabilities
+
+- Supports column lineage.
+- Supports lineage across diverse Gravitino catalogs like fileset, Iceberg, Hudi, Paimon, Hive, Model, etc.
+- Supports Spark.