This is an automated email from the ASF dual-hosted git repository.
fanng pushed a commit to branch branch-0.9
in repository https://gitbox.apache.org/repos/asf/gravitino.git
The following commit(s) were added to refs/heads/branch-0.9 by this push:
new 75bfe0b362 [#6946] docs(lineage): add lineage document (#7043)
75bfe0b362 is described below
commit 75bfe0b362ea88607077e23d73ff50715856a4cc
Author: github-actions[bot]
<41898282+github-actions[bot]@users.noreply.github.com>
AuthorDate: Wed Apr 23 10:41:06 2025 +0800
[#6946] docs(lineage): add lineage document (#7043)
### What changes were proposed in this pull request?
add lineage document
### Why are the changes needed?
Fix: #6945
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
just document
Co-authored-by: FANNG <[email protected]>
---
docs/lineage/gravitino-server-lineage.md | 60 +++++++++++++++++
docs/lineage/gravitino-spark-lineage.md | 109 +++++++++++++++++++++++++++++++
docs/lineage/lineage.md | 16 +++++
3 files changed, 185 insertions(+)
diff --git a/docs/lineage/gravitino-server-lineage.md
b/docs/lineage/gravitino-server-lineage.md
new file mode 100644
index 0000000000..c5ccb468f1
--- /dev/null
+++ b/docs/lineage/gravitino-server-lineage.md
@@ -0,0 +1,60 @@
+---
+title: "Gravitino server Lineage support"
+slug: /lineage/gravitino-server-lineage
+keyword: Gravitino OpenLineage
+license: "This software is licensed under the Apache License version 2."
+---
+
+## Overview
+
+Gravitino server provides a pluggable lineage framework to receive, process, and sink OpenLineage events. By leveraging this, you can apply custom processing to lineage events and sink them to your dedicated systems.
+
+## Lineage Configuration
+
+| Configuration item                             | Description                                                                                                                                                                                                                                              | Default value                                          | Required | Since Version    |
+|------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------|----------|------------------|
+| `gravitino.lineage.source`                     | The name of the lineage event source.                                                                                                                                                                                                                    | http                                                   | No       | 0.9.0-incubating |
+| `gravitino.lineage.${sourceName}.sourceClass`  | The name of the lineage source class, which should implement the `org.apache.gravitino.lineage.source.LineageSource` interface.                                                                                                                          | (none)                                                 | No       | 0.9.0-incubating |
+| `gravitino.lineage.processorClass`             | The name of the lineage processor class, which should implement the `org.apache.gravitino.lineage.processor.LineageProcessor` interface. The default noop processor does nothing with the run event.                                                     | `org.apache.gravitino.lineage.processor.NoopProcessor` | No       | 0.9.0-incubating |
+| `gravitino.lineage.sinks`                      | The lineage event sink names (multiple sinks are supported, separated by commas).                                                                                                                                                                        | log                                                    | No       | 0.9.0-incubating |
+| `gravitino.lineage.${sinkName}.sinkClass`      | The name of the lineage sink class, which should implement the `org.apache.gravitino.lineage.sink.LineageSink` interface.                                                                                                                                | (none)                                                 | No       | 0.9.0-incubating |
+| `gravitino.lineage.queueCapacity`              | The total capacity of lineage event queues. When there are multiple lineage sinks, each sink utilizes an isolated event queue. The capacity of each queue is calculated by dividing the value of `gravitino.lineage.queueCapacity` by the number of sinks. | 10000                                                  | No       | 0.9.0-incubating |
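As an illustration of how these settings fit together, a hypothetical `gravitino.conf` fragment registering a custom sink alongside the default log sink might look like this (the sink name `custom` and the class `com.example.MyLineageSink` are invented for this sketch):

```properties
gravitino.lineage.source = http
# Two sinks: the built-in log sink plus a custom one named "custom".
gravitino.lineage.sinks = log,custom
# Hypothetical class implementing org.apache.gravitino.lineage.sink.LineageSink.
gravitino.lineage.custom.sinkClass = com.example.MyLineageSink
# Each of the two sinks gets an isolated queue of capacity 10000 / 2 = 5000.
gravitino.lineage.queueCapacity = 10000
```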
+
+## Lineage HTTP source
+
+The HTTP source provides an endpoint that follows the [OpenLineage API spec](https://openlineage.io/apidocs/openapi/) to receive OpenLineage run events. For example:
+
+```shell
+cat <<EOF >source.json
+{
+ "eventType": "START",
+ "eventTime": "2023-10-28T19:52:00.001+10:00",
+ "run": {
+ "runId": "0176a8c2-fe01-7439-87e6-56a1a1b4029f"
+ },
+ "job": {
+ "namespace": "gravitino-namespace",
+ "name": "gravitino-job1"
+ },
+ "inputs": [{
+ "namespace": "gravitino-namespace",
+ "name": "gravitino-table-identifier"
+ }],
+ "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
+ "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"
+}
+EOF
+
+curl -X POST \
+ -i -H 'Content-Type: application/json' \
+ -d '@source.json' \
+ http://localhost:8090/api/lineage
+```
+
+## Lineage log sink
+
+The log sink writes lineage events to a separate log file, `gravitino_lineage.log`. You can change the default behavior in `conf/log4j2.properties`.
+
+## High watermark status
+
+When the lineage sink operates slowly, lineage events accumulate in the async queue. Once the queue size exceeds 90% of its capacity (the high watermark threshold), the lineage system enters high watermark status. In this state, the lineage source must implement retry and logging mechanisms for rejected events to prevent system overload. The HTTP source returns the `429 Too Many Requests` status code to the client.
\ No newline at end of file
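An event producer can honor this back-pressure with a simple retry loop. The following is a hypothetical sketch (not part of Gravitino); the endpoint URL matches the HTTP source example above:

```python
import time
import urllib.error
import urllib.request


def backoff_delays(attempts, base=1.0, cap=30.0):
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]


def post_lineage_event(payload: bytes,
                       url="http://localhost:8090/api/lineage",
                       attempts=5):
    """POST an OpenLineage run event, retrying when the server answers 429."""
    for delay in backoff_delays(attempts):
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.status
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise          # only 429 signals high watermark; fail fast otherwise
            time.sleep(delay)  # server is at high watermark, back off and retry
    raise RuntimeError("lineage endpoint still rejecting events after retries")
```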
diff --git a/docs/lineage/gravitino-spark-lineage.md
b/docs/lineage/gravitino-spark-lineage.md
new file mode 100644
index 0000000000..01787bb745
--- /dev/null
+++ b/docs/lineage/gravitino-spark-lineage.md
@@ -0,0 +1,109 @@
+---
+title: "Gravitino Spark Lineage support"
+slug: /lineage/gravitino-spark-lineage
+keyword: Gravitino Spark OpenLineage
+license: "This software is licensed under the Apache License version 2."
+---
+
+## Overview
+
+By leveraging the OpenLineage Spark plugin, Gravitino provides a separate Spark plugin to extract data lineage and transform dataset identifiers into Gravitino identifiers.
+
+## Capabilities
+
+- Supports column lineage.
+- Supports lineage across different catalogs like fileset, Iceberg, Hudi, Paimon, Hive, Model, etc.
+- Supports extracting Gravitino datasets from GVFS.
+- Supports both the Gravitino Spark connector and non-Gravitino Spark connectors.
+
+## Gravitino dataset
+
+The Gravitino OpenLineage Spark plugin transforms the Gravitino metalake name into the dataset namespace. The dataset name varies by dataset type when generating lineage information.
+
+When using the [Gravitino Spark connector](/spark-connector/spark-connector.md) to access tables managed by Gravitino, the dataset name follows this format:
+
+| Dataset Type    | Dataset name                                         | Example                    | Since Version    |
+|-----------------|------------------------------------------------------|----------------------------|------------------|
+| Hive catalog    | `${GravitinoCatalogName}.${schemaName}.${tableName}` | `hive_catalog.db.student`  | 0.9.0-incubating |
+| Iceberg catalog | `${GravitinoCatalogName}.${schemaName}.${tableName}` | `iceberg_catalog.db.score` | 0.9.0-incubating |
+| Paimon catalog  | `${GravitinoCatalogName}.${schemaName}.${tableName}` | `paimon_catalog.db.detail` | 0.9.0-incubating |
+| JDBC catalog    | `${GravitinoCatalogName}.${schemaName}.${tableName}` | `jdbc_catalog.db.score`    | 0.9.0-incubating |
+
+For datasets not managed by Gravitino, the dataset name is as follows:
+
+| Dataset Type | Dataset name                                | Example                               | Since Version    |
+|--------------|---------------------------------------------|---------------------------------------|------------------|
+| Hive         | `spark_catalog.${schemaName}.${tableName}`  | `spark_catalog.db.table`              | 0.9.0-incubating |
+| Iceberg      | `${catalogName}.${schemaName}.${tableName}` | `iceberg_catalog.db.table`            | 0.9.0-incubating |
+| JDBC v2      | `${catalogName}.${schemaName}.${tableName}` | `jdbc_catalog.db.table`               | 0.9.0-incubating |
+| JDBC v1      | `spark_catalog.${schemaName}.${tableName}`  | `spark_catalog.postgres.public.table` | 0.9.0-incubating |
+
+When accessing datasets by location (e.g., `SELECT * FROM parquet.${dataset_path}`), the name is derived from the physical path:
+
+| Location Type  | Dataset name                                  | Example                               | Since Version    |
+|----------------|-----------------------------------------------|---------------------------------------|------------------|
+| GVFS location  | `${catalogName}.${schemaName}.${filesetName}` | `fileset_catalog.schema.fileset_a`    | 0.9.0-incubating |
+| Other location | location path                                 | `hdfs://127.0.0.1:9000/tmp/a/student` | 0.9.0-incubating |
+
+For GVFS locations, this plugin adds a `fileset-location` facet that contains the location path.
+
+```json
+"fileset-location": {
+  "location": "${gvfs-virtual-location}",
+  "_producer": "https://github.com/datastrato/...",
+  "_schemaURL": "https://raw.githubusercontent...."
+}
+```
+
+## How to use
+
+1. Download the [Gravitino OpenLineage plugin jar](https://github.com/datastrato/gravitino-openlineage-plugins/tree/main/spark-plugin/) and place it on the Spark classpath.
+2. Add configuration to Spark to enable lineage collection.
+
+Configuration example for the Spark shell:
+
+```shell
+./bin/spark-sql -v \
+--jars /${path}/openlineage-spark_2.12-${gravitino-specific-version}.jar,/${path}/gravitino-spark-connector-runtime-3.5_2.12-${version}.jar \
+--conf spark.plugins="org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin" \
+--conf spark.sql.gravitino.uri=http://localhost:8090 \
+--conf spark.sql.gravitino.metalake=${metalakeName} \
+--conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
+--conf spark.openlineage.transport.type=http \
+--conf spark.openlineage.transport.url=http://localhost:8090 \
+--conf spark.openlineage.transport.endpoint=/api/lineage \
+--conf spark.openlineage.namespace=${metalakeName} \
+--conf spark.openlineage.appName=${appName} \
+--conf spark.openlineage.columnLineage.datasetLineageEnabled=true
+```
+
+Please refer to the [OpenLineage Spark guides](https://openlineage.io/docs/guides/spark/) and the [Gravitino Spark connector](/spark-connector/spark-connector.md) for more details. Additionally, Gravitino provides the following configurations for lineage.
+
+<table>
+ <thead>
+ <tr>
+ <th>Configuration item</th>
+ <th>Description</th>
+ <th>Default value</th>
+ <th>Required</th>
+ <th>Since Version</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><code>spark.sql.gravitino.useGravitinoIdentifier</code></td>
+      <td>Whether to use the Gravitino identifier for datasets not managed by Gravitino. If set to false, the original OpenLineage dataset identifier is used, e.g. <code>hdfs://localhost:9000</code> as the namespace and <code>/path/xx</code> as the name for a Hive table.</td>
+ <td>True</td>
+ <td>No</td>
+ <td>0.9.0-incubating</td>
+ </tr>
+ <tr>
+ <td><code>spark.sql.gravitino.catalogMappings</code></td>
+      <td>Catalog name mapping rules for datasets not managed by Gravitino. For example, <code>spark_catalog:catalog1,iceberg_catalog:catalog2</code> maps <code>spark_catalog</code> to <code>catalog1</code> and <code>iceberg_catalog</code> to <code>catalog2</code>; other catalogs are not mapped.</td>
+ <td>None</td>
+ <td>No</td>
+ <td>0.9.0-incubating</td>
+ </tr>
+ </tbody>
+</table>
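As a rough illustration of the `spark.sql.gravitino.catalogMappings` format, the value reads as comma-separated `source:target` pairs. The following is a hypothetical sketch, not the plugin's actual parsing code:

```python
def parse_catalog_mappings(value: str) -> dict:
    """Parse 'src1:dst1,src2:dst2' into {'src1': 'dst1', 'src2': 'dst2'}."""
    mappings = {}
    for pair in filter(None, (p.strip() for p in value.split(","))):
        source, target = pair.split(":", 1)  # split on the first colon only
        mappings[source.strip()] = target.strip()
    return mappings


# Catalogs absent from the mapping keep their original names.
print(parse_catalog_mappings("spark_catalog:catalog1,iceberg_catalog:catalog2"))
```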
diff --git a/docs/lineage/lineage.md b/docs/lineage/lineage.md
new file mode 100644
index 0000000000..9c39fcc831
--- /dev/null
+++ b/docs/lineage/lineage.md
@@ -0,0 +1,16 @@
+---
+title: "Apache Gravitino Lineage support"
+slug: /lineage/lineage
+keyword: Gravitino Open Lineage
+license: "This software is licensed under the Apache License version 2."
+---
+
+## Overview
+
+Lineage information is critical for metadata systems. Gravitino supports data lineage by leveraging [OpenLineage](https://openlineage.io/). Gravitino provides a dedicated Spark jar to collect lineage information with Gravitino identifiers; please refer to the [Gravitino Spark lineage page](./gravitino-spark-lineage.md). Additionally, the Gravitino server provides a lineage processing framework to receive, process, and sink OpenLineage events to other systems.
+
+## Capabilities
+
+- Supports column lineage.
+- Supports lineage across diverse Gravitino catalogs like fileset, Iceberg, Hudi, Paimon, Hive, Model, etc.
+- Supports Spark.