justinmclean commented on code in PR #8066:
URL: https://github.com/apache/gravitino/pull/8066#discussion_r2281167915
##########
docs/manage-jobs-in-gravitino.md:
##########
@@ -0,0 +1,585 @@
+---
+title: "Manage jobs in Gravitino"
+slug: /manage-jobs-in-gravitino
+date: 2025-08-13
+keywords: job, job template, gravitino
+license: "This software is licensed under the Apache License version 2."
+---
+
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## Introduction
+
+Starting from 1.0.0, Apache Gravitino introduces a new submodule called the job
+system, which lets users register, run, and manage jobs. The job system works
+together with the existing metadata management and gives users the ability to
+execute jobs or actions based on metadata, which we call metadata-driven
+actions: for example, running a job to compact some Iceberg tables, or running
+a job to clean up old data based on TTL properties.
+
+The aim of the job system is to provide a unified way to manage job templates
+and jobs, including registering job templates, running jobs based on those
+templates, and so on. The job system acts as a unified job submitter: users run
+jobs through it, but it does not provide the actual job execution capabilities.
+Instead, it relies on existing job executors (schedulers) such as Apache
+Airflow or Apache Livy to execute the jobs. Gravitino's job system provides an
+extensible way to connect to different job executors.
+
+:::note
+1. The job system is a new feature introduced in Gravitino 1.0.0 and is still
+   under development, so some features may not be fully implemented yet.
+2. The job system is not meant to replace existing job executors. It can only
+   run a single job at a time, and it does not support job scheduling for now.
+:::
+
+## Job operations
+
+### Register a new job template
+
+Before running a job, the first step is to register a job template. Currently,
+Gravitino supports two types of job templates: `shell` and `spark` (more job
+template types will be added in the future).
+
+#### Shell job template
+
+The `shell` job template is used to run scripts: the executable can be a shell
+script or any other executable script. The template is defined as follows:
+
+```json
+{
+ "name": "my_shell_job_template",
+ "jobType": "shell",
+ "comment": "A shell job template to run a script",
+ "executable": "/path/to/my_script.sh",
+ "arguments": ["{{arg1}}", "{{arg2}}"],
+ "environments": {
+ "ENV_VAR1": "{{value1}}",
+ "ENV_VAR2": "{{value2}}"
+ },
+ "customFields": {
+ "field1": "{{value1}}",
+ "field2": "{{value2}}"
+ },
+ "scripts": ["/path/to/script1.sh", "/path/to/script2.sh"]
+}
+```
+
+Here is a brief description of the fields in the job template:
+
+- `name`: The name of the job template; must be unique.
+- `jobType`: The type of the job template; use `shell` for a shell job
+  template.
+- `comment`: A comment describing the job template.
+- `executable`: The path to the executable script; can be a shell script or
+  any other executable script.
+- `arguments`: The arguments to pass to the executable script. Placeholders
+  like `{{arg1}}` and `{{arg2}}` are replaced with actual values when the job
+  runs.
+- `environments`: The environment variables to set when running the job.
+  Placeholders like `{{value1}}` and `{{value2}}` are replaced with actual
+  values when the job runs.
+- `customFields`: Custom fields for the job template, used to store additional
+  information. Placeholders like `{{value1}}` and `{{value2}}` are replaced
+  with actual values when the job runs.
+- `scripts`: A list of scripts that can be used by the main executable script.
+
+Please note that:
+
+1. The `executable` and `scripts` must be accessible by the Gravitino server.
+   Currently, Gravitino supports accessing files from the local file system,
+   HTTP(S) URLs, and FTP(S) URLs (more distributed file system support will be
+   added in the future). So the `executable` and `scripts` can be a local file
+   path or a URL like `http://example.com/my_script.sh`.
+2. The `arguments`, `environments`, and `customFields` can use placeholders
+   like `{{arg1}}` and `{{value1}}`, which are replaced with actual values
+   when the job runs, so you can use them to pass dynamic values to the job
+   template.
+3. Gravitino copies the `executable` and `scripts` files to the job working
+   directory when running the job, so you can use relative paths in the
+   `executable` and `scripts` to refer to other scripts in the job working
+   directory.
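To make the placeholder behavior above concrete, here is a minimal, illustrative Python sketch of that kind of substitution. The function name `resolve` and the parameter names are our own; the actual replacement is performed inside the Gravitino server when a job is submitted:

```python
import re

# Matches placeholders of the form {{name}} used in job templates.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def resolve(value: str, params: dict) -> str:
    """Replace each {{key}} in `value` with params[key]; leave unknown
    placeholders untouched. Illustrative sketch, not Gravitino's code."""
    return PLACEHOLDER.sub(lambda m: params.get(m.group(1), m.group(0)), value)

# Arguments as declared in the shell job template above.
template_arguments = ["{{arg1}}", "{{arg2}}"]
# Hypothetical per-run values supplied when the job is submitted.
run_params = {"arg1": "input.csv", "arg2": "output.csv"}

resolved = [resolve(a, run_params) for a in template_arguments]
print(resolved)  # ['input.csv', 'output.csv']
```

The same substitution applies to `environments` and `customFields` values.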
+
+#### Spark job template
+
+The `spark` job template is used to run Spark jobs; currently it supports only
+a Spark application JAR file.
+
+**Note**: Spark job support is still under development. In 1.0.0, only
+registering a Spark job template is supported; running a Spark job is not
+supported yet.
+
+The template is defined as follows:
+
+```json
+{
+ "name": "my_spark_job_template",
+ "jobType": "spark",
+ "comment": "A Spark job template to run a Spark application",
+ "executable": "/path/to/my_spark_app.jar",
+ "arguments": ["{{arg1}}", "{{arg2}}"],
+ "environments": {
+ "ENV_VAR1": "{{value1}}",
+ "ENV_VAR2": "{{value2}}"
+ },
+ "customFields": {
+ "field1": "{{value1}}",
+ "field2": "{{value2}}"
+ },
+ "className": "com.example.MySparkApp",
+ "jars": ["/path/to/dependency1.jar", "/path/to/dependency2.jar"],
+ "files": ["/path/to/file1.txt", "/path/to/file2.txt"],
+ "archives": ["/path/to/archive1.zip", "/path/to/archive2.zip"],
+ "configs": {
+ "spark.executor.memory": "2g",
+ "spark.executor.cores": "2"
+ }
+}
+```
+
+Here is a brief description of the fields in the Spark job template:
+
+- `name`: The name of the job template; must be unique.
+- `jobType`: The type of the job template; use `spark` for a Spark job
+  template.
+- `comment`: A comment describing the job template.
+- `executable`: The path to the Spark application JAR file; can be a local
+  file path or a URL with a supported scheme.
+- `arguments`: The arguments to pass to the Spark application. Placeholders
+  like `{{arg1}}` and `{{arg2}}` are replaced with actual values when the job
+  runs.
+- `environments`: The environment variables to set when running the job.
+  Placeholders like `{{value1}}` and `{{value2}}` are replaced with actual
+  values when the job runs.
+- `customFields`: Custom fields for the job template, used to store additional
+  information. Placeholders like `{{value1}}` and `{{value2}}` are replaced
+  with actual values when the job runs.
+- `className`: The main class of the Spark application; required for a Spark
+  job template.
+- `jars`: A list of JAR files to add to the Spark job classpath; each can be a
+  local file path or a URL with a supported scheme.
+- `files`: A list of files to copy to the working directory of the Spark job;
+  each can be a local file path or a URL with a supported scheme.
+- `archives`: A list of archives to extract into the working directory of the
+  Spark job; each can be a local file path or a URL with a supported scheme.
+- `configs`: A map of Spark configurations to set when running the Spark job.
+  Placeholders like `{{value1}}` are replaced with actual values when the job
+  runs.
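For intuition, the Spark template fields map closely onto standard `spark-submit` options (`--class`, `--jars`, `--files`, `--archives`, `--conf`). The sketch below is our own illustration of that correspondence, assuming a hypothetical helper `to_spark_submit_args`; it is not how Gravitino or its job executors actually launch Spark jobs:

```python
def to_spark_submit_args(template: dict) -> list:
    """Build an illustrative spark-submit command line from a Spark job
    template dict shaped like the JSON example above."""
    args = ["spark-submit", "--class", template["className"]]
    if template.get("jars"):
        args += ["--jars", ",".join(template["jars"])]
    if template.get("files"):
        args += ["--files", ",".join(template["files"])]
    if template.get("archives"):
        args += ["--archives", ",".join(template["archives"])]
    for key, value in template.get("configs", {}).items():
        args += ["--conf", f"{key}={value}"]
    # The application JAR comes last, followed by its arguments.
    args.append(template["executable"])
    args += template.get("arguments", [])
    return args

spark_template = {
    "className": "com.example.MySparkApp",
    "executable": "/path/to/my_spark_app.jar",
    "jars": ["/path/to/dependency1.jar"],
    "configs": {"spark.executor.memory": "2g"},
    "arguments": ["--input", "data.csv"],
}
print(" ".join(to_spark_submit_args(spark_template)))
```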
+
+Note that:
+
+1. The `executable`, `jars`, `files`, and `archives` must be accessible by the
+   Gravitino server. Currently, Gravitino supports accessing files from the
+   local file system, HTTP(S) URLs, and FTP(S) URLs (more distributed file
+   system supports will be added in the future). So the
Review Comment:
support
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]