This is an automated email from the ASF dual-hosted git repository.

klesh pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-devlake-website.git


The following commit(s) were added to refs/heads/main by this push:
     new 9d74b55816d docs: Add documentation for environment variables to 
control github graphql job collector (#813)
9d74b55816d is described below

commit 9d74b55816d6225c6ea6169ef3d85c908bcc379c
Author: FlomoN <[email protected]>
AuthorDate: Tue Oct 21 05:46:08 2025 +0200

    docs: Add documentation for environment variables to control github graphql 
job collector (#813)
---
 docs/GettingStarted/Environment.md | 33 +++++++++++++++++++++++++++------
 docs/Plugins/github.md             |  1 +
 2 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/docs/GettingStarted/Environment.md 
b/docs/GettingStarted/Environment.md
index 1c7acd6ad31..19b6272d3df 100644
--- a/docs/GettingStarted/Environment.md
+++ b/docs/GettingStarted/Environment.md
@@ -7,16 +7,19 @@ description: How to set up environment variables for DevLake
 This document explains how to set environment variables for Apache DevLake and 
what environment variables can be set.
 
 ## Environment Variables
+
 ### ENABLE_SUBTASKS_BY_DEFAULT
+
 This environment variable is used to enable or disable the execution of 
subtasks.
 
 #### How to set
+
 The format is as follows: 
plugin_name1:subtask_name1:enabled_value,plugin_name2:subtask_name2:enabled_value,plugin_name3:subtask_name3:enabled_value
-  
+
 Guidance on locating the [plugin_name and 
subtask_name](https://github.com/apache/incubator-devlake/blob/release-v1.0/backend/plugins/jira/tasks/issue_changelog_collector.go#L41):
 
 - plugin_name: Represents the plugin's name, such as 'jira' for the Jira 
plugin.
-- subtask_name: Denotes the subtask's name, like 'collectIssueChangelogs' for 
the Jira plugin."  
+- subtask_name: Denotes the subtask's name, like 'collectIssueChangelogs' for 
the Jira plugin."
 
 Example 1: Enable some subtasks that are closed by default
 
@@ -25,18 +28,36 @@ 
ENABLE_SUBTASKS_BY_DEFAULT="jira:collectIssueChangelogs:true,jira:extractIssueCh
 ```
 
 Example 2: Close some subtasks that are executed by default
+
 ```shell
 ENABLE_SUBTASKS_BY_DEFAULT="github_graphql:Collect Job 
Runs:false,github_graphql:Extract Job Runs:false,github_graphql:Convert Job 
Runs:false"
 ```
 
-#### How to take effect
-After setting the environment variable, restart the DevLake service to take 
effect.
-- For Docker Compose, run `docker-compose down` and `docker-compose up -d`.
-- For Helm, run `helm upgrade devlake devlake/devlake --recreate-pods`.
+### GITHUB_GRAPHQL_JOB\_...
+
+This set of environment variables is used to configure and fine-tune the 
behavior of the GitHub GraphQL Job Runs collection process.
+
+| Environment Variable                    | Description                                                                           | Default Value |
+| --------------------------------------- | ------------------------------------------------------------------------------------- | ------------- |
+| GITHUB_GRAPHQL_JOB_COLLECTION_MODE      | Specifies the mode of job collection. Possible values are `BATCHING` and `PAGINATING` | `BATCHING`    |
+| GITHUB_GRAPHQL_JOB_BATCHING_INPUT_STEP  | Defines the step size for batching mode.                                              | `10`          |
+| GITHUB_GRAPHQL_JOB_BATCHING_PAGE_SIZE   | Defines the limit of jobs to collect in a batch for each run.                         | `20`          |
+| GITHUB_GRAPHQL_JOB_PAGINATING_PAGE_SIZE | Defines the page size for paginating mode.                                            | `50`          |
 
+#### When to Use
 
+These environment variables are particularly useful when dealing with large 
repositories that have a significant number of job runs. By adjusting these 
settings, you can optimize the data collection process to better suit your 
specific needs and infrastructure capabilities. They can also help avoid 
timeouts on the GitHub GraphQL API caused by overly large requests.
 
+- Use `BATCHING` for `GITHUB_GRAPHQL_JOB_COLLECTION_MODE` when your workflow 
runs typically have fewer than 20 jobs and you want to minimize the number of 
API calls to GitHub.
+  - Adjust `GITHUB_GRAPHQL_JOB_BATCHING_INPUT_STEP` and 
`GITHUB_GRAPHQL_JOB_BATCHING_PAGE_SIZE` to control how many jobs are collected 
in each batch. **NOTE:** Increasing these values can lead to timeouts if the 
requests become too large.
+- Use `PAGINATING` for `GITHUB_GRAPHQL_JOB_COLLECTION_MODE` when your workflow 
runs have a large number of jobs (e.g., more than 50). This mode queries only 
one workflow run at a time and paginates through its jobs, reducing the risk 
of timeouts.
+  - Adjust `GITHUB_GRAPHQL_JOB_PAGINATING_PAGE_SIZE` to control how many jobs 
are fetched per page. A smaller page size can help avoid timeouts but may 
increase the total number of API calls.
 
+TL;DR: `BATCHING` is more efficient for smaller workflows, while `PAGINATING` 
will guarantee complete collection of jobs for larger workflows.
 
+## How to take effect
 
+After setting the environment variable, restart the DevLake service to take 
effect.
 
+- For Docker Compose, run `docker-compose down` and `docker-compose up -d`.
+- For Helm, run `helm upgrade devlake devlake/devlake --recreate-pods`.
diff --git a/docs/Plugins/github.md b/docs/Plugins/github.md
index e14cd532abc..992f6641b07 100644
--- a/docs/Plugins/github.md
+++ b/docs/Plugins/github.md
@@ -62,6 +62,7 @@ Metrics that can be calculated based on the data collected 
from GitHub:
 
 - Configuring GitHub via [Config UI](/Configuration/GitHub.md)
 - Configuring GitHub via Config UI's [advanced 
mode](/Configuration/AdvancedMode.md#1-github).
+- Configurable via [Environment 
Variables](/GettingStarted/Environment.md#github_graphql_job_...).
 
 ## API Sample Request
 

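As a rough sketch of the settings documented in the commit above, the snippet below switches the job collector to `PAGINATING` mode with a smaller page size. The variable names and default values are taken from the table in the diff. The value `25` is purely illustrative, and exporting the variables in the shell (rather than placing them in whatever env file or chart values your deployment reads) is an assumption about how DevLake is launched.

```shell
# Variable names and defaults come from the documentation added in this commit;
# the chosen values are illustrative assumptions, not recommendations.

# Query one workflow run at a time and paginate through its jobs
# (the documented default mode is BATCHING).
export GITHUB_GRAPHQL_JOB_COLLECTION_MODE="PAGINATING"

# Fetch fewer jobs per page than the documented default of 50.
# Smaller pages may avoid timeouts at the cost of more API calls.
export GITHUB_GRAPHQL_JOB_PAGINATING_PAGE_SIZE="25"
```

Per the "How to take effect" section of the diff, DevLake needs to be restarted before changed values are picked up.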