[I] [DSIP-104][Alert] Support Absolute Time SLA Monitoring (Start/End Time) [dolphinscheduler]

via GitHub Fri, 02 Jan 2026 00:07:14 -0800


victorsheng opened a new issue, #17836:
URL: https://github.com/apache/dolphinscheduler/issues/17836


   ### Search before asking
   
   - [x] I have searched the 
[DSIP](https://github.com/apache/dolphinscheduler/issues/14102) list and found no 
similar DSIP.
   
   
   ### Motivation
   
   Apache DolphinScheduler currently provides "Timeout Alarms" based on 
**relative duration** (e.g., alerting if a task runs longer than 30 minutes). 
However, production SLAs are typically defined by **absolute wall-clock time**.
   
   **Problem Statement:**
   
   * **Business Deadline:** Many pipelines must complete by a specific time 
(e.g., 08:00 AM) so that downstream business reports are ready on schedule.
   * **Delayed Start:** Critical tasks must start by a certain time (e.g., 
02:00 AM). If they are stuck in the queue or delayed by upstream dependencies, 
the system should alert before the "end-time" is even reached.
   * **Observability Gap:** There is currently no persistent record of SLA 
violations, making it difficult to generate SLA compliance reports (e.g., "What 
percentage of tasks finished by 09:00 AM last month?").
   
   Introducing absolute time SLA monitoring and a dedicated violation record 
table will provide better governance and auditability for critical data 
pipelines.
   
   
   ### Design Detail
   
   **1. Metadata Configuration:**
   Add the following fields to `t_ds_workflow_definition` and 
`t_ds_task_definition`:
   
   * `expected_start_time`: Absolute time the instance must start (e.g., 
`02:00`).
   * `expected_end_time`: Absolute time the instance must finish (e.g., 
`08:00`).
   
   **2. SLA Record Table:**
   Create a new table **`t_ds_sla_violation`** to persist every breach event.
   Suggested schema:
   
   * `id`: Primary Key.
   * `workflow_definition_code`: The code of the workflow.
   * `instance_id`: ID of the workflow/task instance (if created).
   * `violation_type`: Enum (`START_TIME_BREACH`, `END_TIME_BREACH`).
   * `expected_time`: The configured SLA time.
   * `actual_time`: The time when the violation was detected.
   * `creation_time`: Audit timestamp.
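   
   As a rough illustration of the suggested schema, one row could be modeled in Java as below. This is only a sketch: the class name `SlaViolationRecord`, the field types, and the `detectionDelay()` helper are assumptions, since the proposal does not fix column types. `id` and `creation_time` are left to the database.

```java
import java.time.Duration;
import java.time.Instant;

/** The two breach kinds listed in the proposal. */
enum ViolationType { START_TIME_BREACH, END_TIME_BREACH }

/** In-memory shape of one proposed t_ds_sla_violation row (field types are assumptions). */
final class SlaViolationRecord {
    final long workflowDefinitionCode;
    final Long instanceId;             // null when no instance had been created yet
    final ViolationType violationType;
    final Instant expectedTime;        // configured SLA time, resolved to a concrete instant
    final Instant actualTime;          // when the monitor detected the breach

    SlaViolationRecord(long workflowDefinitionCode, Long instanceId, ViolationType violationType,
                       Instant expectedTime, Instant actualTime) {
        if (violationType == null || expectedTime == null || actualTime == null) {
            throw new IllegalArgumentException("violationType, expectedTime and actualTime are required");
        }
        this.workflowDefinitionCode = workflowDefinitionCode;
        this.instanceId = instanceId;
        this.violationType = violationType;
        this.expectedTime = expectedTime;
        this.actualTime = actualTime;
    }

    /** Hypothetical helper: how long after the SLA time the breach was detected. */
    Duration detectionDelay() {
        return Duration.between(expectedTime, actualTime);
    }
}
```

   Keeping `instanceId` nullable mirrors the "(if created)" note above: a start-time breach can be recorded before any instance exists.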
   
   **3. Monitoring Logic (SLA Monitor Thread):**
   The Master Server will run a background thread that periodically:
   
   * **Scans Definitions:** Identifies workflows/tasks with active SLA 
configurations.
   * **Evaluation:**
     * **Start-Time:** Breached if `Current Time > expected_start_time` AND (no instance 
exists OR the instance is still `SUBMITTED`/`WAITING`).
     * **End-Time:** Breached if `Current Time > expected_end_time` AND the instance status is 
not `SUCCESS`.
   * **Action:**
     * Trigger an `SLA_ALARM` via the Alert Server.
     * Insert a record into `t_ds_sla_violation` for persistence and UI display.
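   
   The two evaluation rules above can be sketched as a pure function. This is a minimal sketch, not the Master Server implementation: `SlaEvaluator`, `SlaBreach`, and the simplified `InstanceState` enum are hypothetical names, and the real DolphinScheduler status enums differ. The expected times are assumed to be already resolved to concrete date-times for the run being checked.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical instance states; NONE means no instance exists yet. */
enum InstanceState { NONE, SUBMITTED, WAITING, RUNNING, SUCCESS, FAILURE }

enum SlaBreach { START_TIME_BREACH, END_TIME_BREACH }

final class SlaEvaluator {
    /**
     * Applies the start-time and end-time rules from the proposal.
     * A null expected time means that SLA is not configured and is skipped.
     */
    static List<SlaBreach> evaluate(LocalDateTime now,
                                    LocalDateTime expectedStart,
                                    LocalDateTime expectedEnd,
                                    InstanceState state) {
        List<SlaBreach> breaches = new ArrayList<>();

        // Start-time rule: past the deadline and the instance has not begun running.
        boolean notStarted = state == InstanceState.NONE
                || state == InstanceState.SUBMITTED
                || state == InstanceState.WAITING;
        if (expectedStart != null && now.isAfter(expectedStart) && notStarted) {
            breaches.add(SlaBreach.START_TIME_BREACH);
        }

        // End-time rule: past the deadline and the instance has not succeeded.
        if (expectedEnd != null && now.isAfter(expectedEnd) && state != InstanceState.SUCCESS) {
            breaches.add(SlaBreach.END_TIME_BREACH);
        }
        return breaches;
    }
}
```

   Returning the breaches instead of alerting directly keeps the rule separate from the action step, so the caller can both trigger the `SLA_ALARM` and insert the violation record from the same result.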
   
   ### Compatibility, Deprecation, and Migration Plan
   
   * **Compatibility:** Fully backward compatible. Workflows without these 
fields defined will skip the SLA check.
   * **Database Migration:**
     * Add `expected_start_time` and `expected_end_time` columns to 
`t_ds_workflow_definition` and `t_ds_task_definition`.
     * Add DDL for the new table `t_ds_sla_violation`.
   
   ### Test Plan
   
   * **Functional Testing:**
     * Verify that if a task remains in the `DELAY` or `SERIAL_WAIT` state past 
its `expected_start_time`, a violation record is created and an alert is sent.
     * Verify that if a task is still `RUNNING` past its `expected_end_time`, the 
system detects the breach.
   
   
   * **Persistence Testing:**
     * Check that the `t_ds_sla_violation` table correctly records the 
`instance_id` (if applicable) and the type of breach.
   
   
   * **Edge Case Testing:**
     * **Cross-day monitoring:** Test a workflow with a start time of 23:30 and 
an end time of 01:30 (next day).
     * **Frequency:** Ensure the monitor thread doesn't create duplicate 
violation records for the same instance in a single cycle.
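   
   Both edge cases can be made concrete with a small sketch. Everything here is an assumption layered on the proposal: `SlaWindow` treats an end time that is not after the start time as falling on the next day, and `ViolationDeduplicator` uses an in-memory set as a stand-in for whatever uniqueness guard (for example a unique key on the violation table) the real design would choose. It also assumes the frequency case is meant to hold across repeated scans, not only within one.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.HashSet;
import java.util.Set;

/** Concrete start/end deadlines for one run date (hypothetical helper). */
final class SlaWindow {
    final LocalDateTime start;
    final LocalDateTime end;

    private SlaWindow(LocalDateTime start, LocalDateTime end) {
        this.start = start;
        this.end = end;
    }

    /** Resolves wall-clock SLA times against a run date, rolling the end to the next day if needed. */
    static SlaWindow resolve(LocalDate runDate, LocalTime expectedStart, LocalTime expectedEnd) {
        LocalDateTime start = runDate.atTime(expectedStart);
        LocalDateTime end = runDate.atTime(expectedEnd);
        if (!end.isAfter(start)) {
            end = end.plusDays(1); // assumption: e.g. 23:30 -> 01:30 crosses midnight
        }
        return new SlaWindow(start, end);
    }
}

/** Remembers which breaches were already recorded so repeated scans do not duplicate them. */
final class ViolationDeduplicator {
    private final Set<String> seen = new HashSet<>();

    /** Returns true only the first time a (workflow, run date, breach type) triple is reported. */
    boolean firstSighting(long workflowDefinitionCode, LocalDate runDate, String violationType) {
        return seen.add(workflowDefinitionCode + "|" + runDate + "|" + violationType);
    }
}
```

   For example, `SlaWindow.resolve(LocalDate.of(2026, 1, 2), LocalTime.parse("23:30"), LocalTime.parse("01:30"))` yields an end deadline on 2026-01-03, and a second `firstSighting` call with the same arguments returns `false`.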
   
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

