victorsheng opened a new issue, #17836: URL: https://github.com/apache/dolphinscheduler/issues/17836
### Search before asking

- [x] I have searched in the [DSIP](https://github.com/apache/dolphinscheduler/issues/14102) and found no similar DSIP.

### Motivation

Apache DolphinScheduler currently provides "Timeout Alarms" based on **relative duration** (e.g., alerting if a task runs longer than 30 minutes). However, production SLAs are typically defined by **absolute wall-clock time**.

**Problem Statement:**

* **Business Deadline:** Many pipelines must complete by a specific time (e.g., 08:00 AM) so that downstream business reports are ready on time.
* **Delayed Start:** Critical tasks must start by a certain time (e.g., 02:00 AM). If they are stuck in the queue or delayed by upstream dependencies, the system should alert before the end-time deadline is even reached.
* **Observability Gap:** There is currently no persistent record of SLA violations, making it difficult to generate SLA compliance reports (e.g., "What percentage of tasks finished by 09:00 AM last month?").

Introducing absolute-time SLA monitoring and a dedicated violation record table will provide better governance and auditability for critical data pipelines.

### Design Detail

**1. Metadata Configuration:**

Add the following fields to `t_ds_workflow_definition` and `t_ds_task_definition`:

* `expected_start_time`: Wall-clock time by which the instance must have started (e.g., `02:00`).
* `expected_end_time`: Wall-clock time by which the instance must have finished (e.g., `08:00`).

**2. SLA Record Table:**

Create a new table **`t_ds_sla_violation`** to persist every breach event. Suggested schema:

* `id`: Primary key.
* `workflow_definition_code`: The code of the workflow.
* `instance_id`: ID of the workflow/task instance (if one has been created).
* `violation_type`: Enum (`START_TIME_BREACH`, `END_TIME_BREACH`).
* `expected_time`: The configured SLA time.
* `actual_time`: The time at which the violation was detected.
* `creation_time`: Audit timestamp.

**3. Monitoring Logic (SLA Monitor Thread):**

The Master Server will run a background thread that periodically performs the following steps:

* **Scan definitions:** Identify workflows/tasks with active SLA configurations.
* **Evaluate:**
  * **Start-Time:** A breach occurs if `Current Time > expected_start_time` AND (no instance exists OR the instance is still `SUBMITTED`/`WAITING`).
  * **End-Time:** A breach occurs if `Current Time > expected_end_time` AND the instance status is not `SUCCESS`.
* **Act on each breach:**
  * Trigger an `SLA_ALARM` via the Alert Server.
  * Insert a record into `t_ds_sla_violation` for persistence and UI display.

### Compatibility, Deprecation, and Migration Plan

* **Compatibility:** Fully backward compatible. Workflows and tasks that do not define these fields will skip the SLA check.
* **Database Migration:**
  * Add `expected_start_time` and `expected_end_time` columns to the definition tables (`t_ds_workflow_definition` and `t_ds_task_definition`).
  * New DDL for table `t_ds_sla_violation`.

### Test Plan

* **Functional Testing:**
  * Verify that if a task remains in the `DELAY` or `SERIAL_WAIT` state past its `expected_start_time`, a violation record is created and an alert is sent.
  * Verify that if a task is still `RUNNING` past its `expected_end_time`, the system detects the breach.
* **Persistence Testing:**
  * Check that the `t_ds_sla_violation` table correctly records the `instance_id` (if applicable) and the type of breach.
* **Edge Case Testing:**
  * **Cross-day monitoring:** Test a workflow with a start time of 23:30 and an end time of 01:30 (next day).
  * **Frequency:** Ensure the monitor thread doesn't create duplicate violation records for the same instance in a single cycle.

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
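
To make the evaluation rules from the Monitoring Logic section concrete, here is a minimal Python sketch. It is an illustration only, not DolphinScheduler code (the project itself is Java). The names `SlaConfig`, `resolve_deadlines`, `evaluate`, and `NOT_STARTED_STATES` are hypothetical, and the state strings are the ones used in this issue, which may not match the real status enums. The sketch also assumes that an end time at or before the start time means the deadline falls on the following day, which is one possible way to handle the cross-day case from the Test Plan.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta
from enum import Enum
from typing import List, Optional, Tuple


class ViolationType(Enum):
    START_TIME_BREACH = "START_TIME_BREACH"
    END_TIME_BREACH = "END_TIME_BREACH"


# Assumption: state names copied from the issue text, not from the real enums.
NOT_STARTED_STATES = {"SUBMITTED", "WAITING"}


@dataclass
class SlaConfig:
    # Wall-clock times; None means "no SLA configured" for that side.
    expected_start_time: Optional[time] = None
    expected_end_time: Optional[time] = None


def resolve_deadlines(
    cfg: SlaConfig, schedule_date: date
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Anchor the wall-clock times on a schedule date.

    Assumption: if the end time is not later than the start time,
    the end deadline belongs to the next calendar day.
    """
    start = (datetime.combine(schedule_date, cfg.expected_start_time)
             if cfg.expected_start_time else None)
    end = (datetime.combine(schedule_date, cfg.expected_end_time)
           if cfg.expected_end_time else None)
    if start and end and end <= start:
        end += timedelta(days=1)
    return start, end


def evaluate(
    cfg: SlaConfig,
    schedule_date: date,
    now: datetime,
    instance_state: Optional[str],
) -> List[ViolationType]:
    """Return the breaches visible at `now`.

    `instance_state` is None when no instance exists yet.
    """
    start_deadline, end_deadline = resolve_deadlines(cfg, schedule_date)
    breaches: List[ViolationType] = []
    # Start-time rule: past the deadline and nothing has started.
    if start_deadline and now > start_deadline:
        if instance_state is None or instance_state in NOT_STARTED_STATES:
            breaches.append(ViolationType.START_TIME_BREACH)
    # End-time rule: past the deadline and not SUCCESS.
    if end_deadline and now > end_deadline and instance_state != "SUCCESS":
        breaches.append(ViolationType.END_TIME_BREACH)
    return breaches
```

A definition with neither field set yields no breaches, which matches the backward-compatibility statement. Note that the end-time rule as written in this proposal checks only the current status, so an instance that reached `SUCCESS` after the deadline but before the next scan would not be flagged by this sketch.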

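
The SLA Record Table and the "Frequency" test case could be prototyped as below, using Python's built-in `sqlite3` module purely for illustration. The column names follow the suggested schema; the column types, the `record_violation` helper, and in particular the `UNIQUE` key are assumptions that are not part of the proposal. The key is placed on the definition code, breach type, and resolved deadline rather than on `instance_id`, because `instance_id` can be empty when no instance was ever created.

```python
import sqlite3
from datetime import datetime
from typing import Optional

# Hypothetical DDL: types and the UNIQUE key are assumptions for this sketch.
DDL = """
CREATE TABLE t_ds_sla_violation (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    workflow_definition_code INTEGER NOT NULL,
    instance_id INTEGER,              -- NULL if no instance was created
    violation_type TEXT NOT NULL,     -- START_TIME_BREACH or END_TIME_BREACH
    expected_time TEXT NOT NULL,      -- resolved deadline, ISO 8601
    actual_time TEXT NOT NULL,        -- detection time, ISO 8601
    creation_time TEXT NOT NULL,
    UNIQUE (workflow_definition_code, violation_type, expected_time)
)
"""


def record_violation(
    conn: sqlite3.Connection,
    definition_code: int,
    instance_id: Optional[int],
    violation_type: str,
    expected: datetime,
    detected: datetime,
) -> bool:
    """Insert a violation once; return True only if a new row was written."""
    before = conn.total_changes
    conn.execute(
        "INSERT OR IGNORE INTO t_ds_sla_violation "
        "(workflow_definition_code, instance_id, violation_type, "
        " expected_time, actual_time, creation_time) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            definition_code,
            instance_id,
            violation_type,
            expected.isoformat(),
            detected.isoformat(),
            detected.isoformat(),
        ),
    )
    conn.commit()
    # total_changes only grows when the row was actually inserted.
    return conn.total_changes > before
```

With a helper like this, the monitor thread could send the `SLA_ALARM` only when `record_violation` returns `True`, so repeated scans of the same breach would neither duplicate the record nor re-alert.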