GitHub user shrihari7396 edited a discussion: Design Discussion: Embedding 
AlertServer into dolphinscheduler-api Module

Hi all,

I’ve been studying the architectural requirements for embedding the AlertServer 
into the API Server (related to #8975). After reviewing the initialization 
flows in `dolphinscheduler-alert-server` and `dolphinscheduler-api`, I’d like 
to discuss a potential design direction and gather feedback.

My goal is to transition the alerting mechanism from a standalone process to an 
embedded background service while maintaining DolphinScheduler's 
high-availability and reliability standards.

---

## Proposed Technical Direction

### 1. Logic Decoupling (Modularization)

Instead of source-code duplication, refactor the core alerting logic (e.g., 
`AlertBootstrapService`, `AlertSender`) into a reusable library module. The 
`dolphinscheduler-api` will consume this as a dependency, ensuring a single 
source of truth for alerting logic.

### 2. Lifecycle Integration

Use Spring-managed components and `@PostConstruct` hooks within the API Server 
to initialize the alerting engine. This ensures alerting threads are 
orchestrated alongside the API's primary lifecycle, starting only after the 
server successfully joins the Registry.

### 3. Leader Election & High Availability (HA)

To prevent duplicate alert processing in horizontally scaled API deployments, I 
propose leveraging the existing `RegistryClient` (ZooKeeper/Etcd) to implement 
a **Leader-Follower** model. Only the "Leader" API instance will activate the 
`AlertEventLoop`, with standby nodes ready to take over upon leader failure.
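A minimal sketch of the gating logic, under stated assumptions: `LeaderLock`, `InMemoryLeaderLock`, and `AlertLeaderGate` are hypothetical names for illustration and are not existing DolphinScheduler or `RegistryClient` APIs. The in-memory lock only stands in for a registry-backed lock so the example runs without ZooKeeper or Etcd.

```java
/** Hypothetical abstraction over a registry-backed lock; not a DolphinScheduler API. */
interface LeaderLock {
    /** Returns true if this instance holds leadership after the call. */
    boolean tryAcquire(String instanceId);

    /** Gives up leadership if this instance currently holds it. */
    void release(String instanceId);
}

/** In-memory stand-in so the sketch runs without a real registry. */
class InMemoryLeaderLock implements LeaderLock {
    private String holder; // guarded by "this"

    @Override
    public synchronized boolean tryAcquire(String instanceId) {
        if (holder == null) {
            holder = instanceId;
        }
        return holder.equals(instanceId);
    }

    @Override
    public synchronized void release(String instanceId) {
        if (instanceId.equals(holder)) {
            holder = null;
        }
    }
}

/** Decides whether this API instance may run the alert loop. */
class AlertLeaderGate {
    private final LeaderLock lock;
    private final String instanceId;
    private volatile boolean loopActive;

    AlertLeaderGate(LeaderLock lock, String instanceId) {
        this.lock = lock;
        this.instanceId = instanceId;
    }

    /** Called periodically: active only while this instance holds the lock. */
    boolean tick() {
        loopActive = lock.tryAcquire(instanceId);
        return loopActive;
    }

    /** Stops the loop and releases leadership so a standby can take over. */
    void shutdown() {
        loopActive = false;
        lock.release(instanceId);
    }

    boolean isLoopActive() {
        return loopActive;
    }
}
```

In a real deployment the lock would presumably be tied to a registry session, so that a crashed leader loses leadership automatically instead of relying on an explicit `release`.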

### 4. Fault Tolerance & Data Integrity

* **Atomic Claim Mechanism:** Implement SQL-based optimistic locking (e.g., 
`UPDATE ... SET status = 'SENDING', handler_instance = 'ID' WHERE status = 
'PENDING'`) so that each alert row is claimed by at most one API instance.
* **Self-Healing "Janitor" Thread:** Introduce a background monitoring thread 
on the leader node to identify alerts orphaned in a `SENDING` state due to 
unexpected instance crashes and reset them to `PENDING` for re-delivery.
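The claim and recovery cycle described above can be modeled in memory as follows. This is only a sketch: `AlertClaimModel`, the `claimedAt` timestamp, and the staleness timeout are assumptions introduced for illustration, and the atomic map update stands in for a conditional SQL `UPDATE` that targets a single row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory model of claim and recovery; a real version would issue SQL UPDATEs. */
class AlertClaimModel {
    enum Status { PENDING, SENDING }

    /** Immutable row snapshot: status, owning instance, claim time in millis. */
    static final class Row {
        final Status status;
        final String handler;
        final long claimedAt;

        Row(Status status, String handler, long claimedAt) {
            this.status = status;
            this.handler = handler;
            this.claimedAt = claimedAt;
        }
    }

    private final ConcurrentHashMap<Integer, Row> rows = new ConcurrentHashMap<>();

    void addPending(int id) {
        rows.put(id, new Row(Status.PENDING, null, 0L));
    }

    Status statusOf(int id) {
        return rows.get(id).status;
    }

    /** Mirrors a conditional update: succeeds only if the row is still PENDING. */
    boolean claim(int id, String instanceId, long now) {
        boolean[] won = {false};
        rows.computeIfPresent(id, (key, row) -> {
            if (row.status != Status.PENDING) {
                return row; // already claimed by another instance
            }
            won[0] = true;
            return new Row(Status.SENDING, instanceId, now);
        });
        return won[0];
    }

    /** Janitor pass: resets rows stuck in SENDING longer than the timeout. */
    List<Integer> recoverStale(long now, long timeoutMillis) {
        List<Integer> recovered = new ArrayList<>();
        for (Integer id : rows.keySet()) {
            rows.computeIfPresent(id, (key, row) -> {
                if (row.status == Status.SENDING && now - row.claimedAt > timeoutMillis) {
                    recovered.add(key);
                    return new Row(Status.PENDING, null, 0L);
                }
                return row;
            });
        }
        return recovered;
    }
}
```

One consequence worth discussing: a timeout-based reset gives at-least-once delivery, since a handler that is slow rather than dead may still send after its row has been reset and reclaimed.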

### 5. Performance Isolation

Configure a dedicated `ThreadPoolExecutor` for alerting tasks. This prevents 
long-running notification I/O (e.g., slow SMTP or Webhook responses) from 
starving the API's Netty/Tomcat worker threads, keeping the REST interface 
responsive.
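A possible shape for the dedicated pool, using only `java.util.concurrent`. The class name `AlertExecutors`, the `alert-sender-` thread name prefix, and the pool and queue sizes are illustrative assumptions, not existing configuration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical factory for an isolated alert-sending pool. */
class AlertExecutors {

    static ThreadPoolExecutor newAlertSenderPool(int threads, int queueCapacity) {
        AtomicInteger seq = new AtomicInteger();
        // Named daemon threads make alert work easy to spot in thread dumps
        // and never block JVM shutdown.
        ThreadFactory factory = runnable -> {
            Thread t = new Thread(runnable, "alert-sender-" + seq.incrementAndGet());
            t.setDaemon(true);
            return t;
        };
        return new ThreadPoolExecutor(
                threads, threads,                 // fixed-size pool
                60L, TimeUnit.SECONDS,            // keep-alive (unused while core == max)
                new ArrayBlockingQueue<>(queueCapacity), // bounded queue caps the backlog
                factory,
                new ThreadPoolExecutor.AbortPolicy());   // reject instead of running on the caller
    }
}
```

The bounded queue with `AbortPolicy` is a deliberate choice in this sketch: `CallerRunsPolicy` would execute the send on the submitting thread, which defeats the isolation. A rejected alert could simply remain `PENDING` and be picked up on a later poll.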

### 6. SPI Management & Decommissioning

Ensure the API Server remains compatible with the Alert SPI for dynamic plugin 
loading. This plan includes the complete removal of standalone 
`AlertServer.java` entry points, assembly descriptors, and redundant Docker/K8s 
service definitions to simplify the deployment footprint.

---

I would appreciate any feedback or concerns regarding this approach, 
particularly on the distributed coordination strategy, before I proceed further 
with implementation planning.

Best regards,

**Shrihari Rajendrakumar Kulkarni**

GitHub link: https://github.com/apache/dolphinscheduler/discussions/18005
