santosh-d3vpl3x opened a new issue, #18746:
URL: https://github.com/apache/druid/issues/18746
### Motivation
- Reduce dependence on ZooKeeper by offering Consul as a first-class
discovery and leader-election option.
- Many operators already run Consul; supporting it simplifies their Druid
deployments and aligns with existing service catalogs and security (ACL/TLS).
- Completes the HTTP-based discovery path (works with the HTTP server
view and HTTP task runner) for clusters outside Kubernetes.
### Proposed changes
- Add contrib extension `druid-consul-extensions` that wires Consul into
Druid discovery (announcer + service watcher) and leader election for
Coordinator/Overlord.
- Provide configuration for agent host/port, service prefix, datacenter
pinning, ACL token, Basic Auth, TLS/mTLS, watch retry/backoff, health-check
interval/TTL, deregister
timeout, and extra service tags.
- Package the extension and document setup, config examples, and
operational notes.
- Emit metrics for Consul registration, leader election, and watch
retries; log failures clearly.
- Add dev harness docs (e.g., docker run consul) and integration
tests/smoke checks that exercise registration, discovery, and leader failover.
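A hypothetical `runtime.properties` sketch of the configuration surface described above. The extension does not exist yet, so every `druid.discovery.consul.*` key below is illustrative only; the final property names would be defined by the extension:

```properties
# Load the proposed contrib extension (name taken from this proposal)
druid.extensions.loadList=["druid-consul-extensions"]

# Illustrative keys only -- not final property names
druid.discovery.consul.host=127.0.0.1
druid.discovery.consul.port=8500
druid.discovery.consul.servicePrefix=druid
druid.discovery.consul.datacenter=dc1
druid.discovery.consul.aclToken=<token>
druid.discovery.consul.tls.enabled=true
druid.discovery.consul.healthCheck.ttl=PT15S
druid.discovery.consul.deregisterTimeout=PT1M
druid.discovery.consul.extraTags=["druid", "tier-hot"]
```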
### Rationale
- Consul is widely deployed and already handles service catalog + KV +
sessions suitable for leader locks.
- Alternatives considered: keep using ZooKeeper or K8s API-based discovery;
neither fits environments standardized on Consul or operators who want to
retire ZooKeeper.
- Reuses Druid’s existing HTTP-based server view/task runner, avoiding
additional protocols.
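To illustrate how Consul sessions and KV fit leader locks: the sketch below follows Consul's documented session + KV-acquire pattern (`PUT /v1/session/create`, then `PUT /v1/kv/<key>?acquire=<session>`, which returns true for at most one holder). The HTTP calls are injected as callables so the sketch runs without a live agent; the stubs, key names, and function names are all hypothetical, not part of this proposal.

```python
def try_acquire_leadership(create_session, kv_acquire, lock_key, node_id):
    """Attempt to become leader for lock_key.

    create_session() -> session id
        (Consul: PUT /v1/session/create with a TTL)
    kv_acquire(key, value, session) -> bool
        (Consul: PUT /v1/kv/<key>?acquire=<session>)
    Returns (is_leader, session_id).
    """
    session = create_session()
    # Consul grants the lock for a key to at most one session at a time;
    # the losing candidates watch the key and retry when it is released.
    is_leader = kv_acquire(lock_key, node_id, session)
    return is_leader, session


# Stubbed agent for illustration: first caller wins the lock per key.
_lock_holder = {}

def fake_create_session():
    return "session-%d" % (len(_lock_holder) + 1)  # hypothetical ids

def fake_kv_acquire(key, value, session):
    if key in _lock_holder:
        return False
    _lock_holder[key] = (value, session)
    return True

leader, s1 = try_acquire_leadership(
    fake_create_session, fake_kv_acquire, "druid/coordinator/leader", "host-a")
follower, s2 = try_acquire_leadership(
    fake_create_session, fake_kv_acquire, "druid/coordinator/leader", "host-b")
# leader is True, follower is False
```

A real selector would also renew the session on a timer and watch the lock key to detect failover; that machinery is omitted here.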
### Operational impact
- Prereqs: `druid.serverview.type=http` and
`druid.indexer.runner.type=httpRemote` must be set cluster-wide before
switching discovery/leader selectors to Consul.
- Migration path from ZooKeeper:
1) Enable HTTP server view/task runner while still on ZooKeeper
discovery.
2) Load the Consul extension and set Consul configs on all nodes (common
DC).
3) Cut over discovery + leader selectors to Consul per role/tier;
monitor metrics/logs.
   4) Remove the ZooKeeper dependency after stability is validated; to roll
back, point the selectors back to ZooKeeper.
- Rolling/zero downtime: true zero downtime is not realistic because the two
discovery catalogs and leader-election mechanisms cannot be mixed within one
cluster. Roll out the extension and configs while still on ZooKeeper, then do
a fast, coordinated switch to Consul (expect a brief disruption for
restarts and the leader flip).
### Test plan (optional)
- Stand up a local Consul agent and verify node registration, health TTL
updates, service discovery watch, and Coordinator/Overlord leader election
failover.
- Add unit tests for config parsing and session/watch retry logic;
integration tests for watch loop and leader lock behavior.
- Manual smoke doc: docker-run Consul, run minimal Druid nodes with Consul
discovery/election, exercise restart/failover.
- CI setup that can be optionally triggered.
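For the retry-logic unit tests, the watch loop's backoff could follow a standard capped-exponential-with-jitter scheme. The function, parameter names, and defaults below are illustrative (the proposal only names "watch retry/backoff" as a config), and the jitter source is injectable so tests stay deterministic:

```python
import random

def watch_retry_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Delay in seconds before retry `attempt` (0-based) of a failed
    Consul watch: capped exponential backoff with full jitter.
    `rng` returns a float in [0, 1); inject a constant in tests."""
    exp = min(cap, base * (2 ** attempt))
    return exp * rng()

# Deterministic check with jitter pinned to 1.0:
delays = [watch_retry_delay(a, rng=lambda: 1.0) for a in range(8)]
# delays == [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```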
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]