[PR] feat: add DruidNode deploymentGroup field to support R/B deployments (druid)

via GitHub Tue, 05 May 2026 15:20:35 -0700


jtuglu1 opened a new pull request, #19413:
URL: https://github.com/apache/druid/pull/19413

### Description

Currently, deployment in Druid is geared towards "rolling" deployments,
which, while potentially cheaper and faster, are not the safest deployment
mechanism due to the lack of isolation during new cluster bring-up.

A [red/black](https://octopus.com/blog/blue-green-red-black) deployment (a.k.a.
blue/green if you're sending traffic) is better suited for cases where you want
to bring another Druid cluster up in isolation from the existing one (but in the
same ZK/K8s discoverability namespace). The Overlord already supports this
concept through worker "versioning": it will only schedule peons on MMs/Indexers
that are running the version it is itself configured with, allowing the cluster
to eventually drain the older-version tasks.
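
For readers unfamiliar with worker versioning, the idea can be sketched as a simple filter. This is a minimal illustration, not Druid's actual scheduling code: the `WorkerVersionFilter` and `Worker` names are hypothetical, and it assumes the Overlord compares each worker's advertised version against a configured minimum using plain string ordering.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of version-gated task assignment; not Druid's real classes. */
final class WorkerVersionFilter {

  /** Minimal stand-in for a MiddleManager/Indexer announcement. */
  static final class Worker {
    final String host;
    final String version;

    Worker(String host, String version) {
      this.host = host;
      this.version = version;
    }
  }

  /**
   * Returns the hosts of workers whose version is at least minVersion.
   * Assumption: versions are compared lexicographically as strings.
   */
  static List<String> eligibleHosts(List<Worker> workers, String minVersion) {
    List<String> hosts = new ArrayList<>();
    for (Worker w : workers) {
      if (w.version.compareTo(minVersion) >= 0) {
        hosts.add(w.host);
      }
    }
    return hosts;
  }

  public static void main(String[] args) {
    List<Worker> workers = new ArrayList<>();
    workers.add(new Worker("mm-old", "v1"));
    workers.add(new Worker("mm-new", "v2"));
    // An Overlord configured with "v2" stops assigning new tasks to mm-old,
    // so the old workers drain as their running tasks finish.
    System.out.println(eligibleHosts(workers, "v2"));
  }
}
```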

However, this functionality gets us only part of the way to supporting what's
effectively a zero-downtime (both query + ingest) deployment. To achieve a
fully isolated (with the exception of master nodes: {coordinator, overlord})
deployment environment where we can mirror queries, observe state, etc., we also
need to support version-based routing of queries.
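
To make "version-based routing" concrete, here is a rough sketch of the selection rule, with peons left queryable by every group as step 5 of the process below notes. It is illustrative only: `DeploymentGroupRouting`, `Server`, and `NodeRole` are hypothetical names rather than the classes touched by this PR, and treating two unset (null) groups as a match is an assumption made to keep existing single-group clusters working.

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

/** Hypothetical sketch of deployment-group-aware server selection. */
final class DeploymentGroupRouting {

  enum NodeRole { HISTORICAL, PEON }

  /** Minimal stand-in for a discovered data server. */
  static final class Server {
    final String host;
    final NodeRole role;
    final String deploymentGroup; // null when the node does not declare a group

    Server(String host, NodeRole role, String deploymentGroup) {
      this.host = host;
      this.role = role;
      this.deploymentGroup = deploymentGroup;
    }
  }

  /** Peons are always queryable; historicals only when their group matches the broker's. */
  static boolean isQueryable(String brokerGroup, Server server) {
    if (server.role == NodeRole.PEON) {
      return true;
    }
    // Assumption: two unset (null) groups match, preserving legacy behavior.
    return Objects.equals(brokerGroup, server.deploymentGroup);
  }

  static List<String> queryableHosts(String brokerGroup, List<Server> servers) {
    return servers.stream()
        .filter(s -> isQueryable(brokerGroup, s))
        .map(s -> s.host)
        .collect(Collectors.toList());
  }
}
```

With a `green` broker and servers `hist-red` (red), `hist-green` (green), and `peon-1` (red), `queryableHosts` returns `hist-green` and `peon-1`.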

https://github.com/apache/druid/commit/681cbdee15d279acf976a4e851da3e7a03bbba81
provided support for tier aliases (so duplicate historical tier deployments can
be brought up transparently to the user/operator), and this PR provides the
query routing support.

The combination of these 2 changes supports the following deployment process:

1. Deploy `black`/`green` Druid ASGs: router, broker, historical,
coordinator, overlord, MM, etc.
2. Configure Coordinator dynamic config to set up tier aliases for
new/existing Druid historical tiers (so same set of segments are loaded in
parallel onto equivalent tiers across the 2 versions).
3. Wait for segments to load on the new Druid ASGs.
4. Switch coordinator leader to `new` Druid version coordinator
5. Optionally mirror traffic to the new ASGs (`black`/`green` router/broker
will be able to query only historicals of their same version; peons are by
default queryable by all versions).
6. Switch leader overlord to newer version (using generous timeouts/retries
to avoid ingest task RPC failure)
7. Force supervisor handoff for all running supervisors, wait for all new
tasks to be launched with the `new` Druid version. This handoff process can be
done slowly (e.g. small # of task groups at a time) to avoid spiking lag on any
of the supervisors.
8. Finally, cut over to the `new` router/broker/historicals.
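
The "slow" handoff in step 7 could be driven by something like the sketch below. It is a hedged illustration, not part of this PR: `BatchedHandoff` is a hypothetical helper, the units are plain string IDs (supervisors or task groups), and the `handoffAndWait` callback stands in for whatever call triggers the handoff and blocks until the replacement tasks are running on the new version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper that hands off ingestion units a few at a time. */
final class BatchedHandoff {

  /** Splits items into consecutive batches of at most batchSize elements. */
  static <T> List<List<T>> batches(List<T> items, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<T>> out = new ArrayList<>();
    for (int i = 0; i < items.size(); i += batchSize) {
      int end = Math.min(i + batchSize, items.size());
      out.add(new ArrayList<>(items.subList(i, end)));
    }
    return out;
  }

  /**
   * Runs handoffAndWait on one batch at a time, so only a small slice of
   * ingestion is restarting at any moment. The callback is expected to block
   * until the batch's new tasks are up (an assumption of this sketch).
   */
  static void handoffInBatches(
      List<String> unitIds, int batchSize, Consumer<List<String>> handoffAndWait) {
    for (List<String> batch : batches(unitIds, batchSize)) {
      handoffAndWait.accept(batch);
    }
  }

  public static void main(String[] args) {
    List<String> ids = Arrays.asList("sup-a", "sup-b", "sup-c", "sup-d", "sup-e");
    // The callback only prints here; a real one would trigger and await the handoff.
    handoffInBatches(ids, 2, batch -> System.out.println("handing off " + batch));
  }
}
```

A smaller batch size trades a longer overall cutover for a smaller lag spike per supervisor.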

This deployment method combines the traditional red/black deployment with
Druid's rolling deployment, providing ~zero ingest downtime as well as ~zero
query downtime for users (both in terms of availability and data freshness). It
also provides ample time to experiment/canary changes without impacting user
traffic.

#### Release note

<hr>

This PR has:

- [ ] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
