capistrant opened a new pull request #10739:
URL: https://github.com/apache/druid/pull/10739


   Fixes #9816 
   
   ### Description
   
   ***Note***: I use the term "Guild" throughout this proposal. It is an arbitrary name that I chose for the Historical grouping construct I am proposing. A guild is a group of Historical servers; any server that does not specify a guild in runtime.properties is assigned to a default guild.
   
   I am adding the idea of guilds to Druid. A guild is a logical grouping of servers. In this PR, the only use of guilds is replicant distribution across Historical servers. The idea was born out of the desire for HDFS-like rack-aware replication for bare-metal, on-prem deployments. For that use case, Druid Historical services are assigned a guild based on the physical rack they live on. The Coordinator uses `SegmentReplicantLookup` to build a per-segment, per-guild replicant count lookup and prefers to load replicants across two or more guilds. It is important to note that, rather than loading fewer replicants than the Druid load rules specify, the Coordinator will load replicants on the same guild if it must.
   
   This idea can go beyond physical racks in a data center and apply to things such as availability zones, or arbitrary Historical server groupings in a virtualized deployment. That is why I came up with the name "guild" instead of saying rack explicitly.
   
   ### Implementation
   
   **Configs**
   runtime.properties:
   * `druid.server.guild` The STRING guild name assigned to a server. A default value is applied if not specified.
   * `druid.coordinator.guildReplication.on` The BOOLEAN flag telling the Coordinator whether it should use server guilds as part of its coordination logic. Default value is false.
   
   coordinator dynamic config:
   * `guildReplicationMaxPercentOfSegmentsToMove` The percentage of segments moved during the `BalanceSegments` duty that should be dedicated to moving segments that do not meet the guild replication threshold of two guilds. This percentage is applied to the budget left over after moving segments off of decommissioning servers, if there are any.
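   
   For illustration, a hypothetical deployment that groups Historicals by physical rack might be configured like this (the rack name is an arbitrary example; the dynamic config is set through the Coordinator's dynamic configuration API as usual):
   
   ```properties
   # On each Historical's runtime.properties: name the guild for the rack it lives on
   druid.server.guild=rack-a12
   
   # On the Coordinator's runtime.properties: opt in to guild-aware replication
   druid.coordinator.guildReplication.on=true
   ```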
   
   `SegmentReplicantLookup`
   
   The Coordinator currently builds this object each time the `HistoricalManagementDuties` run. Prior to this proposal, the object contained two main data structures:
   
   * `Table<SegmentId, String, Integer> segmentsInCluster` - replicant count for each segment, by tier, for segments served by the cluster.
   * `Table<SegmentId, String, Integer> loadingSegments` - replicant count for each segment, by tier, for segments being loaded by the cluster.
   
   My proposal adds a third data structure that is specific to guild 
replication:
   
   * `Table<SegmentId, String, Integer> historicalGuildDistribution` - replicant count for each segment, by guild.
   
   This new structure is only created if guild replication is enabled. It is not worth the resources if we are not going to use it! The structure provides a quick lookup of the guild replication state for a given segment and is used by coordinator duties when making decisions on loading/moving/dropping segments.
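   
   A minimal sketch of how such a table can be built and consulted, assuming Guava's `Table` (the class and method names here are illustrative, not the PR's exact API):
   
   ```java
   import com.google.common.collect.HashBasedTable;
   import com.google.common.collect.Table;
   
   class GuildDistributionSketch
   {
     // segmentId -> guild -> replicant count, mirroring historicalGuildDistribution.
     // (String stands in for Druid's SegmentId to keep the sketch self-contained.)
     private final Table<String, String, Integer> historicalGuildDistribution = HashBasedTable.create();
   
     void addReplicant(String segmentId, String guild)
     {
       Integer count = historicalGuildDistribution.get(segmentId, guild);
       historicalGuildDistribution.put(segmentId, guild, count == null ? 1 : count + 1);
     }
   
     // The guild replication goal is met when a segment's replicants span two or more guilds.
     boolean meetsGuildReplicationGoal(String segmentId)
     {
       return historicalGuildDistribution.row(segmentId).size() > 1;
     }
   }
   ```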
   
   `LoadRule`
   
   When assigning replicants, use the changes in `SegmentReplicantLookup` to split the `ServerHolders` into two groups: servers on a guild that is already serving a replicant of the segment, and servers on a guild that is not. If possible, `LoadRule` will load replicants onto the best-scored server(s) from the guild(s) not serving a replicant. However, we always fall back to loading the specified number of replicants even if replication across guilds cannot be achieved.
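   
   A sketch of that split, under the assumption that a server's guild is reachable from its `ServerHolder` (the `getGuild` accessor here is hypothetical):
   
   ```java
   // Partition candidate servers by whether their guild already serves a replicant
   // of the segment. Servers on guilds not yet serving the segment are the preferred
   // assignment targets; the other group is the fallback, so we never load fewer
   // replicants than the load rules demand.
   static void splitServersByGuild(
       List<ServerHolder> candidates,
       Set<String> guildsServingSegment, // guilds with >= 1 replicant, from SegmentReplicantLookup
       List<ServerHolder> onGuildNotServing,
       List<ServerHolder> onGuildServing
   )
   {
     for (ServerHolder holder : candidates) {
       if (guildsServingSegment.contains(holder.getServer().getGuild())) { // getGuild() is hypothetical
         onGuildServing.add(holder);
       } else {
         onGuildNotServing.add(holder);
       }
     }
   }
   ```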
   
   When picking replicants to drop, we split the `ServerHolders` in the same way. Segments are dropped from decommissioning servers first, then from servers on guilds holding more than one replicant of the segment, and lastly from the remaining servers serving a replicant.
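   
   That drop ordering can be expressed as a simple priority, sketched here (lower values are dropped first; again, the guild accessor is hypothetical):
   
   ```java
   // Drop priority: decommissioning servers first, then servers on guilds that hold
   // more than one replicant of the segment, then the remaining servers.
   static int dropPriority(ServerHolder holder, Map<String, Integer> guildReplicantCounts)
   {
     if (holder.isDecommissioning()) {
       return 0;
     }
     Integer count = guildReplicantCounts.get(holder.getServer().getGuild()); // getGuild() is hypothetical
     return (count != null && count > 1) ? 1 : 2;
   }
   ```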
   
   `BalancerStrategy`
   
   Add a method to the interface: `pickSegmentToMove(List<ServerHolder> serverHolders, Set<String> broadcastDataSources, DruidCoordinatorRuntimeParams params)`.
   
   This new method is added so we have the `SegmentReplicantLookup` information needed to pick a segment that is not replicated across more than one guild when balancing the cluster. It is needed because we introduce the dynamic config and an associated balancing phase in `BalanceSegments` that prioritizes balancing segments that are not properly replicated across guilds; we must pick only segments that meet that criterion, and the existing `pickSegmentToMove` does not suffice.
   
   `RandomBalancerStrategy` and `CostBalancerStrategy` add implementations.
   
   `ReservoirSegmentSampler`
   
   Adds a method for getting a `SegmentHolder` that violates the goal of being replicated on more than one guild. This method needs access to the `SegmentReplicantLookup` in order to quickly look up the replication state of the segments it considers. It returns the first violator that it finds, or null if none is found.
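   
   In spirit, the new sampler method behaves like this sketch (the lookup accessor is a stand-in for whatever `SegmentReplicantLookup` exposes; reservoir-style iteration details are omitted):
   
   ```java
   // Return the first served segment found that is not replicated across more than
   // one guild, or null if every segment meets the goal.
   static DataSegment pickGuildReplicationViolator(
       List<ServerHolder> servers,
       SegmentReplicantLookup lookup
   )
   {
     for (ServerHolder server : servers) {
       for (DataSegment segment : server.getServer().iterateAllSegments()) {
         if (lookup.getGuildsServingSegment(segment.getId()).size() <= 1) { // hypothetical accessor
           return segment;
         }
       }
     }
     return null;
   }
   ```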
   
   `BalanceSegments`
   
   The coordinator balancing duty gets a couple of changes. The first change is to the generic balancing that exists today: if guild replication is enabled, we split the `ServerHolders` based on their guild replication status when looking for a server to move a segment to. Just as in `LoadRule`, we do our best to move to a server that improves or maintains the number of guilds that hold a replicant.
   
   We also add a new phase of balancing segments. There is already a dedicated move for segments off of decommissioning servers; an operator can now also opt in to a dedicated move for segments that are not replicated on more than one guild by setting the dynamic coordinator config that this proposal adds. The result is the Coordinator moving a certain number of segments that violate guild replication rules. It is an optional way for an operator to speed up the balancing of segments across guilds.
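   
   To make the budgeting concrete, here is a sketch of how the new dynamic config carves out the guild-priority move budget (the rounding behavior is illustrative, not taken from the PR's code):
   
   ```java
   // The leftover budget after decommissioning moves is scaled by the dynamic
   // config percentage; e.g. with maxSegmentsToMove = 100, 40 decommissioning
   // moves, and a 50% setting, 30 moves are dedicated to guild-replication fixes.
   static int guildReplicationMoveBudget(int maxSegmentsToMove, int decommissioningMoves, int percent)
   {
     int leftover = Math.max(0, maxSegmentsToMove - decommissioningMoves);
     return (leftover * percent) / 100;
   }
   ```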
   
   ### Design Choice Rationale + Alternatives
   
   There has been some feedback that this functionality should be folded into the existing tier construct. I tend to disagree with that idea, for multiple reasons.
   * Tiers have been around for a long time and have become cornerstones of operators' deployments. It may not be easy or straightforward to re-purpose tiers to also work as generic replication groups. I fear we'd be creating a single monster that is hard to understand, versus having two separate concepts split out into their own things (guilds vs. tiers).
   * As written, tiers are explicit: I load N replicants onto Tier A, M onto Tier B, etc. Guild replication, by contrast, is a generic best-effort goal: I try to spread the aggregate replicants across all tiers onto two or more guilds. We can't make tier loading generic ("load across two or more tiers") because operators expect their load rules to be followed to a T. We would end up with competing goals that do not play nicely together, which ties back to my point that adapting tiering to meet so many goals could create a confusing beast.
   * My motivating goals of uptime and maintainability may require a grouping separate from my existing tiering. Servers from my different tiers may be mixed and matched within the logical groupings I need to separate out in order to achieve those goals. For example, if my grouping for uptime and maintainability is based on physical racks, I could have my slow and fast servers racked on the same set of racks, with a few of each form factor on any given rack. In that case I cannot express both my speed tiering and my replication goals with tiers alone; I would need to further split each rack into tiers, which once again gets complex and probably hurts performance by adding far too much isolation versus shared resources.
   
   So, with adapting tiering ruled out as far as I am concerned, what else could we do? I can't think of configurations more generic than a simple guild name assignment plus a boolean telling the Coordinator whether or not to follow the guild replication logic paths. But what about the implementation details behind these configurations? It could certainly be argued that the cost of the new `SegmentReplicantLookup` data structure is too steep for large clusters. However, I have failed to come up with a better solution: when choosing servers to load segments onto or drop segments from, we need a quick way to look up where a segment is loaded, and it seems logical that this facility requires a large structure in the replicant lookup.
   
   With all this being said, I am open to improvements to my proposal. At the 
end of the day my motivating factors just need to be achieved. If there is a 
better way to achieve the simple goals required, I'm all for it. 
   
   ### Operational impact of deployment
   
   **- Is anything going to be deprecated or removed by this change? How will 
we phase out old behavior?**
   
   Nothing is deprecated or removed.
   
   **- Is there a migration path that cluster operators need to be aware of?**
   
   Going from not using guild replication to using it is a simple migration path. Operators need to assign guilds to their Historical servers and restart them, then flip the coordinator config to turn guild replication on. Optionally, they can use the coordinator dynamic config to speed up balancing based on guild replication.
   
   **- Will there be any effect on the ability to do a rolling upgrade, or to 
do a rolling _downgrade_ if an operator wants to switch back to a previous 
version?**
   
   Upgrading to the first version of Druid that includes guild replication will not require any special work. Upgrading with the default configs keeps guild replication off. Operators who choose to use guild replication can perform the migration steps after the upgrade; those who choose not to need take no action! Downgrading requires a pre-step if guild replication has already been turned on: turn it off on the Coordinator by updating the config and restarting it. That way there will not be any issues with split versions across coordinator/historical during the downgrade.
   
   **- Other**
   
   This change as proposed results in additional resource utilization by the Coordinator. Extra data structures are created in `SegmentReplicantLookup` to help make replicant loading decisions based on the guild distribution of segments. This may require attention from operators when sizing the runtime environment for the Coordinator.
   
   The change as proposed introduces a major change to the Coordinator's decision-making process. Enabling guild-aware replication may result in replicant loading decisions that the Coordinator would previously have considered sub-optimal. Loading replicants to different guilds for better uptime and maintainability could result in segment distribution skew that negatively impacts performance. It is a trade-off that an operator will need to consider carefully.
   
   
   
   <hr>
   
   This PR has:
   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `SegmentReplicantLookup`
    * `CoordinatorDynamicConfig`
    * `BalanceSegments`
    * `LoadRule`
    * `ReservoirSegmentSampler`
    * `BalancerStrategy` interface
    * `DruidCoordinatorConfig`
    * `EmitClusterStatsAndMetrics`
    * `DruidServer`
   

