[GitHub] [druid] capistrant opened a new issue #10606: Proposal: Separate Primary Replicant loading from the rest of HistoricalManagementDuties

GitBox Wed, 25 Nov 2020 10:02:42 -0800


capistrant opened a new issue #10606:
URL: https://github.com/apache/druid/issues/10606



   ### Motivation
   
   Loading primary replicants for Druid Segments is one of the most important 
things that the Coordinator does. Without a primary replicant available on the 
cluster, a segment is not available for querying. The Coordinator performs 
primary replicant loading within a set of Coordinator duties that relate to 
Historical Management. Because of this grouping, the Coordinator can spend a 
lot of time on other work, such as loading non-primary replicants and 
balancing segments. While it waits for those duties to complete, no further 
primary replicants are loaded, so data stays unavailable for longer than it 
otherwise would have to, which is a poor end user experience. Breaking primary 
replicant loading out into its own scheduled runnable group would let primary 
replicants be loaded more regularly.
   
   ### Proposed changes
   
   I am proposing an optional new `DutiesRunnable` in the `DruidCoordinator`. 
Operators can choose whether or not to break primary replicant loading out into 
its own `DutiesRunnable`. If they choose not to enable the dedicated primary 
replicant loading, their coordinator will function just as it always has. If 
they choose to enable the dedicated primary replicant loading, their 
coordinator will add a scheduled `DutiesRunnable` that executes the matching 
`LoadRule` for each segment and performs only the primary replicant load for 
that `LoadRule` on each run. The `HistoricalManagement` `DutiesRunnable` will 
continue to perform all other `HistoricalManagement` duties, including 
non-primary replicant loading and replicant dropping, when it executes a 
matched `LoadRule` for a segment.
   
   My POC implementation for the proposal exposes two new Coordinator runtime 
configurations for operators: 
`druid.coordinator.loadPrimaryReplicantSeparately` and 
`druid.coordinator.period.primaryReplicantLoaderPeriod`. If an operator 
enables the first, a scheduled executor with a configurable backoff period is 
set up for loading primary replicants.
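   
   As a rough illustration of the opt-in behavior described above, the sketch 
below decides which duty groups would be scheduled from the two proposed 
configs. It is not the POC code: `DutyGroupPlanSketch`, `DutyGroup`, `plan`, 
and the group name `PrimaryReplicantLoader` are hypothetical names chosen for 
the example, and the periods passed in are placeholders.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: these types are hypothetical and are not Druid classes.
final class DutyGroupPlanSketch {

    /** A named group of duties that runs on its own schedule. */
    static final class DutyGroup {
        final String name;
        final Duration period;

        DutyGroup(String name, Duration period) {
            this.name = name;
            this.period = period;
        }
    }

    /**
     * Decides which duty groups would be scheduled.
     *
     * @param loadPrimaryReplicantSeparately stands in for
     *        druid.coordinator.loadPrimaryReplicantSeparately
     * @param historicalManagementPeriod     period of the existing group
     * @param primaryReplicantLoaderPeriod   stands in for
     *        druid.coordinator.period.primaryReplicantLoaderPeriod
     */
    static List<DutyGroup> plan(
            boolean loadPrimaryReplicantSeparately,
            Duration historicalManagementPeriod,
            Duration primaryReplicantLoaderPeriod) {
        List<DutyGroup> groups = new ArrayList<>();
        // Always scheduled, exactly as before the proposal.
        groups.add(new DutyGroup("HistoricalManagement", historicalManagementPeriod));
        // Only added when the operator opts in.
        if (loadPrimaryReplicantSeparately) {
            groups.add(new DutyGroup("PrimaryReplicantLoader", primaryReplicantLoaderPeriod));
        }
        return groups;
    }
}
```

With the flag left unset, the plan contains only the existing group, which is 
the "functions just as it always has" case.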
   
   The new `DutiesRunnable` would consist of two duties, 
`UpdateCoordinatorStateAndPrepareCluster` and `RunRules`. 
   * There is an open TODO on analyzing the negative effects of having two 
`DutiesRunnable` with `UpdateCoordinatorStateAndPrepareCluster`. It is possible 
that only one of the two should execute the full thing and the other should run 
a scaled down duty.
   
   `RunRules` and `LoadRule` will each need a mode associated with them. 
`RunRules` will execute in one of two modes: one executes only the `LoadRule` 
rules that match, and the other runs every matched `Rule`. `LoadRule` is 
similar but needs three modes: one that loads only the primary replicant, for 
the dedicated primary replicant load; one that skips the primary replicant 
load; and one that runs all of `LoadRule` without regard to replicant type.
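   
   The modes above could be modeled roughly as follows. This is a sketch under 
assumptions, not the POC implementation: the enum names, `shouldRun`, and 
`replicantsToLoad` are hypothetical. In particular, it assumes that the 
skip-primary mode loads nothing while no copy of the segment exists, leaving 
the first copy to the dedicated loader; the proposal does not specify this.

```java
// Illustrative only: names and semantics here are assumptions, not Druid code.
final class RuleModesSketch {

    /** Modes for the RunRules duty. */
    enum RunRulesMode { LOAD_RULES_ONLY, ALL_RULES }

    /** Modes for a LoadRule. */
    enum LoadRuleMode { PRIMARY_ONLY, SKIP_PRIMARY, ALL }

    /** Whether RunRules should execute a matched rule in the given mode. */
    static boolean shouldRun(RunRulesMode mode, boolean isLoadRule) {
        return mode == RunRulesMode.ALL_RULES || isLoadRule;
    }

    /**
     * Number of replicants a LoadRule would assign in one run.
     *
     * @param targetReplicants replicants the rule asks for
     * @param loadedReplicants replicants already loaded on the cluster
     */
    static int replicantsToLoad(LoadRuleMode mode, int targetReplicants, int loadedReplicants) {
        int missing = Math.max(0, targetReplicants - loadedReplicants);
        boolean primaryMissing = loadedReplicants == 0;
        switch (mode) {
            case PRIMARY_ONLY:
                // Load at most the first copy; never touch extra replicants.
                return (primaryMissing && missing > 0) ? 1 : 0;
            case SKIP_PRIMARY:
                // Assumption: wait for the primary before adding replicants.
                return primaryMissing ? 0 : missing;
            case ALL:
                // Pre-proposal behavior: load whatever is missing.
                return missing;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

Under these assumptions, the dedicated loader would run `RunRules` with 
`LOAD_RULES_ONLY` and `LoadRule` with `PRIMARY_ONLY`, while the 
`HistoricalManagement` group would use `ALL_RULES` with `SKIP_PRIMARY` when the 
feature is enabled and `ALL` when it is not.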
   
   ### Rationale
   
   I think the biggest benefit here is more control for the operator to ensure 
that primary replicant loading runs as often as needed. In large clusters 
that do a lot of balancing and non-primary replicant loading because servers 
come in and out of the cluster, primary replicant loading can get blocked 
often enough that users ask why their new segments aren't becoming available 
in a timely manner after batch indexing finishes.
   
   As for alternative approaches, I have not thought of any similar ways to 
achieve this elevated priority for loading primary replicants at this time. I 
am definitely open to suggestions though.
   
   ### Operational impact
   
   This section should describe how the proposed changes will impact the 
operation of existing clusters. It should answer questions such as:
   
   - Is anything going to be deprecated or removed by this change? How will we 
phase out old behavior?
   * N/A
   - Is there a migration path that cluster operators need to be aware of?
   * Enabling this requires coordinator config changes and a restart.
   - Will there be any effect on the ability to do a rolling upgrade, or to do 
a rolling _downgrade_ if an operator wants to switch back to a previous version?
   * A rolling upgrade to the first version that includes this would not 
require any changes, because leaving the configs unset keeps the coordinator 
as is. An operator can enable the feature after the upgrade if they so choose.
   * Downgrading should not have any impact. The configs, even if specified by 
the operator, would be ignored, and the coordinator would go back to how it 
operated before there was a dedicated primary replicant loader.
   
   ### Test plan (optional)
   
   TBD
   
   ### Future work (optional)
   
   TBD
   


