ArvinZheng opened a new issue #10268:
URL: https://github.com/apache/druid/issues/10268


   Costs are not calculated accurately, so the coordinator fills up a few
   historical nodes and makes them **_hot_** for some intervals.
   
   ### Affected Version
   
   All
   
   ### Description
   We are running a Druid cluster with
   - Druid version `0.15.0-incubating`
   - One Coordinator
   - 120 nodes in one tier serving the most recent 6 months of data
   - 15+ datasources, with segments partitioned by hour
   - `druid.coordinator.balancer.strategy` not set, so it defaults to `cost`
   
   We were re-indexing the segments of all datasources over a few days. This
   added many new segments to the cluster at the same time, and all of them
   were picked up by the same coordination. Within that single coordination,
   there were 100K+ segments (maybe more) to be distributed to the same tier.
   
   During the coordination, I noticed that the coordinator had filled up a few
   historical nodes (more than 90% disk usage while the others had only ~50%).
   I then looked into the cost function and found a potential issue there.
   
   The method below computes the following costs and combines them into the
   total joint cost of distributing a segment to a given ServerHolder:
   
   - joint cost of the segments currently served by this ServerHolder
   - joint cost of the segments in the pending load queue of this ServerHolder
   - joint cost of the segments in the pending unload queue of this
     ServerHolder (subtracted)
   
   The problem is that the pending load queue,
   `server.getPeon().getSegmentsToLoad()`, can change while costs are being
   computed for each segment.
   
   Consider 1000 segments to be distributed, all belonging to the same
   interval. If segment loading runs very fast, segments do not stay in the
   load queue for long. In the extreme case, the queue looks empty every time
   a cost is computed, so the costs computed for each of these segments across
   the historical nodes are exactly the same. If one historical node has the
   lowest cost for this interval, we may end up distributing all of the
   segments to that same node.
   
   ```
   protected double computeCost(
       final DataSegment proposalSegment,
       final ServerHolder server,
       final boolean includeCurrentServer
   )
   {
     final long proposalSegmentSize = proposalSegment.getSize();

     // (optional) Don't include server if it is already serving segment
     if (!includeCurrentServer && server.isServingSegment(proposalSegment)) {
       return Double.POSITIVE_INFINITY;
     }

     // Don't calculate cost if the server doesn't have enough space or is loading the segment
     if (proposalSegmentSize > server.getAvailableSize() || server.isLoadingSegment(proposalSegment)) {
       return Double.POSITIVE_INFINITY;
     }

     // The contribution to the total cost of a given server by proposing to move the segment to that server is...
     double cost = 0d;

     // the sum of the costs of other (exclusive of the proposalSegment) segments on the server
     cost += computeJointSegmentsCost(
         proposalSegment,
         Iterables.filter(server.getServer().iterateAllSegments(), segment -> !proposalSegment.equals(segment))
     );

     // plus the costs of segments that will be loaded
     cost += computeJointSegmentsCost(proposalSegment, server.getPeon().getSegmentsToLoad());

     // minus the costs of segments that are marked to be dropped
     cost -= computeJointSegmentsCost(proposalSegment, server.getPeon().getSegmentsMarkedToDrop());

     return cost;
   }
   ```
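
To make the extreme case concrete, here is a toy simulation. It is not Druid code: the class `StaleQueueDemo`, the simplified `cost` (a count of same-interval segments visible on a server), and the `queueDrainsInstantly` flag are all assumptions made for illustration. It only shows how a greedy lowest-cost choice behaves when pending loads are invisible versus when they are counted.

```java
import java.util.Arrays;

// Toy model, not Druid code: every name here is hypothetical.
class StaleQueueDemo {
    // Simplified cost: number of same-interval segments the strategy can see on a server.
    static double cost(int served, int visiblePending) {
        return served + visiblePending;
    }

    // Greedily assigns `segments` same-interval segments and returns the per-server counts.
    // served[i] is how many same-interval segments server i already serves.
    // When queueDrainsInstantly is true, the pending load queue always looks empty.
    static int[] assign(int[] served, int segments, boolean queueDrainsInstantly) {
        int[] assigned = new int[served.length];
        for (int s = 0; s < segments; s++) {
            int best = 0;
            double bestCost = Double.POSITIVE_INFINITY;
            for (int i = 0; i < served.length; i++) {
                int visible = queueDrainsInstantly ? 0 : assigned[i];
                double c = cost(served[i], visible);
                if (c < bestCost) {
                    bestCost = c;
                    best = i;
                }
            }
            assigned[best]++;
        }
        return assigned;
    }

    public static void main(String[] args) {
        int[] served = {3, 5, 5, 5};
        System.out.println(Arrays.toString(assign(served, 1000, true)));
        System.out.println(Arrays.toString(assign(served, 1000, false)));
    }
}
```

With an always-empty queue, the cheapest server stays cheapest for every segment and receives all of them. When assignments made during the run are counted, the same greedy choice spreads the segments across servers.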
   
   ### Solution
   Take a snapshot of the load queue at the beginning of distributing new
   segments, and update the snapshot every time a segment is distributed to a
   historical node. This keeps track of the changes made during the
   coordination, so we have a more accurate view of the load queue and can
   compute more accurate costs for each historical node.
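
A minimal sketch of the snapshot idea follows. The class `LoadQueueSnapshot` and its methods are hypothetical and are not part of Druid. Servers and segments are represented by plain string ids to keep the example self-contained. The assumption is that the cost function would read `segmentsToLoad(...)` from the snapshot instead of calling `server.getPeon().getSegmentsToLoad()` directly.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not Druid code: tracks pending loads per server for one coordination run.
class LoadQueueSnapshot {
    private final Map<String, Set<String>> pendingByServer = new HashMap<>();

    // Copies the pending segment ids observed at the start of the run.
    LoadQueueSnapshot(Map<String, Set<String>> initialQueues) {
        for (Map.Entry<String, Set<String>> e : initialQueues.entrySet()) {
            pendingByServer.put(e.getKey(), new HashSet<>(e.getValue()));
        }
    }

    // Records an assignment made during this run; it stays visible even if the load finishes quickly.
    void recordAssignment(String server, String segmentId) {
        pendingByServer.computeIfAbsent(server, k -> new HashSet<>()).add(segmentId);
    }

    // Segments the cost function should treat as "to be loaded" on this server.
    Set<String> segmentsToLoad(String server) {
        return Collections.unmodifiableSet(
            pendingByServer.getOrDefault(server, Collections.emptySet()));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> initial = new HashMap<>();
        initial.put("historical-1", new HashSet<>(Collections.singleton("seg-a")));
        LoadQueueSnapshot snapshot = new LoadQueueSnapshot(initial);
        snapshot.recordAssignment("historical-1", "seg-b");
        snapshot.recordAssignment("historical-2", "seg-c");
        System.out.println(snapshot.segmentsToLoad("historical-1").size());
        System.out.println(snapshot.segmentsToLoad("historical-2").size());
    }
}
```

Because the snapshot is a copy, entries do not disappear when the real queue drains, so segments assigned earlier in the same run keep contributing to the cost of the server that received them.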

