ArvinZheng opened a new issue #10268:
URL: https://github.com/apache/druid/issues/10268
Costs are not calculated accurately, so the coordinator fills up a few
historical nodes and makes them **_hot_** for some intervals.
### Affected Version
All
### Description
We are running a Druid cluster which has:
- Druid version `0.15.0-incubating`
- One Coordinator
- 120 nodes in one tier serving the most recent 6 months of data
- 15+ datasources, with segments partitioned by hour
- `druid.coordinator.balancer.strategy` not set, defaulting to `cost`
We had been re-indexing segments of all datasources for a few days, which
added many new segments to the cluster at the same time, and all of them were
picked up by the same coordination run. Within that single run, there were
100K+ segments (maybe more) to be distributed to the same tier.
During the coordination run, I noticed that a few historical nodes were filled
up by the coordinator (more than 90% disk usage, while the others sat at only
~50%), so I looked into the cost function and found a potential issue there.
In the following method, we compute three costs and sum them up as the total
joint cost of assigning a segment to a given ServerHolder:
- the joint cost against the segments currently served by this ServerHolder
- the joint cost against the segments in the pending load queue of this ServerHolder
- the joint cost against the segments in the pending unload queue of this ServerHolder

The problem is that the pending load queue,
`server.getPeon().getSegmentsToLoad()`, can change while costs are being
computed for each segment.
Consider 1000 segments to be distributed, all belonging to the same interval.
If segment loading runs fast enough that segments do not stay in the queue for
long, then in the extreme case the cost of assigning each of these segments is
identical across all historical nodes, and if a single node happens to have
the lowest cost for that interval, we may end up assigning all 1000 segments
to that same node.
```
protected double computeCost(
    final DataSegment proposalSegment,
    final ServerHolder server,
    final boolean includeCurrentServer
)
{
  final long proposalSegmentSize = proposalSegment.getSize();

  // (optional) Don't include server if it is already serving segment
  if (!includeCurrentServer && server.isServingSegment(proposalSegment)) {
    return Double.POSITIVE_INFINITY;
  }

  // Don't calculate cost if the server doesn't have enough space or is loading the segment
  if (proposalSegmentSize > server.getAvailableSize() || server.isLoadingSegment(proposalSegment)) {
    return Double.POSITIVE_INFINITY;
  }

  // The contribution to the total cost of a given server by proposing to move the segment to that server is...
  double cost = 0d;

  // the sum of the costs of other (exclusive of the proposalSegment) segments on the server
  cost += computeJointSegmentsCost(
      proposalSegment,
      Iterables.filter(server.getServer().iterateAllSegments(), segment -> !proposalSegment.equals(segment))
  );

  // plus the costs of segments that will be loaded
  cost += computeJointSegmentsCost(proposalSegment, server.getPeon().getSegmentsToLoad());

  // minus the costs of segments that are marked to be dropped
  cost -= computeJointSegmentsCost(proposalSegment, server.getPeon().getSegmentsMarkedToDrop());

  return cost;
}
```
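To make the failure mode concrete, here is a minimal self-contained sketch
(not Druid code; all names are hypothetical). It models an assigner whose
per-server queue cost contribution is always zero because the real queue
drains before the next cost computation, so every decision sees the same base
costs and picks the same server:

```java
import java.util.Arrays;

// Hypothetical sketch: a greedy assigner that re-reads an external load queue
// which has already drained, so the cost never reflects in-flight placements.
public class StaleQueueSketch {
    // Pick the server with the lowest base cost; ties go to the first server.
    public static int pickServer(double[] baseCost) {
        int best = 0;
        for (int i = 1; i < baseCost.length; i++) {
            if (baseCost[i] < baseCost[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] baseCost = {0.9, 1.0, 1.0}; // server 0 is marginally cheaper
        int[] assigned = new int[3];
        // The queue cost contribution is effectively always zero, so the
        // same server wins every time.
        for (int seg = 0; seg < 1000; seg++) {
            assigned[pickServer(baseCost)]++;
        }
        System.out.println(Arrays.toString(assigned)); // all 1000 on server 0
    }
}
```

Even a tiny constant cost advantage is enough: with no feedback from pending
placements, server 0 receives all 1000 segments.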
### Solution
Snapshot the load queue at the beginning: when we start distributing new
segments, create a snapshot of each server's load queue, and every time we
assign a segment to a historical node, update the snapshot to track the
change. This gives the coordinator an accurate view of the load queues, and
therefore more accurate costs for each historical node, throughout the
coordination run.
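The proposed fix can be sketched as follows (again hypothetical names, not
Druid code; the constant `jointCost` stands in for the real pairwise cost
function). The snapshot is updated locally after each placement, so later
decisions see the cost of earlier in-flight assignments even if the real
queue has already drained:

```java
import java.util.Arrays;

// Hypothetical sketch of the snapshot approach: track queue cost per server
// locally and update it after every assignment within the coordination run.
public class SnapshotBalancerSketch {
    public static int[] assignAll(double[] baseCost, int numSegments, double jointCost) {
        double[] queueCost = new double[baseCost.length]; // snapshot, starts empty
        int[] assigned = new int[baseCost.length];
        for (int seg = 0; seg < numSegments; seg++) {
            int best = 0;
            for (int i = 1; i < baseCost.length; i++) {
                if (baseCost[i] + queueCost[i] < baseCost[best] + queueCost[best]) {
                    best = i;
                }
            }
            assigned[best]++;
            queueCost[best] += jointCost; // record the placement in the snapshot
        }
        return assigned;
    }

    public static void main(String[] args) {
        double[] baseCost = {0.9, 1.0, 1.0};
        // Placements now spread roughly evenly across the three servers,
        // instead of all landing on server 0.
        System.out.println(Arrays.toString(assignAll(baseCost, 1000, 0.05)));
    }
}
```

Because each placement raises the snapshot cost of the chosen server, the
cheapest server changes over the run and the skew from the previous sketch
disappears.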