mounikanakkala opened a new issue #10940:
URL: https://github.com/apache/druid/issues/10940
### Affected Version
0.20.0
### Description
The Coordinator picks only a few Historicals in a tier and loads almost all of
the data onto them. The rest are left almost empty.
### Series of events
- We have a tier which had four historicals to begin with. All the four
reached 88% disk storage.
- We added two Historicals with exactly the same configuration. Rebalancing
started, and after 40 minutes the maximum disk usage on any instance was 67%.
- In the next 40 minutes, the Coordinator continued to rebalance by draining
data from the existing Historicals and adding it to the two newly added
Historicals. Disk usage on the six Historicals was then 40%, 32%, 99.4%, 99.5%,
48%, 37%. It didn't stop at 99.4%; it kept adding data to the same two
instances, which were almost 100% full.
- We added two more instances. Rebalancing started to happen again
- We now have 8 instances in total. In the current state, 4 instances are at
84% disk usage and the other 4 are at 5%.
- We observed that the Coordinator was draining segments from four of the
instances and distributing them among the other four.
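
To put a number on the imbalance in the third step, here is a quick calculation over the percentages quoted above. It is only a sketch for illustration: `utilization_spread` is a helper name we made up, not anything from Druid.

```python
from statistics import mean

def utilization_spread(percent_used):
    """Return (min, max, mean, max - min) for a list of disk-usage percentages."""
    lo, hi = min(percent_used), max(percent_used)
    return lo, hi, mean(percent_used), hi - lo

# Disk usage per Historical after the second 40-minute rebalancing window,
# copied from the list of events above.
after_second_window = [40, 32, 99.4, 99.5, 48, 37]

lo, hi, avg, spread = utilization_spread(after_second_window)
print(f"min={lo}% max={hi}% mean={avg:.1f}% spread={spread:.1f} points")
```

The gap between the fullest and emptiest Historical is far larger than we would expect from a tier of identically configured nodes.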
### Balancing Strategy
We were reading about the Coordinator balancing strategy and have some
questions about it:
- Why did the Coordinator place most of the data on only 4 Historicals when we
have 8 Historicals in total?
- Under the "Balancing segment load" section in the [Coordinator documentation
page](https://druid.apache.org/docs/latest/design/coordinator.html), we found
the following sentence:
`For every Historical process tier in the cluster, the Coordinator process
will determine the Historical process with the highest utilization and the
Historical process with the lowest utilization. The percent difference in
utilization between the two processes is computed, and if the result exceeds a
certain threshold, a number of segments will be moved from the highest utilized
process to the lowest utilized process.`
Can you please explain what that threshold is and where we can find its
value? We tried to go through the Druid code on GitHub but could not find it.
Is this threshold the reason why only 4 instances hold most of the data?
- We did not set `druid.coordinator.balancer.strategy` explicitly, so we must
be using the default, which is `cost` on version 0.20.0.
- We know that `cachingCost` and `diskNormalized` are two other options. If we
want segments distributed evenly, we might have to use `diskNormalized`. Are
there any known problems with this setting? Do we need to change other
configurations so that `diskNormalized` works well?
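
To make the threshold question concrete, this is how we read the quoted documentation sentence, written as a small Python sketch. It is our interpretation of the prose only, not Druid's actual code. `THRESHOLD_PERCENT` is a made-up placeholder (the real value is what we are asking about), `pick_move` is a hypothetical name, and we assume "percent difference" means the gap in percentage points between the most and least utilized Historical.

```python
# Hypothetical placeholder; we could not find the real threshold value.
THRESHOLD_PERCENT = 10.0

def pick_move(utilization_by_server, threshold=THRESHOLD_PERCENT):
    """Return (source, destination) if the tier looks unbalanced, else None.

    utilization_by_server maps a server name to its disk usage in percent.
    """
    most_used = max(utilization_by_server, key=utilization_by_server.get)
    least_used = min(utilization_by_server, key=utilization_by_server.get)
    # Assumed meaning of "percent difference": gap in percentage points.
    difference = utilization_by_server[most_used] - utilization_by_server[least_used]
    if difference > threshold:
        return most_used, least_used
    return None

# Simplified version of our current tier: some nodes at 84%, some at 5%.
tier = {"h1": 84.0, "h2": 84.0, "h3": 5.0, "h4": 5.0}
print(pick_move(tier))
```

Under this reading, segments should always flow from the fullest Historical to the emptiest one, so the instances at 99.4% and 99.5% should have been sources rather than destinations. That is the part of the observed behavior we cannot explain.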