Github user revans2 commented on the issue:
https://github.com/apache/storm/pull/2400
This is very long and I don't expect everyone to read the whole thing.
@jerrypeng your numbers are off for big vs. small topologies. The scores
will be negative in your example: -0.1 for Topology 1 vs. -0.01 for Topology 2.
The smallest of these scores (-0.1) belongs to the larger topology
(Topology 1), so it would be prioritized first.
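To make the ordering concrete, here is a minimal sketch of "smallest score first". The class and names are illustrative (they mirror the example above), and the score function itself is elided since it is not the point here:

```java
import java.util.*;

// Illustrative sketch only: topologies are sorted by ascending score, so the
// most negative score is scheduled first. The values mirror the example
// above; how the scores are computed is not shown.
public class ScoreOrdering {
    static List<String> order(Map<String, Double> scores) {
        List<String> names = new ArrayList<>(scores.keySet());
        names.sort(Comparator.comparingDouble(scores::get));
        return names;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("Topology 1 (large)", -0.1);
        scores.put("Topology 2 (small)", -0.01);
        // -0.1 < -0.01, so the larger topology sorts to the front.
        System.out.println(order(scores));
    }
}
```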
The algorithm is kind of odd. It was designed with #2385 (Generic RAS) in
mind, specifically GPU support. There were two big goals here. First, we wanted
something that we could apply in a simple way across all types of resources,
no matter how many resources there were. Second, we wanted to prioritize
requests that need a rare resource over requests that need only common
resources. We had a lesser goal of trying to be fair between users with
respect to their guarantees.
Why do we have these goals? We found a rather large problem with resource
fragmentation in our clusters. Up to 20% of the total resources can end up
unusable because a single resource is effectively exhausted on a node
before the others are. GPUs are expensive, and having one of them
sitting idle because we filled the node up with other things is really hard to
explain to a customer that needs the GPU. If you look at #2385 you will see
that @govind-menon has adjusted how nodes are selected for an executor. Each
time an executor is scheduled, the nodes are sorted again. The sorting of the
nodes has also changed, even though that might be hard to see because of some
of the refactoring. It now tries to avoid nodes that have a "rare" resource on
them unless the request needs that resource.
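As a rough illustration of that node-sorting idea (hypothetical names and a single boolean "GPU" flag; this is not the actual code in #2385), nodes carrying the rare resource can be pushed to the back of the candidate list whenever the request does not ask for it:

```java
import java.util.*;

// Hypothetical sketch of the node-selection idea: when an executor does not
// request the rare resource (say, a GPU), nodes that have one sort last so
// they are not fragmented by ordinary work. When the executor does need the
// GPU, those same nodes sort first. Names here are illustrative only.
public class NodeSort {
    static List<String> sortForRequest(Map<String, Boolean> nodeHasGpu,
                                       boolean requestNeedsGpu) {
        List<String> nodes = new ArrayList<>(nodeHasGpu.keySet());
        // false sorts before true: nodes whose GPU-flag matches the request
        // sort to the front, mismatching nodes to the back.
        nodes.sort(Comparator.comparing(
                (String n) -> nodeHasGpu.get(n) != requestNeedsGpu));
        return nodes;
    }

    public static void main(String[] args) {
        Map<String, Boolean> cluster = new LinkedHashMap<>();
        cluster.put("gpu-node", true);
        cluster.put("cpu-node", false);
        // A non-GPU request should land on the cpu-node first.
        System.out.println(sortForRequest(cluster, false));
        // A GPU request should consider the gpu-node first.
        System.out.println(sortForRequest(cluster, true));
    }
}
```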
This algorithm tries to help by handing the other algorithm the topologies
that use rare resources first. "Rare" in this case means the topology uses a
higher percentage of what is in the cluster as a whole. It is not perfect at
this, because it has to take into account the user-given priority of the
topology first, along with guaranteed resources. So it ends up prioritizing
larger topologies that fit within a user's guarantee. But once a user goes
over their guarantee, it treats all users equally and prioritizes users that
have fewer resources requested (hence smaller topologies are likely to show up
ahead of bigger ones). We may adjust this over time as we get more experience
running with these kinds of topologies, and we may end up evicting parts of a
topology to make room for another topology that needs a fragmented resource.
But I would really like to avoid that if at all possible.
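One hypothetical way to express "uses a higher percentage of what is in the cluster" (not the exact math in this patch) is to score a topology by the largest fraction it requests of any cluster-wide resource pool:

```java
import java.util.*;

// Hypothetical rarity measure, not the patch's actual formula: a topology's
// score is the largest fraction of any cluster-wide resource pool that it
// requests. A topology asking for 2 of the cluster's 4 GPUs scores 0.5,
// while one asking for 100 of 10000 CPU cores scores only 0.01, so the
// GPU topology would be handed to the scheduler first.
public class RarityScore {
    static double score(Map<String, Double> requested,
                        Map<String, Double> clusterTotal) {
        double max = 0.0;
        for (Map.Entry<String, Double> e : requested.entrySet()) {
            double total = clusterTotal.getOrDefault(e.getKey(), 0.0);
            if (total > 0) {
                max = Math.max(max, e.getValue() / total);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        Map<String, Double> cluster = Map.of("cpu", 10000.0, "gpu", 4.0);
        System.out.println(score(Map.of("cpu", 100.0, "gpu", 2.0), cluster)); // 0.5
        System.out.println(score(Map.of("cpu", 100.0), cluster));            // 0.01
    }
}
```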
Now to answer your second question about why I removed the eviction
strategy. On our staging clusters we have been running with a FIFO eviction
strategy, for the same reasons described in the FIFO prioritization strategy in
this patch. A few times we saw nimbus get stuck in what looked like an
infinite loop trying to schedule topologies. We were never able to come up
with a reproducible test case, but the APIs were written with each
strategy returning a single item (the next topology to schedule or evict), so
you can imagine how the two strategies could feed off of each other and cause
a loop. We "fixed" the issue internally by limiting the total number of times
a topology will be scheduled before we give up on trying to schedule it. But
that limit is ugly and inefficient. I kept it for this patch, but I am leaning
towards
removing it. Instead, if we have a single source of truth for a total ordering
of topologies, there is no way for it to be inconsistent. We just try to
schedule starting at the most important and evict starting at the least
important. This means there are no unbounded loops in the code, so there is no
way to get stuck in one. With that change we only need one plugin to sort all
of the topologies into a total order, and that is the new prioritization
strategy API. Hence there is no need for a separate eviction strategy.
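The single-total-order idea can be sketched like this (hypothetical types and a toy "capacity" model in place of real cluster resources; the actual strategy API in the patch differs):

```java
import java.util.*;

// Sketch of the "single source of truth" idea: one list of topologies in a
// fixed total order, most important first. We admit from the front; once the
// cluster is full, everything behind stays out, which is exactly "evict from
// the least important end" of the same ordering. Because it is one bounded
// pass over one list, two strategies can never feed off of each other.
public class TotalOrderScheduler {
    static List<String> schedule(List<String> ordered, int capacity) {
        List<String> running = new ArrayList<>();
        for (String topo : ordered) {       // walk from most important
            if (running.size() < capacity) {
                running.add(topo);
            }
            // else: topo (and everything after it in the order) is the
            // eviction candidate set; no separate eviction plugin is needed.
        }
        return running;
    }

    public static void main(String[] args) {
        // With room for two topologies, the two most important are admitted
        // and "C" is implicitly the first eviction candidate.
        System.out.println(schedule(List.of("A", "B", "C"), 2)); // [A, B]
    }
}
```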