Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2400
  
    This is very long and I don't expect everyone to read the whole thing.
    
    @jerrypeng your numbers are off for big vs small topologies.  The scores 
will be negative for your example.
    
    -0.1 for Topology 1 vs -0.01 for Topology 2.  The smaller of these scores 
(-0.1) corresponds to the larger topology (Topology 1), so it would be 
prioritized first.
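
    To make the ordering concrete, here is a minimal sketch (the class and 
score values are illustrative only, not Storm's actual API): topologies are 
sorted by score ascending, so the more negative score wins.

```java
import java.util.*;

// Illustrative sketch only, not Storm's API: lower (more negative) score
// means higher scheduling priority, so we sort ascending by score.
public class ScoreOrdering {
    record Topology(String name, double score) {}

    public static void main(String[] args) {
        List<Topology> topologies = new ArrayList<>(List.of(
            new Topology("Topology 2 (small)", -0.01),
            new Topology("Topology 1 (large)", -0.1)));
        topologies.sort(Comparator.comparingDouble(Topology::score));
        // The larger topology, with the smaller score, sorts first.
        System.out.println(topologies.get(0).name());
    }
}
```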
    
    The algorithm is kind of odd.  It was designed with #2385 (Generic RAS) in 
mind, specifically GPU support.  There were two big goals here.  First, we 
wanted something that we could apply in a simple way across all types of 
resources (no matter how many resources there were), and second, we wanted to 
prioritize requests that need a rare resource over requests that need only 
common resources.  We had a lesser goal of trying to be fair between users 
with respect to their guarantees.
    
    Why do we have these goals? We found a rather large problem with resource 
fragmentation in our clusters.  Up to 20% of the total resources ended up 
unusable because a single resource was effectively exhausted on a node before 
the others could be.  GPUs are expensive, and having one of them sit idle 
because we filled the node up with other things is really hard to explain to a 
customer who needs the GPU.  If you look at #2385 you will see that 
@govind-menon has adjusted how nodes are selected for an executor.  Each time 
an executor is scheduled the nodes are sorted again.  The sorting of the nodes 
has also changed, even though that might be hard to see because of some of the 
refactoring.  It now tries to avoid nodes that have a "rare" resource on them 
unless the request needs that resource.
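
    One way to picture that node sort (this is a simplified sketch of the idea, 
not the code in #2385; the names are made up): when an executor does not need 
the rare resource, nodes that have it sort to the back, so the rare resource 
stays free for requests that do need it.

```java
import java.util.*;

// Simplified sketch of the idea described above, not Storm's actual code.
// Nodes carrying a rare resource (here, a GPU) are avoided unless the
// executor being scheduled actually needs that resource.
public class NodeSort {
    record Node(String name, boolean hasGpu) {}

    static List<Node> sortFor(boolean executorNeedsGpu, List<Node> nodes) {
        List<Node> sorted = new ArrayList<>(nodes);
        // true sorts after false: GPU nodes go last when the GPU is not needed.
        sorted.sort(Comparator.comparing(n -> n.hasGpu() && !executorNeedsGpu));
        return sorted;
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
            new Node("gpu-node", true), new Node("cpu-node", false));
        // A CPU-only executor is steered away from the GPU node...
        System.out.println(sortFor(false, nodes).get(0).name());
        // ...while a GPU executor still sees the GPU node first.
        System.out.println(sortFor(true, nodes).get(0).name());
    }
}
```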
    
    This algorithm tries to help by handing the other algorithm the topologies 
that use rare resources first.  Rare in this case means the topology uses a 
higher percentage of what is in the cluster as a whole.  It is not perfect in 
this, because it has to take into account the user-given priority of the 
topology first, along with guaranteed resources.  So it ends up prioritizing 
larger topologies that fit within a user's guarantee.  But once a user goes 
over their guarantee, it treats all users equally and will prioritize users 
that have fewer resources requested (hence smaller topologies are likely to 
show up ahead of bigger ones).  We may adjust this over time as we get more 
experience running with these kinds of topologies, and we may end up evicting 
parts of a topology to make room for another topology that needs a fragmented 
resource.  But I would really like to avoid that if at all possible.
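
    A rough sketch of that "percentage of the cluster" idea (the formula here 
is an assumption for illustration, not the exact scoring in this patch): score 
each topology by its largest requested fraction of any cluster-wide resource, 
negated while the user is inside their guarantee, so a topology needing a rare 
resource like a GPU sorts ahead of one needing only plentiful CPU.

```java
import java.util.*;

// Hedged sketch of the scoring idea described above; the actual code in
// this patch differs. Assumption: a topology's score is its largest
// requested fraction of any cluster-wide resource, negated within the
// user's guarantee so that smaller scores are scheduled first.
public class ResourceScore {
    static double score(Map<String, Double> requested,
                        Map<String, Double> clusterTotal,
                        boolean withinGuarantee) {
        double maxFraction = 0.0;
        for (var e : requested.entrySet()) {
            maxFraction = Math.max(maxFraction,
                e.getValue() / clusterTotal.get(e.getKey()));
        }
        return withinGuarantee ? -maxFraction : maxFraction;
    }

    public static void main(String[] args) {
        Map<String, Double> total = Map.of("cpu", 1000.0, "gpu", 10.0);
        // 1 of 10 GPUs is a larger fraction than 50 of 1000 CPUs, so the
        // GPU topology gets the smaller (higher-priority) score.
        System.out.println(score(Map.of("gpu", 1.0), total, true));
        System.out.println(score(Map.of("cpu", 50.0), total, true));
    }
}
```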
    
    Now to answer your second question about why I removed the eviction 
strategy.  On our staging clusters we have been running with a FIFO eviction 
strategy for the same reasons described in the FIFO prioritization strategy in 
this patch.  A few times we saw Nimbus get stuck in what looked like an 
infinite loop trying to schedule topologies.  We were never able to come up 
with a reproducible use case, but the APIs were written so that each strategy 
returned a single item (the next one to schedule or evict), and you can 
imagine how the two strategies could feed off of each other and cause a loop.  
We "fixed" the issue internally by limiting the total number of times a 
topology will be scheduled before we give up on trying to schedule it.  But 
that is ugly and inefficient.  I kept it for this patch, but I am leaning 
towards removing it.  Instead, if we have a single source of truth for a total 
ordering of topologies, there is no way for it to be inconsistent.  We just 
try to schedule starting at the most important and evict starting at the 
least important.  This means there are no unbounded loops in the code, so 
there is no way to get stuck in one of those loops.  With that change we only 
need one plugin to sort all of the topologies into a total order, so that is 
the new prioritization strategy API.  Hence there is no need for an eviction 
strategy.
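
    The single-ordering idea can be sketched like this (names and scores are 
illustrative, not the actual plugin interface): one sort produces the total 
order, scheduling walks it from the front, and eviction candidates are simply 
the same list read from the back, so the two can never disagree or loop.

```java
import java.util.*;

// Sketch of the single-source-of-truth ordering described above; names
// are illustrative, not Storm's actual API. One total order is computed
// once, then scheduling and eviction just read it in opposite directions,
// so there is no schedule/evict ping-pong and no unbounded loop.
public class TotalOrder {
    record Topology(String name, double score) {}

    public static void main(String[] args) {
        List<Topology> order = new ArrayList<>(List.of(
            new Topology("B", -0.01),
            new Topology("C", 0.2),
            new Topology("A", -0.1)));
        order.sort(Comparator.comparingDouble(Topology::score));

        List<String> names = order.stream().map(Topology::name).toList();
        System.out.println("schedule order: " + names);
        List<String> evict = new ArrayList<>(names);
        Collections.reverse(evict);
        System.out.println("evict order: " + evict);
    }
}
```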

