Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2400
  
    @jerrypeng,
    
    I am not sure your intuition is right, but this is all still theoretical 
until we roll it out and see what happens in real life.  We have run some 
simulations, and at least for our simulated load it does not appear to be any 
worse, and in cases where there are GPU-like resources it is a lot better. 
    
    Solving fragmentation should mostly be around matching the ratio of 
resources in a request to the resources left on a node, which is a lot of what 
your initial RAS paper was about.  The problem is that when sorting the nodes 
we prioritize proximity to other parts of the same topology over fixing 
fragmentation.  So fragmentation really only matters when scheduling the first 
executor on a node.  Because of this I don't think it is the size of the 
topology that matters as much as it is the size of the individual components.  
To really solve this we need a way to balance the desire to co-locate 
components against how well the request fills what is left on the node.  I am 
hopeful that when we finish https://issues.apache.org/jira/browse/STORM-2684 
the scheduler will group parts of a topology that give it the biggest win 
within a single "super component" and then, if we need to, we can look at having 
a config that controls when to switch from sorting by co-location to sorting to 
reduce fragmentation, i.e. when do we move on to the next node in the rack even 
if this one is not full because we are starting to run low on resource X.
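    
    To make the trade-off above concrete, here is a rough sketch.  It is not 
the scheduler's code, and every name in it (Node, ratio_mismatch, rank_nodes, 
max_mismatch) is made up for illustration.  It assumes fragmentation can be 
approximated by how unevenly a request would consume the resources left on a 
node, and that the hypothetical config is a single threshold on that unevenness.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    name: str
    free: Dict[str, float]        # resource name -> amount still available
    hosts_topology: bool = False  # topology already has executors on this node


def fits(request: Dict[str, float], node: Node) -> bool:
    """True if the node has enough of every requested resource."""
    return all(node.free.get(r, 0.0) >= amt for r, amt in request.items())


def ratio_mismatch(request: Dict[str, float], node: Node) -> float:
    """Illustrative fragmentation score; lower is better.

    For each resource the node still has, compute the fraction of it the
    request would use.  If all fractions are equal, the request has the same
    resource ratio as the free space and the score is 0.0.  A resource the
    request does not need (e.g. a GPU) counts as fraction 0, so placing
    non-GPU work on a GPU node scores worse.
    """
    fractions = [request.get(r, 0.0) / free
                 for r, free in node.free.items() if free > 0]
    return max(fractions) - min(fractions) if fractions else 0.0


def rank_nodes(request: Dict[str, float], nodes: List[Node],
               max_mismatch: float = 1.0) -> List[Node]:
    """Sort candidate nodes, best first.

    Co-located nodes win as long as their mismatch stays at or below the
    hypothetical max_mismatch threshold; otherwise nodes are ordered purely
    by mismatch.  The default of 1.0 means co-location always wins.
    """
    def key(node: Node):
        mismatch = ratio_mismatch(request, node)
        prefer = node.hosts_topology and mismatch <= max_mismatch
        return (0 if prefer else 1, mismatch)

    return sorted((n for n in nodes if fits(request, n)), key=key)


# Example: node "a" already hosts the topology but is running low on CPU.
nodes = [
    Node("a", {"cpu": 100, "mem": 1000}, hosts_topology=True),
    Node("b", {"cpu": 400, "mem": 1000}),
]
request = {"cpu": 100, "mem": 250}
colocate_first = [n.name for n in rank_nodes(request, nodes)]
fragmentation_first = [n.name for n in rank_nodes(request, nodes, max_mismatch=0.5)]
```

    With the default threshold the co-located node "a" is tried first even 
though the request would use up all of its CPU and leave most of its memory 
stranded.  Lowering the threshold makes the sort move on to node "b", where 
the request matches the ratio of what is left.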
    
    If you have suggestions or want to collaborate on it that would be great, 
but you know how hard it is for us to get legal approval to share too much 
more than just code. So for now we want to try and get this feature rolled out 
and then monitor it to see how it goes and if we need to adjust anything. 
    
    @govind-menon,
    
    Do you have some of the simulation results in a human-readable format we 
can share?  Also, if you have the code we used to run the simulation, putting an 
Apache license on it and posting it would be great so that others can reproduce 
what we have done.

