Github user revans2 commented on the issue:
https://github.com/apache/storm/pull/2400
@jerrypeng,
I am not sure your intuition is right, but this is all still theoretical
until we roll it out and see what happens in real life. We have run some
simulations, and at least for our simulated load it does not appear to be any
worse, and in cases where there are GPU-like resources it is a lot better.
Solving fragmentation should mostly be around matching the ratio of
resources in a request to the resources left on a node, which is a lot of what
your initial RAS paper was about. The problem is that when sorting the nodes
we prioritize proximity to other parts of the same topology over fixing
fragmentation. So fragmentation really only matters when scheduling the first
executor on a node. Because of this I don't think it is the size of the
topology that matters as much as it is the size of the individual components.
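To make the ratio-matching idea concrete, here is a minimal sketch. It is not the scheduler's actual code: the function name `ratio_mismatch`, the resource names, and the scoring rule (spread between the largest and smallest fraction of remaining resources the request would consume) are all assumptions chosen for illustration.

```python
def ratio_mismatch(request, remaining):
    """Score how badly a request's resource ratio matches what is left on a node.

    Hypothetical helper, not part of Storm. Both arguments are dicts mapping
    a resource name to an amount. Returns None if the request does not fit,
    otherwise a value in [0, 1] where 0.0 means the request consumes every
    remaining resource in the same proportion.
    """
    fractions = []
    for name in set(request) | set(remaining):
        need = request.get(name, 0.0)
        left = remaining.get(name, 0.0)
        if need > left:
            return None  # the request does not fit on this node
        if left > 0:
            # Fraction of this remaining resource the request would use.
            fractions.append(need / left)
    if not fractions:
        return 0.0
    # A large spread means one resource is used up much faster than another,
    # which is what leaves fragmented, unusable leftovers.
    return max(fractions) - min(fractions)


request = {"cpu": 100, "mem": 1024}
plain_node = {"cpu": 400, "mem": 4096}
gpu_node = {"cpu": 400, "mem": 4096, "gpu": 2}
small_node = {"cpu": 50, "mem": 4096}

score_plain = ratio_mismatch(request, plain_node)
score_gpu = ratio_mismatch(request, gpu_node)
score_small = ratio_mismatch(request, small_node)
```

Under this assumed scoring, a request that needs no GPU scores worse on the node with free GPUs than on the plain node, so it would be steered away from the GPU node and leave it for requests that need it.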
To really solve this we need a way to balance the desire to co-locate
components with how well this request fills what is left on the node. I am
hopeful that when we finish https://issues.apache.org/jira/browse/STORM-2684
the scheduler will group parts of a topology that give it the biggest win
within a single "super component" and then if we need to we can look at having
a config that controls when to switch from sorting by co-location to sorting to
reduce fragmentation, i.e. when do we move on to the next node in the rack even
if this one is not full because we are starting to run low on resource X.
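One possible shape for such a config, as a rough sketch only: nothing here exists in Storm today. The name `low_resource_threshold`, the node dict layout, and the rule itself (push a node to the back of the ordering once any of its resources drops below a fraction of its total) are assumptions made for illustration.

```python
def order_nodes(nodes, low_resource_threshold):
    """Order candidate nodes for the next executor of a topology.

    Hypothetical sketch, not Storm's scheduler. Each node is a dict with:
      'name'      - node identifier
      'colocated' - executors of this topology already on the node
      'total'     - dict of resource name -> total capacity
      'remaining' - dict of resource name -> unallocated amount
    """
    def min_remaining_fraction(node):
        # Smallest remaining/total fraction across the node's resources.
        return min(
            (node["remaining"].get(name, 0.0) / total
             for name, total in node["total"].items() if total > 0),
            default=1.0,
        )

    def sort_key(node):
        running_low = min_remaining_fraction(node) < low_resource_threshold
        # Nodes that are not running low sort first (False < True); within
        # each group, prefer the node with more co-located executors.
        return (running_low, -node["colocated"])

    return sorted(nodes, key=sort_key)


total = {"cpu": 400, "mem": 4096}
nodes = [
    {"name": "a", "colocated": 5, "total": total,
     "remaining": {"cpu": 20, "mem": 3000}},
    {"name": "b", "colocated": 1, "total": total,
     "remaining": {"cpu": 300, "mem": 1000}},
    {"name": "c", "colocated": 0, "total": total,
     "remaining": {"cpu": 400, "mem": 4096}},
]

with_switch = [n["name"] for n in order_nodes(nodes, 0.1)]
colocate_only = [n["name"] for n in order_nodes(nodes, 0.0)]
```

With the threshold at 0.0 the ordering is purely by co-location; raising it makes the sort move on from node `a`, which is nearly out of CPU, even though it holds the most executors of the topology.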
If you have suggestions or want to collaborate on it that would be great,
but you know how hard it is for us to get legal approval to share too much
more than just code. So for now we want to try and get this feature rolled out
and then monitor it to see how it goes and if we need to adjust anything.
@govind-menon,
Do you have some of the simulation results in a human-readable format we
can share? Also if you have the code we used to run the simulation, putting an
Apache license on it and posting it would be great so that others can reproduce
what we have done.
---