Github user revans2 commented on the issue:
https://github.com/apache/storm/pull/2618
OK, good, I do understand the problem.
There are a few ways I can see to make the stack trace much less likely to
show up in the common case. The following are in my preferred order, but I am
open to other ideas.
1) We don't delete the blobs on the nimbus side for a while after we kill
the topology.
Currently we delete the blobs on a timer that runs every 10 seconds by
default, and I would have to trace through the code, but I think we may do some
other deletions before that happens. If instead we kept a separate map (TOPO_X
can be cleaned up after time Y), then when cleanup runs it could check that map
and only delete a topology's blobs if the topology is not in the map, or if it
is there and the time has passed. (There is a rough sketch of this after the
list of options.)
2) We don't output the stack trace until the download has failed some number
of times in a row. This would mean we would still output the error if the blob
was deleted when it should not have been, but it would not look like an error
until the blob had been gone for 1 or 2 seconds, hopefully long enough for the
workers to actually have been killed. (Second sketch below.)
3) We have the supervisor inform the AsyncLocalizer about topologies that
are in the process of being killed.
Right now part of the issue with the race is that killing a worker can take
a non-trivial amount of time, which makes the window in which the race can
happen much larger. If the supervisors told the AsyncLocalizer as soon as they
knew a topology was being killed, it could then suppress errors for any
topology in the process of being killed. The issue here is that informing the
supervisors happens in a background thread and is not guaranteed to happen, so
it might not work as often as we would like. (Third sketch below.)
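To make option 1 a bit more concrete, here is a minimal sketch of the kind of
bookkeeping I have in mind. None of these class or method names exist today;
`deleteBlobsFor` stands in for whatever the real nimbus cleanup code does, and
the grace period is just an example value.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/**
 * Sketch of option 1: remember when each killed topology's blobs may be
 * removed, and have the existing cleanup timer honor that map.
 */
public class DelayedBlobCleanup {
    // topology id -> earliest time (millis) at which its blobs may be deleted
    private final Map<String, Long> cleanupAfter = new ConcurrentHashMap<>();
    private final long graceMs;
    private final Consumer<String> deleteBlobsFor; // hypothetical hook into the real deletion code

    public DelayedBlobCleanup(long graceMs, Consumer<String> deleteBlobsFor) {
        this.graceMs = graceMs;
        this.deleteBlobsFor = deleteBlobsFor;
    }

    /** Called when a topology is killed: record when its blobs become deletable. */
    public void onTopologyKilled(String topoId) {
        cleanupAfter.put(topoId, System.currentTimeMillis() + graceMs);
    }

    /** Called from the periodic cleanup timer with the topologies it wants to remove. */
    public void cleanup(Iterable<String> deadTopologies) {
        long now = System.currentTimeMillis();
        for (String topoId : deadTopologies) {
            Long deadline = cleanupAfter.get(topoId);
            // Clean up if we never recorded a deadline for it, or the grace period has passed.
            if (deadline == null || now >= deadline) {
                deleteBlobsFor.accept(topoId);
                cleanupAfter.remove(topoId);
            }
        }
    }
}
```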
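For option 2, something along these lines, again with made-up names and an
arbitrary threshold of 3 consecutive failures; the real code would go through
our normal logging rather than `System.err`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch of option 2: only log the full stack trace after the same blob
 * download has failed a few times in a row.
 */
public class FailureThresholdLogger {
    private static final int FAILURES_BEFORE_STACK_TRACE = 3; // assumed threshold
    private final Map<String, AtomicInteger> consecutiveFailures = new ConcurrentHashMap<>();

    /** Call when a blob download attempt fails. */
    public void onFailure(String blobKey, Exception e) {
        int count = consecutiveFailures
            .computeIfAbsent(blobKey, k -> new AtomicInteger())
            .incrementAndGet();
        if (count >= FAILURES_BEFORE_STACK_TRACE) {
            // Persistent failure: this looks like a real problem, show the whole trace.
            System.err.println("Could not download " + blobKey + " after " + count + " attempts");
            e.printStackTrace();
        } else {
            // Probably just the kill/cleanup race: note it quietly and let the retry happen.
            System.err.println("Could not download " + blobKey + " (attempt " + count + "), will retry");
        }
    }

    /** Call when a blob download succeeds, so the next failure starts counting from zero. */
    public void onSuccess(String blobKey) {
        consecutiveFailures.remove(blobKey);
    }
}
```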
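And for option 3, a sketch of the supervisor-to-localizer handshake. The method
names are hypothetical; the point is just a shared set of "being killed"
topology ids that the error path consults before logging anything scary.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of option 3: the supervisor tells the localizer which topologies are
 * being killed, and the localizer stays quiet about those.
 */
public class KillAwareErrorReporting {
    // Topologies the supervisor has told us are going away.
    private final Set<String> beingKilled = ConcurrentHashMap.newKeySet();

    /** Supervisor calls this as soon as it learns a topology is being killed. */
    public void topologyKillStarted(String topoId) {
        beingKilled.add(topoId);
    }

    /** Supervisor calls this once all of the topology's workers are gone. */
    public void topologyKillFinished(String topoId) {
        beingKilled.remove(topoId);
    }

    /** Localizer calls this when a blob download for a topology fails. */
    public void reportDownloadFailure(String topoId, Exception e) {
        if (beingKilled.contains(topoId)) {
            // Expected during shutdown: the blobs were deleted on purpose, so don't alarm anyone.
            return;
        }
        e.printStackTrace();
    }
}
```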
---