> Instead of resending the message to the pool immediately, it just waits in the runbuffer, and the runbuffer is processed in reaction to any potential change in resources: NeedWork, ContainerRemoved, etc. This may add delay to any buffered message(s), but seems to avoid the catastrophic crash in our systems.
This makes sense: from my recollection, rescheduling is really an indication that something is going badly at the container level. A better solution might be to reschedule the request to another invoker according to some fairness criterion (though that is not easily guaranteed with the current architecture). (This is testing my memory, but...) we used to see these reschedules in a previous incarnation of the scheduler as a precursor to the Docker daemon going out to lunch. (Markus might remember better.) -r
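
For reference, here is a minimal sketch of the buffering behavior described in the quoted text, as I understand it. It is a toy model and not the actual pool code: `PoolSketch`, `on_run`, `on_resource_change`, and the slot counter are all hypothetical names, and treating every resource event (NeedWork, ContainerRemoved, etc.) as "one slot freed" is an assumption made only for illustration.

```python
from collections import deque


class PoolSketch:
    """Toy model (hypothetical) of a pool that buffers runs it cannot place."""

    def __init__(self, free_slots):
        self.free_slots = free_slots  # assumed stand-in for available resources
        self.run_buffer = deque()     # messages waiting for resources
        self.started = []             # messages that were handed a slot

    def on_run(self, msg):
        # Preserve arrival order: if anything is already buffered, or there
        # is no capacity, the message waits instead of being resent.
        if self.run_buffer or self.free_slots == 0:
            self.run_buffer.append(msg)
        else:
            self._start(msg)

    def on_resource_change(self, freed=1):
        # Assumed to be called for events such as NeedWork or ContainerRemoved.
        self.free_slots += freed
        self._drain()

    def _drain(self):
        # Process buffered messages only while capacity remains.
        while self.run_buffer and self.free_slots > 0:
            self._start(self.run_buffer.popleft())

    def _start(self, msg):
        self.free_slots -= 1
        self.started.append(msg)
```

In this model a buffered message is only retried when a resource event arrives, so it can wait longer, but there is no tight resend loop while the pool is full.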