[pool] Recovering from transient factory outages

Phil Steitz Tue, 13 Feb 2024 12:12:22 -0800

POOL-407 tracks a basic liveness problem that we have never been able to
solve:


A factory "goes down" resulting in either failed object creation or failed
validation during the outage.  The pool has capacity to create, but the
factory fails to serve threads as they arrive, so they end up parked
waiting on the idle object pool.  After a possibly very brief interruption,
the factory heals itself (maybe a database comes back up) and the waiting
threads can be served, but until other threads arrive, get served and
return instances to the pool, the parked threads remain blocked.
Configuring minIdle and pool maintenance (timeBetweenEvictionRuns > 0) can
improve the situation, but running the evictor at high enough frequency to
handle every transient failure is not a great solution.

I am stuck on how to improve this.  I have experimented with the idea of a
ResilientFactory, which makes the factory responsible for knowing when it
is down and when it comes back up.  When it comes back up, it keeps calling
its pool's create as long as the pool has take waiters and capacity.  I am
not sure that is the best approach, though.  Its advantage is that
resource-specific failure and recovery detection can be implemented.
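
To make the ResilientFactory idea concrete, here is a minimal sketch of the
refill loop.  It is not Commons Pool code: RefillablePool,
ResilientFactorySketch, markDown and checkAndRefill are all hypothetical
names, and the health probe is assumed to be supplied by whoever knows the
underlying resource.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical view of what the factory needs from its pool; not a Commons Pool interface. */
interface RefillablePool {
    int takeWaiters();              // threads currently parked waiting for an instance
    boolean hasCapacity();          // true while the pool may still create instances
    void create() throws Exception; // create one instance and add it to the idle pool
}

/** Hypothetical sketch of the factory-side recovery logic. */
final class ResilientFactorySketch {
    private final RefillablePool pool;
    private final BooleanSupplier resourceIsUp; // resource-specific probe (assumption)
    private boolean down;

    ResilientFactorySketch(RefillablePool pool, BooleanSupplier resourceIsUp) {
        this.pool = pool;
        this.resourceIsUp = resourceIsUp;
    }

    /** Called by the factory when a create or validate call fails. */
    synchronized void markDown() {
        down = true;
    }

    synchronized boolean isDown() {
        return down;
    }

    /**
     * Called periodically while the factory is down.  Once the probe reports
     * the resource is back, keeps calling the pool's create as long as there
     * are take waiters and capacity.  Returns the number of instances created.
     */
    synchronized int checkAndRefill() {
        if (!down || !resourceIsUp.getAsBoolean()) {
            return 0;
        }
        int created = 0;
        while (pool.takeWaiters() > 0 && pool.hasCapacity()) {
            try {
                pool.create();
                created++;
            } catch (Exception e) {
                return created; // still failing; stay marked down and retry later
            }
        }
        down = false;
        return created;
    }
}
```

Something still has to call checkAndRefill on a schedule while the factory
is down, so this moves the polling cost into the factory rather than
removing it.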

Another option that I have played with is to have the pool keep track of
factory failures.  When it observes enough failures over a long enough
time, it starts a thread that keeps retrying the factory with some kind of
exponential backoff.  Once the factory comes back, the recovery thread
creates as many instances as it can without exceeding capacity and adds
them to the pool.
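
A rough sketch of what the recovery thread's loop could look like, leaving
out the failure tracking that decides when to start it.  BackoffRecovery is
a hypothetical class, not part of Commons Pool; the Callable stands in for
the factory's create step, the BlockingQueue stands in for the idle object
pool, and the delay parameters are assumptions.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;

/** Hypothetical recovery task; a sketch only, not part of Commons Pool. */
final class BackoffRecovery<T> {
    private final Callable<T> factory;   // stands in for the factory's create step
    private final BlockingQueue<T> idle; // stands in for the pool's idle instances
    private final int capacity;          // instances the pool may still create
    private final long initialDelayMs;   // first retry delay
    private final long maxDelayMs;       // cap on the retry delay

    BackoffRecovery(Callable<T> factory, BlockingQueue<T> idle, int capacity,
                    long initialDelayMs, long maxDelayMs) {
        this.factory = factory;
        this.idle = idle;
        this.capacity = capacity;
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /**
     * Retries the factory, doubling the delay after each failure, until
     * capacity instances have been created and added to the idle queue.
     * Returns the number created; returns early if interrupted.
     */
    int recover() {
        long delay = initialDelayMs;
        int created = 0;
        while (created < capacity) {
            T instance;
            try {
                instance = factory.call();
            } catch (Exception e) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return created;
                }
                delay = Math.min(delay * 2, maxDelayMs); // exponential backoff, capped
                continue;
            }
            idle.offer(instance);   // hand the new instance to waiting threads
            created++;
            delay = initialDelayMs; // factory is healthy again; reset the backoff
        }
        return created;
    }
}
```

In a real pool the capacity check would have to be made against live
counters rather than a fixed number, since other threads can create or
return instances while the recovery thread is running.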

I don't really like either of these.  Anyone have any better ideas?

Phil
