Hi Phil,

What I used in the past for this kind of thing was to rely on the timeout
of the pool plus, in a health check external to the pool, some trigger
(the simplest was "if 5 health checks fail without any success in
between", for example). That trigger spawns a task (think of a thread,
even if it goes through an executor, but guarantee a slot for this task)
which retries at a faster pace (instead of every 30s, 5 times in a row -
the number was tunable but 5 was my default).
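A minimal sketch of such a trigger, assuming nothing from dbcp/pool itself (class and method names here are hypothetical; the threshold and executor wiring are just my defaults):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: count consecutive health-check failures and, once the
// threshold is hit, hand a fast-retry task to a dedicated executor so it is
// guaranteed a slot to run.
class HealthCheckTrigger {
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private final int threshold; // 5 was my default
    private final Runnable fastRetry; // retries faster than the regular 30s cadence
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    HealthCheckTrigger(final int threshold, final Runnable fastRetry) {
        this.threshold = threshold;
        this.fastRetry = fastRetry;
    }

    // called after each regular health check
    void onCheckResult(final boolean success) {
        if (success) {
            consecutiveFailures.set(0); // any success resets the streak
            return;
        }
        if (consecutiveFailures.incrementAndGet() == threshold) {
            executor.execute(fastRetry); // spawn the faster retry loop once
        }
    }
}
```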
If the database is still detected as down - as opposed to merely
overloaded or the like - it is considered down and a task is spawned
which retries every 30 seconds. When the database comes back - I added
some business checks there; the idea is to verify not just the connection
but that the tables are accessible, because after such a downtime the db
often does not come back all at once - just destroy/recreate the pool.
The destroy/recreate was handled with a DataSource proxy in front of the
pool, swapping the delegate.
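The proxy part could look like this - a sketch only, with a hypothetical class name; the point is that callers keep one stable DataSource reference while the recovery task swaps the pool underneath:

```java
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.logging.Logger;
import javax.sql.DataSource;

// Hypothetical DataSource proxy: every call forwards to the current delegate,
// so destroy/recreate of the pool is just an atomic swap of that delegate.
class SwappableDataSource implements DataSource {
    private final AtomicReference<DataSource> delegate = new AtomicReference<>();

    SwappableDataSource(final DataSource initial) {
        delegate.set(initial);
    }

    // called by the recovery task once the db is validated as back
    void swap(final DataSource newPool) {
        final DataSource old = delegate.getAndSet(newPool);
        // close/destroy 'old' here if it is a closeable pool implementation
    }

    @Override
    public Connection getConnection() throws SQLException {
        return delegate.get().getConnection();
    }

    @Override
    public Connection getConnection(final String user, final String password) throws SQLException {
        return delegate.get().getConnection(user, password);
    }

    // the remaining methods simply forward to the current delegate
    @Override public PrintWriter getLogWriter() throws SQLException { return delegate.get().getLogWriter(); }
    @Override public void setLogWriter(final PrintWriter out) throws SQLException { delegate.get().setLogWriter(out); }
    @Override public void setLoginTimeout(final int seconds) throws SQLException { delegate.get().setLoginTimeout(seconds); }
    @Override public int getLoginTimeout() throws SQLException { return delegate.get().getLoginTimeout(); }
    @Override public Logger getParentLogger() throws SQLFeatureNotSupportedException { return delegate.get().getParentLogger(); }
    @Override public <T> T unwrap(final Class<T> iface) throws SQLException { return delegate.get().unwrap(iface); }
    @Override public boolean isWrapperFor(final Class<?> iface) throws SQLException { return delegate.get().isWrapperFor(iface); }
}
```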
Indeed it is not magic inside the pool, but it can only work better than
a pool-internal solution because you can integrate it with your existing
checks and add more advanced ones - if you have JPA, just run a fast
query on any table to validate the db is back, for example.
In the end the code is pretty simple and has another big advantage: you
can circuit-break the database completely while you consider it down,
letting through only 10% - or whatever ratio you want - of the requests
(a kind of canary testing which avoids putting too much pressure on the
pool).
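That canary-style breaker fits in a few lines too - again a hypothetical sketch, with the ratio and naming being my assumptions:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical circuit breaker: while the db is considered down, only a small
// ratio of requests is allowed to hit the pool (canary testing); the rest
// fail fast instead of parking on the pool.
class DbCircuitBreaker {
    private final AtomicBoolean dbDown = new AtomicBoolean(false);
    private final double canaryRatio; // e.g. 0.1 to let 10% through

    DbCircuitBreaker(final double canaryRatio) {
        this.canaryRatio = canaryRatio;
    }

    void markDown() { dbDown.set(true); }   // set by the health check trigger
    void markUp() { dbDown.set(false); }    // set by the recovery task

    /** true if the request may hit the pool, false if it should fail fast. */
    boolean allowRequest() {
        if (!dbDown.get()) {
            return true; // db considered healthy: no throttling at all
        }
        return ThreadLocalRandom.current().nextDouble() < canaryRatio;
    }
}
```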

I guess it is not exactly the answer you expected, but I think it can be
a good solution and ultimately it could sit in a new package in dbcp or
the like?

Best,
Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>


On Tue, Feb 13, 2024 at 21:11 Phil Steitz <phil.ste...@gmail.com> wrote:

> POOL-407 tracks a basic liveness problem that we have never been able to
> solve:
>
> A factory "goes down" resulting in either failed object creation or failed
> validation during the outage.  The pool has capacity to create, but the
> factory fails to serve threads as they arrive, so they end up parked
> waiting on the idle object pool.  After a possibly very brief interruption,
> the factory heals itself (maybe a database comes back up) and the waiting
> threads can be served, but until other threads arrive, get served and
> return instances to the pool, the parked threads remain blocked.
> Configuring minIdle and pool maintenance (timeBetweenEvictionRuns > 0) can
> improve the situation, but running the evictor at high enough frequency to
> handle every transient failure is not a great solution.
>
> I am stuck on how to improve this.  I have experimented with the idea of a
> ResilientFactory, placing the responsibility on the factory to know when it
> is down and when it comes back up and, when it does, to keep calling its
> pool's create as long as it has take waiters and capacity; but I am not
> sure that is the best approach.  The advantage of this is that
> resource-specific failure and recovery-detection can be implemented.
>
> Another option that I have played with is to have the pool keep track of
> factory failures and when it observes enough failures over a long enough
> time, it starts a thread to do some kind of exponential backoff to keep
> retrying the factory.  Once the factory comes back, the recovery thread
> creates as many instances as it can without exceeding capacity and adds
> them to the pool.
>
> I don't really like either of these.  Anyone have any better ideas?
>
> Phil
>
