Hi

I don't know whether Java's "reachability" of response objects would be
a good indicator of a connection that has erroneously not been returned
to the pool. If that is the case (and the memory leak described in this
thread makes me think it might be), the information below may be helpful.

I'll share some (positive) experience from the Apache Sling project
with an approach that may work here as well. Sling is a web framework
built on top of JCR (the Java Content Repository; think of it as a
hierarchical database). For each request, a ResourceResolver is
created based on user credentials (often anonymous). The
ResourceResolver (RR) needs to be closed at the end of the request; so
far the framework itself can manage the lifecycle of the RR - no
problem. However, it is also possible to obtain an RR in components
that implement backend logic, e.g. a scheduled task performing
maintenance or reporting. In these cases the responsibility for
closing the RR lies with the programmer implementing that piece of
backend logic. Of course, that is often forgotten and leads to a
memory leak (leaked RR instances), as in the sketch below.
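
The failure mode looks roughly like this (just a sketch; MaintenanceTask
is a made-up component, and I'm assuming the usual
ResourceResolverFactory.getServiceResourceResolver entry point):

    import org.apache.sling.api.resource.LoginException;
    import org.apache.sling.api.resource.ResourceResolver;
    import org.apache.sling.api.resource.ResourceResolverFactory;

    public class MaintenanceTask implements Runnable {

        private final ResourceResolverFactory factory;

        public MaintenanceTask(ResourceResolverFactory factory) {
            this.factory = factory;
        }

        @Override
        public void run() {
            try {
                // Bug: the RR is obtained but never closed, so it leaks.
                ResourceResolver resolver =
                        factory.getServiceResourceResolver(null);
                // ... perform maintenance using resolver ...
                // The fix is try-with-resources:
                // try (ResourceResolver rr =
                //         factory.getServiceResourceResolver(null)) { ... }
            } catch (LoginException e) {
                throw new IllegalStateException(e);
            }
        }
    }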

Solving this issue with a finalize method would work, but finalize is
known to require synchronization and to be rather slow (and it has
since been deprecated). Instead we implemented a mechanism that closes
unclosed RR instances using java.lang.ref.ReferenceQueue and
WeakReference, see
https://github.com/apache/sling-org-apache-sling-resourceresolver/blob/30f7d91c6c99dc674fe361060aaa90f6e0252a6a/src/main/java/org/apache/sling/resourceresolver/impl/CommonResourceResolverFactoryImpl.java#L525

The trick is to encapsulate all state that needs cleanup in one object
and then to hand out a second object that references that state but
has no (relevant) state of its own - a bit like a facade. When the
facade gets GC'ed, the WeakReference registered for it (which also
holds a strong reference to the state object) is enqueued in the
ReferenceQueue, and all resources can be closed. In Sling we also
offer the option to log the "opening stacktrace" when an RR is closed
by this fallback mechanism, which helps identify the places in the
source code where RRs aren't correctly closed.
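
In (hypothetical) code the pattern looks roughly like this - the names
ConnectionState, ConnectionFacade, StateReference and
LeakDetectingFactory are made up for illustration, the linked Sling
class being the production version:

    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Everything that must be released lives here; it deliberately does
    // NOT reference the facade, so the facade can be collected on its own.
    final class ConnectionState {
        final Exception openedAt = new Exception("opening stacktrace");

        void close() {
            // release sockets, buffers, pool permits, ...
        }
    }

    // The thin object handed out to callers; it keeps the state alive
    // for exactly as long as the caller holds it.
    final class ConnectionFacade implements AutoCloseable {
        private final ConnectionState state;

        ConnectionFacade(ConnectionState state) {
            this.state = state;
        }

        @Override
        public void close() {
            state.close();
        }
    }

    // A WeakReference to the facade that also carries a strong reference
    // to the state, so the state is still reachable after the facade has
    // been collected.
    final class StateReference extends WeakReference<ConnectionFacade> {
        final ConnectionState state;

        StateReference(ConnectionFacade facade, ConnectionState state,
                       ReferenceQueue<ConnectionFacade> queue) {
            super(facade, queue);
            this.state = state;
        }
    }

    final class LeakDetectingFactory {

        private final ReferenceQueue<ConnectionFacade> queue =
                new ReferenceQueue<>();
        // The references themselves must stay strongly reachable,
        // otherwise they would never be enqueued.
        private final Set<StateReference> live =
                ConcurrentHashMap.newKeySet();

        ConnectionFacade open() {
            ConnectionState state = new ConnectionState();
            ConnectionFacade facade = new ConnectionFacade(state);
            live.add(new StateReference(facade, state, queue));
            return facade;
        }

        // Run this periodically from a background thread. A production
        // version would also deregister the reference on a regular
        // close() and make ConnectionState.close() idempotent.
        void reapLeaked() {
            Reference<? extends ConnectionFacade> ref;
            while ((ref = queue.poll()) != null) {
                StateReference stateRef = (StateReference) ref;
                live.remove(stateRef);
                // Leak detected: log where it was opened, then clean up.
                stateRef.state.openedAt.printStackTrace();
                stateRef.state.close();
            }
        }
    }

Creating the "openedAt" exception on every open isn't free, which is
why in Sling the opening-stacktrace diagnostic is optional.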

Hope this helps.

Regards
Julian

On Mon, Aug 3, 2020 at 7:39 PM Gary Gregory <garydgreg...@gmail.com> wrote:
>
> Hi Carter,
>
> I don't have a super strong feeling for this either. I just would like us
> to try and reuse a library before reinventing the wheel, but my time is
> limited, like everyone else.
>
> Gary
>
> On Mon, Aug 3, 2020, 13:19 Carter Kozak <cko...@ckozak.net> wrote:
>
> > Hi Gary,
> >
> > That sounds reasonable, however there are some subtle distinctions
> > between general purpose object pools and those used for http
> > connections, perhaps the most obvious of which is http/2 multiplexed
> > connections. It's certainly possible that it could work, but I imagine it
> > could become a large project and may result in something very
> > specific to http clients. Pools used by blocking and asynchronous
> > clients may also differ substantially, so I'm not sure how general this
> > implementation could be.
> >
> > I'd be happy to help in any way I can if folks more familiar than myself
> > with the current hc pool implementations are supportive of your proposal,
> > but I'm worried that it may become a blocker that prevents incremental
> > improvements.
> >
> > In conclusion, I don't feel comfortable taking a stance on this and I trust
> > the team to make the right decision.
> >
> > -ck
> >
> > On Fri, Jul 31, 2020, at 17:38, Gary Gregory wrote:
> > > What I would prefer we do is not reinvent yet another pooling feature,
> > > all its potential bugs and added maintenance load when that wheel's been
> > > invented over and over. For me personally, I would prefer that this
> > > project adopt a FOSS pooling library and improve _it_ if that were
> > > needed, thereby improving the whole ecosystem.
> > >
> > > Gary
> > >
> > > On Fri, Jul 31, 2020, 16:22 Carter Kozak <cko...@ckozak.net> wrote:
> > >
> > > > On Fri, Jul 31, 2020, at 14:31, Gary Gregory wrote:
> > > > > On Fri, Jul 31, 2020 at 12:45 PM Oleg Kalnichevski <ol...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> > > > > > > Hello Friends,
> > > > > > >
> > > > > > > I recently had a service crash due to an out of memory error as
> > > > > > > the result of a response leak. While this is caused by code that
> > > > > > > uses closeable resources incorrectly, and that code should be
> > > > > > > fixed, I'd like to discuss some ideas which could avoid
> > > > > > > catastrophic failure in this scenario.
> > > > > > >
> > > > > > > A little background on my environment: I maintain an RPC library
> > > > > > > which wraps an hc5 client. There's no limit set on the hc5
> > > > > > > connection pool (Integer.MAX_VALUE), and resource use is bounded
> > > > > > > using a variation of additive-increase/multiplicative-decrease
> > > > > > > limiters to scale based on load. Some endpoints are capable of
> > > > > > > returning a raw binary stream; this introduces some risk of
> > > > > > > resource leaks -- ideally we would take an inputstream consumer
> > > > > > > to fully own the lifecycle, but unfortunately that's not an API
> > > > > > > change we can make at this point. We return permits before
> > > > > > > passing the resulting inputstream to custom code to prevent
> > > > > > > leaks from blocking the entire service.
> > > > > > >
> > > > > > > Unfortunately leased connections are anchored directly to both
> > > > > > > StrictConnPool and LaxConnPool in order to validate that
> > > > > > > connections are returned to the pool where they originated,
> > > > > > > which prevents a relatively large swath of resources from being
> > > > > > > garbage collected in the event of a leak. I realize my case is a
> > > > > > > bit different due to the effectively unlimited pool size, where
> > > > > > > memory becomes the limiting factor rather than requests blocking
> > > > > > > against the pool limit, but I believe there are a few options
> > > > > > > that can help us to both detect and endure leaks.
> > > > > > >
> > > > > > > A couple potential solutions:
> > > > > > >
> > > > > > > Use weak references to track leased connections, with a periodic
> > > > > > > sweep to detect leaks and reclaim per-route permits. This
> > > > > > > approach would be complex, but allows the default configuration
> > > > > > > to detect and handle rare response leaks without eventually
> > > > > > > blocking connection requests. Using this approach, a leaked
> > > > > > > connection wouldn't necessarily be closed when the pool is shut
> > > > > > > down, but may survive until a full GC. In most cases I expect
> > > > > > > clients to be long-lived, and I'm not sure what guarantees are
> > > > > > > made about leaked connections.
> > > > > > >
> > > > > > > Use a counter to track leased connections instead of tracking
> > > > > > > leased entries directly. This loses the safety guarantees that
> > > > > > > connections aren't already leased, and that they're being
> > > > > > > returned to the pool where they originated; however, it
> > > > > > > eliminates the need for the connection pool's hash table and
> > > > > > > references to checked-out values. This solution would not be
> > > > > > > helpful when connections are leaked and the route limit is
> > > > > > > reached, so it may make more sense as a
> > > > > > > PoolConcurrencyPolicy.UNLIMITED option. This design would not be
> > > > > > > able to close leased connections when the pool is closed; they
> > > > > > > would remain open until they are released, at which point the
> > > > > > > closed pool could close the incoming connection.
> > > > > > >
> > > > > > > Have you encountered similar problems? If you would consider a
> > > > > > > contribution with one of these designs, or another I have not
> > > > > > > yet considered, I will open a ticket and plan to implement a
> > > > > > > solution. Any and all feedback is greatly appreciated.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Carter Kozak
> > > > > > >
> > > > > >
> > > > > > Hi Carter
> > > > > >
> > > > > > Automatic recovery of leaked connections was actively discussed in
> > > > > > the HC 3.x and early HC 4.x days without ever producing any
> > > > > > practical outcome. It is _massively_ easier to just release
> > > > > > connections than to jump through many hoops trying to reclaim what
> > > > > > might or might not be leaked connections.
> > > > > >
> > > > > > You are very welcome to give it a shot, but I would kindly ask you
> > > > > > to start it off as a completely new connection manager initially.
> > > > > > Once there is a working solution we can decide how to fold it into
> > > > > > the project and in what form.
> > > > > >
> > > > >
> > > > > This might be made easier by using Apache Commons Pool, which can
> > > > > track borrowed but abandoned objects:
> > > > >
> > > > > The pool can also be configured to detect and remove "abandoned"
> > > > > objects, i.e. objects that have been checked out of the pool but
> > > > > neither used nor returned before the configured removeAbandonedTimeout
> > > > > <https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/AbandonedConfig.html#getRemoveAbandonedTimeout-->.
> > > > > Abandoned object removal can be configured to happen when
> > > > > borrowObject is invoked and the pool is close to starvation, or it
> > > > > can be executed by the idle object evictor, or both. If pooled
> > > > > objects implement the TrackedUse
> > > > > <https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/TrackedUse.html>
> > > > > interface, their last use will be queried using the getLastUsed
> > > > > method on that interface; otherwise abandonment is determined by how
> > > > > long an object has been checked out from the pool.
> > > > >
> > > > > See
> > > > > https://commons.apache.org/proper/commons-pool/apidocs/index.html?org/apache/commons/pool2/impl/GenericObjectPool.html
> > > > >
> > > > > Gary
> > > > >
> > > > > > Cheers
> > > > > >
> > > > > > Oleg
> > > > >
> > > >
> > > > Thank you both,
> > > >
> > > > Oleg, I agree that it's much easier to close resources than it is to
> > > > write a pool that is capable of recovery, and we shouldn't take on
> > > > undue complexity to force a solution to work. That said, the surface
> > > > area of code that has the potential to leak (code that uses HC) is
> > > > orders of magnitude larger than the connection pool, and a fix could
> > > > reduce JIRA load on the project. I will plan to implement an initial
> > > > experimental version as an alternative to
> > > > PoolingHttpClientConnectionManager so we can avoid impacting users,
> > > > and I can validate it in a high-load production environment before it
> > > > is considered for a release.
> > > >
> > > > Gary, thanks for pointing out the abandonment feature in commons-pool!
> > > > I worry about detection based on a timeout because it's difficult to
> > > > be certain an object is abandoned. While multi-hour requests waiting
> > > > for response headers aren't a good idea, I've run into a few places
> > > > that depend on them and wouldn't want to break users.
> > > >
> > > > Carter
> > > >
> > >
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@hc.apache.org
For additional commands, e-mail: dev-h...@hc.apache.org
