Le 19/10/2015 17:01, Willy Tarreau a écrit :
On Mon, Oct 19, 2015 at 03:06:44PM +0200, Christopher Faulet wrote:
OK so the unused objects in the tree have a refcount of 1 while the used
ones have 2 or more, thus the refcount is always valid. Good that also
means we must not test if the tree is null or not in ssl_sock_close(),
we must always free the ssl_ctx as long as it was dynamically created,
so that its refcount decreases, otherwise it keeps increasing upon every
reuse.

No. Maybe my explanation was not really clear. The SSL_CTX refcount is
not exposed. It is an internal parameter. So, it is not incremented when
the SSL_CTX is pushed in the cache tree.

The call to SSL_set_SSL_CTX increases the refcount and the call to
SSL_free decrements it (when the connection is closed). And, of course,
the call to SSL_CTX_free decrements it too. The SSL_CTX object is
released when the refcount reaches 0.

For a SSL_CTX object, SSL_CTX_free must be called once. When it is
evicted from the cache tree (or when the tree is destroyed) _OR_ when
the connection is closed if there is no cache tree. If we always release
SSL_CTX objects when the SSL connection is closed, we will have
undefined references for cached objects, leading to a segfault.

OK, I understood the opposite, which is that we kept a refcount for each
user (cache and/or sessions).

But then how do we know that an SSL_CTX is still in use when we want to
evict it from the cache and that we must not free it ? Is it just the
fact that between the moment it's picked from the cache using
ssl_sock_get_generated_cert() and the moment it's associated to a session
using SSL_set_SSL_CTX() it's not possible to yield and destroy the cached
object so no race is possible here ? If so I'm fine with it for now (though
it will become "fun" when we start to play with threads), I just want to
be certain we're not overlooking this part as well.

This is not an issue because when we get (or create) a SSL_CTX object then it is associated to a session, without any interruption. So it cannot be evicted from the cache in the middle. After this step, the refcount is >= 2, so, if the SSL_CTX object is evicted from the cache, the refcount is decremented and the SSL_CTX is not released. It will be automatically released with the closure of the last SSL connection using it.

But, now this works for a non-threaded environment. Is there any plan to add thread support? If yes, this feature will not work.


Also that raises another point : if the issue is related to SSL_CTX_free()
being called on static contexts, then to me it means that these contexts
were not properly refcounted when assigned to the SSL. Don't you think
that we shouldn't instead do something like the following to properly
refcount any context attached to an SSL and ensure that the SSL_CTX_free()
can always be performed regardless of parallel activities in the LRU tree
or anything else ?

                 /* Alloc a new SSL session ctx */
                 conn->xprt_ctx = 
SSL_new(objt_server(conn->target)->ssl_ctx.ctx);
+               SSL_set_SSL_CTX(conn->xprt_ctx, 
objt_server(conn->target)->ssl_ctx.ctx);

This last call will have no effect. Because the SSL_CTX is the same, this function returns immediately. Note that if the context changes, the refcount of the old one is decremented.

But there is no issue here, because the static contexts are only released when HAProxy is stopped.

Here is live cycle of static contexts:

- HAProxy is started, static contexts are initialized by calling SSL_CTX_new (refcount is set to 1).

- SSL connections use these contexts. SSL_new or SSL_set_SSL_CTX are called to assign a context to a SSL object. The refcount is incremented by 1 each time. When a SSL connection is closed, a call to SSL_free is done to release the SSL object and the refcount of the associated context is decremented. So the refcount is always greater or equal to 1.

- HAPRoxy is stopped, all connections are closed, and finally, static contexts are freed by calling SSL_CTX_free. The refcount is equal to 1, so when SSL_CTX_free is called, it reaches 0 and the contexts is freed.

The refcount is not incremented when a SSL_CTX object is pushed in the
cache. There is no way to manually increment or decrement it. So, we
must really know if the SSL_CTX object was cached or not when the SSL
connection is closed.

I'm having an issue here as well since the LRU's destroy callback is set
to SSL_CTX_free. This we start with a non-null refcount. I'm sorry if I am
not clear, but the problem I'm having could be described like this :

   - two sets of entities can use a shared resource at any instant : cache
     and SSL sessions ;
   - each of them uses SSL_CTX_free() at release time to release the object ;
   - SSL_CTX_free() takes care of the refcount to know if it must free or not,
     which means that these two entities above are each responsible for one
     refcount point ;
   - the SSL_CTX_free() called by the cache is unconditional when the object
     is evicted from the cache ;
   - the SSL_CTX_free() is only done if the cache is enabled ;

This last step is wrong. SSL_CTX_free is only done if the cache is _DISABLED_ or NULL if you prefer. SSL_CTX_free is called when the object is evicted from the cache _OR_ when the cache is disabled. So everything works fine.

Due to last line I deduce that with this condition we're leaking SSL_CTX
when the cache is disabled. I'm possibly missing something, it's just that
seeing one entity monitor its adversary in a refcounted system triggers in
me the feeling that we're painting over the real problem.

I understand that what we're trying to achieve is only to avoid calling
SSL_CTX_free() on objects that were never dynamically allocated, which
is why I'm seriously thinking about refcounting these ones as well. I don't
agree with not calling it with objects that were dynamically allocated but
could never be added to the cache (ie cache disabled), for me it's a leak.
And in the end I suspect that we're still facing an imbalanced refcounting
system that instead of crashing will slowly leak some SSL_CTX.

Due to you previous mistake, this reasoning is wrong of course.

--
Christopher Faulet

Reply via email to