On Thu, 2018-10-04 at 11:31 -0500,  Benjamin Marzinski  wrote:
> On Mon, Oct 01, 2018 at 09:51:22PM +0200, Martin Wilck wrote:
> > On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote:
> > > The code previously was timing out mode if ct->thread was 0 but
> > > ct->running wasn't. This combination never happens.  The idea was
> > > to
> > > timeout if for some reason the path checker tried to kill the
> > > thread,
> > > but it didn't die.  The correct thing to check for this is ct-
> > > > holders.
> > > 
> > > ct->holders will always be at least one when libcheck_check() is
> > > called,
> > > since libcheck_free() won't get called until the device is no
> > > longer
> > > being checked. So, if ct->holders is 2, that means that the tur
> > > thread
> > > is has not shut down yet.
> > > 
> > > Also, instead of returning PATH_TIMEOUT whenever the thread
> > > hasn't
> > > died,
> > > it should only time out if the thread didn't successfully get a
> > > value,
> > > which means the previous state was already PATH_TIMEOUT.
> > 
> > What about PATH_UNCHECKED?
> > 
> 
> I don't see how that could happen. In this state we've definitely set
> ct->state to PATH_PENDING when we started the thread. The thread will
> only change  ct->state to the return of tur_check(), which doesn't
> ever return PATH_UNCHECKED. Am I missing something here?

OK. The only thing you missed was a comment :-)

This stuff is so subtle that we should describe exactly how
the locking works, which actor is supposed/entitled to set what field
to what value in what situation, and so on.

> 
> > > 
> > > Signed-off-by: Benjamin Marzinski <[email protected]>
> > > ---
> > >  libmultipath/checkers/tur.c | 15 +++++++++------
> > >  1 file changed, 9 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/libmultipath/checkers/tur.c
> > > b/libmultipath/checkers/tur.c
> > > index bf8486d..275541f 100644
> > > --- a/libmultipath/checkers/tur.c
> > > +++ b/libmultipath/checkers/tur.c
> > > @@ -355,12 +355,15 @@ int libcheck_check(struct checker * c)
> > >           }
> > >           pthread_mutex_unlock(&ct->lock);
> > >   } else {
> > > -         if (uatomic_read(&ct->running) != 0) {
> > > -                 /* pthread cancel failed. continue in sync mode
> > > */
> > > -                 pthread_mutex_unlock(&ct->lock);
> > > -                 condlog(3, "%s: tur thread not responding",
> > > -                         tur_devt(devt, sizeof(devt), ct));
> > > -                 return PATH_TIMEOUT;
> > > +         if (uatomic_read(&ct->holders) > 1) {
> > > +                 /* pthread cancel failed. If it didn't get the
> > > path
> > > +                    state already, timeout */
> > > +                 if (ct->state == PATH_PENDING) {
> > > +                         pthread_mutex_unlock(&ct->lock);
> > > +                         condlog(3, "%s: tur thread not
> > > responding",
> > > +                                 tur_devt(devt, sizeof(devt),
> > > ct));
> > > +                         return PATH_TIMEOUT;
> > > +                 }
> > >           }
> > 
> > I'm not quite getting it yet. We're entering this code path if 
> > ct->thread is 0. As you point out, if there are >1 holders, a
> > previously cancelled TUR thread must be still "alive". But now we
> > want
> > to check *again*. The previous code refused to do that - a single
> > hanging TUR thread could block further checks from happening,
> > forever.
> > The PATH_TIMEOUT return code was actually questionable - in this
> > libcheck_check() invocation, we didn't even try to check, so
> > nothing
> > could actually time out (but that's obviously independent of your
> > patch).
> > 
> > With your patch, if ct->state != PATH_PENDING, we'd try to start
> > another TUR thread although there's still a hanging one. That's not
> > necessarily wrong, but I fail to see why it depends on ct->state.
> > Anyway, if we do this, we take the risk of starting more and more
> > TUR
> > threads which might all end up hanging; I wonder if we should stop
> > creating more threads if ct->holders grows further (e.g. > 5).
> 
> It's been a while, and I'm not exactly sure what I was thinking here.
> But IIRC the idea was that if the state isn't set yet, then the old
> thread could mess with the results of a future thread. Also, it was
> to
> avoid a corner case where the tur checker was called, started the
> thread, got the result before the checker exitted, cancelled the
> thread
> and returned the result and then was called very shortly afterwards,
> before the thread could clean itself up.  In that case, the right
> thing
> to do (I thought) would be to start a new thread, because the other
> one
> would be ending soon.  In truth, starting a new thread at all is
> probably a bad idea, since the old thread will still mess with the
> checker context on exit. 
> 
> A better idea might be to simply fail back to a syncronous call to
> tur_check() when you notice a cancelled thread that hasn't exitted.
> That can cause all the other checkers to get delayed, but at least in
> that case you are still checking paths. The other option is to do as
> before and just not check this path, which won't effect the other
> checkers. It seems very unlikely that the thread could remain
> uncancelled forever, especially if the path was usable.
>
> thoughts?

Generally speaking, we're deeply in the realm of highly unlikely
situations I would say. But we should get it right eventually.

Maybe we can add logic to the tur thread to keep its hands off the
context if it's a "zombie", like below (just a thought, untested)?

Martin


diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
index bf8486d3..e8493ca8 100644
--- a/libmultipath/checkers/tur.c
+++ b/libmultipath/checkers/tur.c
@@ -219,11 +219,18 @@ static void cleanup_func(void *data)
        rcu_unregister_thread();
 }
 
+#define tur_thread_quit_unless_owner(__ct)        \
+       if (__ct->thread != pthread_self()) {      \
+               pthread_mutex_unlock(&__ct->lock); \
+               pthread_exit(NULL);                \
+       }
+
 static void copy_msg_to_tcc(void *ct_p, const char *msg)
 {
        struct tur_checker_context *ct = ct_p;
 
        pthread_mutex_lock(&ct->lock);
+       tur_thread_quit_unless_owner(ct);
        strlcpy(ct->message, msg, sizeof(ct->message));
        pthread_mutex_unlock(&ct->lock);
 }
@@ -243,6 +250,7 @@ static void *tur_thread(void *ctx)
 
        /* TUR checker start up */
        pthread_mutex_lock(&ct->lock);
+       tur_thread_quit_unless_owner(ct);
        ct->state = PATH_PENDING;
        ct->message[0] = '\0';
        pthread_mutex_unlock(&ct->lock);
@@ -252,6 +260,7 @@ static void *tur_thread(void *ctx)
 
        /* TUR checker done */
        pthread_mutex_lock(&ct->lock);
+       tur_thread_quit_unless_owner(ct);
        ct->state = state;
        pthread_cond_signal(&ct->active);
        pthread_mutex_unlock(&ct->lock);




-- 
Dr. Martin Wilck <[email protected]>, Tel. +49 (0)911 74053 2107
SUSELinux GmbH, GF: Felix Imendörffer, Jane Smithard,  Graham Norton
HRB 21284 (AG Nürnberg)


--
dm-devel mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/dm-devel

Reply via email to