> -----Original Message----- > From: [email protected] > [mailto:[email protected]] On > Behalf Of Doug Ledford > Sent: Wednesday, December 10, 2014 11:47 AM > To: [email protected]; [email protected] > Cc: Doug Ledford > Subject: [PATCH 03/10] IPoIB: fix MCAST_FLAG_BUSY usage > > Commit a9c8ba5884 (IPoIB: Fix usage of uninitialized multicast objects) > added a new flag MCAST_JOIN_STARTED, but was not very strict in how it > was used. We didn't always initialize the completion struct before we > set the flag, and we didn't always call complete on the completion > struct from all paths that complete it. This made it less than totally > effective, and certainly made its use confusing. And in the flush > function we would use the presence of this flag to signal that we should > wait on the completion struct, but we never cleared this flag, ever. > This is further muddied by the fact that we overload the MCAST_FLAG_BUSY > flag to mean two different things: we have a join in flight, and we have > succeeded in getting an ib_sa_join_multicast. > > In order to make things clearer and aid in resolving the rtnl deadlock > bug I've been chasing, I cleaned this up a bit. > > 1) Remove the MCAST_JOIN_STARTED flag entirely > 2) Un-overload MCAST_FLAG_BUSY so it now only means a join is in-flight > 3) Test on mcast->mc directly to see if we have completed > ib_sa_join_multicast (using IS_ERR_OR_NULL) > 4) Make sure that before setting MCAST_FLAG_BUSY we always initialize > the mcast->done completion struct > 5) Make sure that before calling complete(&mcast->done), we always clear > the MCAST_FLAG_BUSY bit > 6) Take the mcast_mutex before we call ib_sa_multicast_join and also > take the mutex in our join callback. This forces ib_sa_multicast_join > to return and set mcast->mc before we process the callback. This way, > our callback can safely clear mcast->mc if there is an error on the join > and we will do the right thing as a result in mcast_dev_flush. > 7) Because we need the mutex to synchronize mcast->mc, we can no longer > call mcast_sendonly_join directly from mcast_send and instead must add > sendonly join processing to the mcast_join_task > > A number of different races are resolved with these changes. These > races existed with the old MCAST_FLAG_BUSY usage, the > MCAST_JOIN_STARTED flag was an attempt to address them, and while it > helped, a determined effort could still trip things up. > > One race looks something like this: > > Thread 1 Thread 2 > ib_sa_join_multicast (as part of running restart mcast task) > alloc member > call callback > ifconfig ib0 down > wait_for_completion > callback call completes > wait_for_completion in > mcast_dev_flush completes > mcast->mc is PTR_ERR_OR_NULL > so we skip ib_sa_leave_multicast > return from callback > return from ib_sa_join_multicast > set mcast->mc = return from ib_sa_multicast > > We now have a permanently unbalanced join/leave issue that trips up the > refcounting in core/multicast.c > > Another like this: > > Thread 1 Thread 2 Thread 3 > ib_sa_multicast_join > ifconfig ib0 down > priv->broadcast = NULL > join_complete > wait_for_completion > mcast->mc is not yet set, so don't clear > return from ib_sa_join_multicast and set mcast->mc > complete > return -EAGAIN (making mcast->mc invalid) > call ib_sa_multicast_leave > on invalid mcast->mc, hang > forever > > By holding the mutex around ib_sa_multicast_join and taking the mutex > early in the callback, we force mcast->mc to be valid at the time we run > the callback. This allows us to clear mcast->mc if there is an error > and the join is going to fail. We do this before we complete the mcast. > In this way, mcast_dev_flush always sees consistent state in regards to > mcast->mc membership at the time that the wait_for_completion() returns. > > Signed-off-by: Doug Ledford <[email protected]>
Acked-by: Alex Estrin <[email protected]> > --- > drivers/infiniband/ulp/ipoib/ipoib.h | 10 +- > drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 148 > ++++++++++++++++--------- > 2 files changed, 101 insertions(+), 57 deletions(-) > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h > b/drivers/infiniband/ulp/ipoib/ipoib.h > index d7562beb542..f4c1b20b23b 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib.h > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h > @@ -98,9 +98,15 @@ enum { > > IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ > IPOIB_MCAST_FLAG_SENDONLY = 1, > - IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ > + /* > + * For IPOIB_MCAST_FLAG_BUSY > + * When set, in flight join and mcast->mc is unreliable > + * When clear and mcast->mc IS_ERR_OR_NULL, need to restart or > + * haven't started yet > + * When clear and mcast->mc is valid pointer, join was successful > + */ > + IPOIB_MCAST_FLAG_BUSY = 2, > IPOIB_MCAST_FLAG_ATTACHED = 3, > - IPOIB_MCAST_JOIN_STARTED = 4, > > MAX_SEND_CQE = 16, > IPOIB_CM_COPYBREAK = 256, > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > index 9862c76a83f..a52c9f3f7e4 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > @@ -271,16 +271,27 @@ ipoib_mcast_sendonly_join_complete(int status, > struct ipoib_mcast *mcast = multicast->context; > struct net_device *dev = mcast->dev; > > + /* > + * We have to take the mutex to force mcast_sendonly_join to > + * return from ib_sa_multicast_join and set mcast->mc to a > + * valid value. Otherwise we were racing with ourselves in > + * that we might fail here, but get a valid return from > + * ib_sa_multicast_join after we had cleared mcast->mc here, > + * resulting in mis-matched joins and leaves and a deadlock > + */ > + mutex_lock(&mcast_mutex); > + > /* We trap for port events ourselves. */ > if (status == -ENETRESET) > - return 0; > + goto out; > > if (!status) > status = ipoib_mcast_join_finish(mcast, &multicast->rec); > > if (status) { > if (mcast->logcount++ < 20) > - ipoib_dbg_mcast(netdev_priv(dev), "multicast join > failed for > %pI6, status %d\n", > + ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast " > + "join failed for %pI6, status %d\n", > mcast->mcmember.mgid.raw, status); > > /* Flush out any queued packets */ > @@ -290,11 +301,15 @@ ipoib_mcast_sendonly_join_complete(int status, > dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); > } > netif_tx_unlock_bh(dev); > - > - /* Clear the busy flag so we try again */ > - status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, > - &mcast->flags); > } > +out: > + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); > + if (status) > + mcast->mc = NULL; > + complete(&mcast->done); > + if (status == -ENETRESET) > + status = 0; > + mutex_unlock(&mcast_mutex); > return status; > } > > @@ -312,12 +327,14 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast > *mcast) > int ret = 0; > > if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) { > - ipoib_dbg_mcast(priv, "device shutting down, no multicast > joins\n"); > + ipoib_dbg_mcast(priv, "device shutting down, no sendonly " > + "multicast joins\n"); > return -ENODEV; > } > > - if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { > - ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n"); > + if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { > + ipoib_dbg_mcast(priv, "multicast entry busy, skipping " > + "sendonly join\n"); > return -EBUSY; > } > > @@ -325,6 +342,9 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast > *mcast) > rec.port_gid = priv->local_gid; > rec.pkey = cpu_to_be16(priv->pkey); > > + mutex_lock(&mcast_mutex); > + init_completion(&mcast->done); > + set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); > mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, > priv->port, &rec, > IB_SA_MCMEMBER_REC_MGID | > @@ -337,12 +357,14 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast > *mcast) > if (IS_ERR(mcast->mc)) { > ret = PTR_ERR(mcast->mc); > clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); > - ipoib_warn(priv, "ib_sa_join_multicast failed (ret = %d)\n", > - ret); > + complete(&mcast->done); > + ipoib_warn(priv, "ib_sa_join_multicast for sendonly join " > + "failed (ret = %d)\n", ret); > } else { > - ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting > join\n", > - mcast->mcmember.mgid.raw); > + ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting " > + "sendonly join\n", mcast->mcmember.mgid.raw); > } > + mutex_unlock(&mcast_mutex); > > return ret; > } > @@ -390,22 +412,28 @@ static int ipoib_mcast_join_complete(int status, > ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n", > mcast->mcmember.mgid.raw, status); > > + /* > + * We have to take the mutex to force mcast_join to > + * return from ib_sa_multicast_join and set mcast->mc to a > + * valid value. Otherwise we were racing with ourselves in > + * that we might fail here, but get a valid return from > + * ib_sa_multicast_join after we had cleared mcast->mc here, > + * resulting in mis-matched joins and leaves and a deadlock > + */ > + mutex_lock(&mcast_mutex); > + > /* We trap for port events ourselves. */ > - if (status == -ENETRESET) { > - status = 0; > + if (status == -ENETRESET) > goto out; > - } > > if (!status) > status = ipoib_mcast_join_finish(mcast, &multicast->rec); > > if (!status) { > mcast->backoff = 1; > - mutex_lock(&mcast_mutex); > if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) > queue_delayed_work(ipoib_workqueue, > &priv->mcast_task, 0); > - mutex_unlock(&mcast_mutex); > > /* > * Defer carrier on work to ipoib_workqueue to avoid a > @@ -413,37 +441,35 @@ static int ipoib_mcast_join_complete(int status, > */ > if (mcast == priv->broadcast) > queue_work(ipoib_workqueue, &priv->carrier_on_task); > - > - status = 0; > - goto out; > - } > - > - if (mcast->logcount++ < 20) { > - if (status == -ETIMEDOUT || status == -EAGAIN) { > - ipoib_dbg_mcast(priv, "multicast join failed for %pI6, > status > %d\n", > - mcast->mcmember.mgid.raw, status); > - } else { > - ipoib_warn(priv, "multicast join failed for %pI6, > status %d\n", > - mcast->mcmember.mgid.raw, status); > + } else { > + if (mcast->logcount++ < 20) { > + if (status == -ETIMEDOUT || status == -EAGAIN) { > + ipoib_dbg_mcast(priv, "multicast join failed > for %pI6, > status %d\n", > + mcast->mcmember.mgid.raw, > status); > + } else { > + ipoib_warn(priv, "multicast join failed for > %pI6, status > %d\n", > + mcast->mcmember.mgid.raw, status); > + } > } > - } > - > - mcast->backoff *= 2; > - if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) > - mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; > - > - /* Clear the busy flag so we try again */ > - status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); > > - mutex_lock(&mcast_mutex); > + mcast->backoff *= 2; > + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) > + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; > + } > +out: > spin_lock_irq(&priv->lock); > - if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) > + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); > + if (status) > + mcast->mc = NULL; > + complete(&mcast->done); > + if (status == -ENETRESET) > + status = 0; > + if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags)) > queue_delayed_work(ipoib_workqueue, &priv->mcast_task, > mcast->backoff * HZ); > spin_unlock_irq(&priv->lock); > mutex_unlock(&mcast_mutex); > -out: > - complete(&mcast->done); > + > return status; > } > > @@ -492,10 +518,9 @@ static void ipoib_mcast_join(struct net_device *dev, > struct > ipoib_mcast *mcast, > rec.hop_limit = priv->broadcast->mcmember.hop_limit; > } > > - set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); > + mutex_lock(&mcast_mutex); > init_completion(&mcast->done); > - set_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags); > - > + set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); > mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port, > &rec, comp_mask, GFP_KERNEL, > ipoib_mcast_join_complete, mcast); > @@ -509,13 +534,12 @@ static void ipoib_mcast_join(struct net_device *dev, > struct > ipoib_mcast *mcast, > if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) > mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; > > - mutex_lock(&mcast_mutex); > if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) > queue_delayed_work(ipoib_workqueue, > &priv->mcast_task, > mcast->backoff * HZ); > - mutex_unlock(&mcast_mutex); > } > + mutex_unlock(&mcast_mutex); > } > > void ipoib_mcast_join_task(struct work_struct *work) > @@ -568,7 +592,8 @@ void ipoib_mcast_join_task(struct work_struct *work) > } > > if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { > - if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) > + if (IS_ERR_OR_NULL(priv->broadcast->mc) && > + !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) > ipoib_mcast_join(dev, priv->broadcast, 0); > return; > } > @@ -576,23 +601,33 @@ void ipoib_mcast_join_task(struct work_struct *work) > while (1) { > struct ipoib_mcast *mcast = NULL; > > + /* > + * Need the mutex so our flags are consistent, need the > + * priv->lock so we don't race with list removals in either > + * mcast_dev_flush or mcast_restart_task > + */ > + mutex_lock(&mcast_mutex); > spin_lock_irq(&priv->lock); > list_for_each_entry(mcast, &priv->multicast_list, list) { > - if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) > - && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) > - && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, > &mcast->flags)) { > + if (IS_ERR_OR_NULL(mcast->mc) && > + !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) && > + !test_bit(IPOIB_MCAST_FLAG_ATTACHED, > &mcast->flags)) { > /* Found the next unjoined group */ > break; > } > } > spin_unlock_irq(&priv->lock); > + mutex_unlock(&mcast_mutex); > > if (&mcast->list == &priv->multicast_list) { > /* All done */ > break; > } > > - ipoib_mcast_join(dev, mcast, 1); > + if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) > + ipoib_mcast_sendonly_join(mcast); > + else > + ipoib_mcast_join(dev, mcast, 1); > return; > } > > @@ -638,6 +673,9 @@ static int ipoib_mcast_leave(struct net_device *dev, > struct > ipoib_mcast *mcast) > int ret = 0; > > if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) > + ipoib_warn(priv, "ipoib_mcast_leave on an in-flight join\n"); > + > + if (!IS_ERR_OR_NULL(mcast->mc)) > ib_sa_free_multicast(mcast->mc); > > if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { > @@ -690,6 +728,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, > struct > sk_buff *skb) > memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid)); > __ipoib_mcast_add(dev, mcast); > list_add_tail(&mcast->list, &priv->multicast_list); > + if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) > + queue_delayed_work(ipoib_workqueue, &priv->mcast_task, > 0); > } > > if (!mcast->ah) { > @@ -703,8 +743,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, > struct > sk_buff *skb) > if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) > ipoib_dbg_mcast(priv, "no address vector, " > "but multicast join already started\n"); > - else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) > - ipoib_mcast_sendonly_join(mcast); > > /* > * If lookup completes between here and out:, don't > @@ -766,7 +804,7 @@ void ipoib_mcast_dev_flush(struct net_device *dev) > > /* seperate between the wait to the leave*/ > list_for_each_entry_safe(mcast, tmcast, &remove_list, list) > - if (test_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags)) > + if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) > wait_for_completion(&mcast->done); > > list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { > -- > 2.1.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to [email protected] > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to [email protected] More majordomo info at http://vger.kernel.org/majordomo-info.html
