Cong Wang wrote:
> On Wed, Sep 2, 2020 at 7:22 PM Yunsheng Lin <linyunsh...@huawei.com> wrote:
> >
> > On 2020/9/3 9:48, Cong Wang wrote:
> > > On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin <linyunsh...@huawei.com> 
> > > wrote:
> > >>
> > >> On 2020/9/3 8:35, Cong Wang wrote:
> > >>> On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin <linyunsh...@huawei.com> 
> > >>> wrote:
> > >>>>
> > >>>> On 2020/9/2 12:41, Cong Wang wrote:
> > >>>>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin <linyunsh...@huawei.com> 
> > >>>>> wrote:
> > >>>>>>
> > >>>>>> On 2020/9/2 2:24, Cong Wang wrote:
> > >>>>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin 
> > >>>>>>> <linyunsh...@huawei.com> wrote:
> > >>>>>>>>
> > >>>>>>>> Currently there are concurrent reset and enqueue operations on the
> > >>>>>>>> same lockless qdisc, because there is no lock to synchronize the
> > >>>>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> > >>>>>>>> qdisc_deactivate() called by dev_deactivate_queue(). This may cause
> > >>>>>>>> an out-of-bounds access of priv->ring[] in the hns3 driver when the
> > >>>>>>>> user has requested a smaller queue num: __dev_xmit_skb() can still
> > >>>>>>>> enqueue an skb with a larger queue_mapping after the corresponding
> > >>>>>>>> qdisc is reset, and hns3_nic_net_xmit() is called with that skb later.
> > >>>>>>>
> > >>>>>>> Can you be more specific here? Which call path requests a smaller
> > >>>>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> > >>>>>>> we already have a synchronize_net() there.
> > >>>>>>
> > >>>>>> When the netdevice is in the active state, the synchronize_net() seems
> > >>>>>> to do the right thing, as below:
> > >>>>>>
> > >>>>>> CPU 0:                                   CPU1:
> > >>>>>> __dev_queue_xmit()                       netif_set_real_num_tx_queues()
> > >>>>>> rcu_read_lock_bh();
> > >>>>>> netdev_core_pick_tx(dev, skb, sb_dev);
> > >>>>>>         .
> > >>>>>>         .                                dev->real_num_tx_queues = txq;
> > >>>>>>         .                                        .
> > >>>>>>         .                                        .
> > >>>>>>         .                                synchronize_net();
> > >>>>>>         .                                        .
> > >>>>>> q->enqueue()                                     .
> > >>>>>>         .                                        .
> > >>>>>> rcu_read_unlock_bh()                             .
> > >>>>>>                                          qdisc_reset_all_tx_gt()
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>> Right.
> > >>>>>
> > >>>>>
> > >>>>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a
> > >>>>>> problem too.
> > >>>>>>
> > >>>>>> The problem we hit is as below:
> > >>>>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> > >>>>>> to deactivate the netdevice when the user requests a smaller queue num,
> > >>>>>> and txq->qdisc has already been changed to noop_qdisc by the time
> > >>>>>> netif_set_real_num_tx_queues() is called, so the synchronize_net() in
> > >>>>>> netif_set_real_num_tx_queues() does not help here.
> > >>>>>
> > >>>>> How could qdisc still be running after deactivating the device?
> > >>>>
> > >>>> The qdisc could be running during the device deactivation process.
> > >>>>
> > >>>> The main process of changing the channel number is as below:
> > >>>>
> > >>>> 1. dev_deactivate()
> > >>>> 2. hns3 hardware related setup
> > >>>> 3. netif_set_real_num_tx_queues()
> > >>>> 4. netif_tx_wake_all_queues()
> > >>>> 5. dev_activate()
> > >>>>
> > >>>> During step 1, the qdisc could still be running while it is being reset,
> > >>>> so there could be skbs left in the old qdisc (which will be restored back
> > >>>> to txq->qdisc during dev_activate()), as below:
> > >>>>
> > >>>> CPU 0:                                   CPU1:
> > >>>> __dev_queue_xmit():                      dev_deactivate_many():
> > >>>> rcu_read_lock_bh();                      qdisc_deactivate(qdisc);
> > >>>> q = rcu_dereference_bh(txq->qdisc);              .
> > >>>> netdev_core_pick_tx(dev, skb, sb_dev);           .
> > >>>>         .
> > >>>>         .                                rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
> > >>>>         .                                        .
> > >>>>         .                                        .
> > >>>>         .                                        .
> > >>>>         .                                        .
> > >>>> q->enqueue()                                     .
> > >>>
> > >>>
> > >>> Well, like I said, if the deactivated bit were tested before ->enqueue(),
> > >>> there would be no packet queued after qdisc_deactivate().

Trying to unwind this through git history :/

The original code had a test_bit() in __dev_xmit_skb():

        if (q->flags & TCQ_F_NOLOCK) {
                if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
                        __qdisc_drop(skb, &to_free);
                        rc = NET_XMIT_DROP;
                } else {
                        rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
                        __qdisc_run(q);
                }

                if (unlikely(to_free))
                        kfree_skb_list(to_free);
                return rc;
        }

So we would never enqueue something on top of a deactivated qdisc. And to ensure
we don't have any in-flight enqueues we have to swap qdiscs, wait a grace
period, and then reset the qdisc. That _should_ be OK.
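
Hand-waving a bit, the ordering I mean looks something like this (just a
sketch with a made-up deactivate_one_txq() helper, not the actual
dev_deactivate_many() code):

static void deactivate_one_txq(struct netdev_queue *txq)
{
        struct Qdisc *qdisc = rtnl_dereference(txq->qdisc);

        /* 1. Mark the qdisc deactivated so __dev_xmit_skb() starts
         *    dropping instead of enqueueing once it sees the bit.
         */
        set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);

        /* 2. Swap in noop_qdisc so new senders never pick up the old qdisc. */
        rcu_assign_pointer(txq->qdisc, &noop_qdisc);

        /* 3. Wait out any sender that already did
         *    rcu_dereference_bh(txq->qdisc) under rcu_read_lock_bh().
         */
        synchronize_net();

        /* 4. Only now purge whatever those senders managed to enqueue. */
        qdisc_reset(qdisc);
}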

But, I'm still not entirely sure how you got here. In the drivers I worked on
I always stop the queue with netif_tx_stop_queue() before messing with the
queue configuration. Then we really should not get these packets into the driver.
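
For reference, the pattern I have in mind is roughly this (a hypothetical
driver, foo_* helpers made up, error handling elided, using netif_tx_disable()
as the stop-all variant of that API):

static int foo_set_channels(struct net_device *dev, unsigned int num_txq)
{
        /* Stop every TX queue and wait for in-flight xmits to finish;
         * from here on sch_direct_xmit() sees the queues stopped.
         */
        netif_tx_disable(dev);

        foo_free_tx_rings(dev);           /* tear down the old rings    */
        foo_alloc_tx_rings(dev, num_txq); /* build the new, smaller set */

        /* Tell the stack about the new count before waking the queues. */
        netif_set_real_num_tx_queues(dev, num_txq);

        netif_tx_wake_all_queues(dev);
        return 0;
}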

In sch_direct_xmit():

        if (likely(skb)) {
                HARD_TX_LOCK(dev, txq, smp_processor_id());
                if (!netif_xmit_frozen_or_stopped(txq))
                        skb = dev_hard_start_xmit(skb, dev, txq, &ret);

                HARD_TX_UNLOCK(dev, txq);
        } else {
                if (root_lock)
                        spin_lock(root_lock);
                return true;
        }
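
If I'm reading that right, once the txq is stopped the skb never reaches
dev_hard_start_xmit(); sch_direct_xmit() just requeues it, so the driver
should never see an skb with a stale queue_mapping.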

Maybe I missed something? Does your driver use the netif_tx_stop/start APIs?
 
> > >>
> > >> Only if the deactivated bit testing is also protected by qdisc->seqlock?
> > >> Otherwise there is still a window between setting and testing the
> > >> deactivated bit.
> > >
> > > Can you be more specific here? Why is testing or setting a bit not atomic?
> >
> > Testing a bit or setting a bit separately is atomic.
> > But the combination of testing a bit and setting a bit is not atomic, right?
> >
> >   cpu0:                 cpu1:
> >        .                testing A bit
> >   setting A bit               .
> >        .                      .
> >        .                qdisc enqueuing
> >   qdisc reset
> >
> 
> Well, this was not a problem until commit d518d2ed8640c1cbbb.
> Prior to that commit, the qdisc could still be scheduled even with this
> race condition, that is, a packet enqueued just after the reset could
> still be dequeued by qdisc_run().
> 
> It is amazing to see how messy the lockless qdisc is now.
> 
> Thanks.

