Hi Stephen,
Many thanks for the review. I will send out a v3 to address most of them.
In the meantime, I would like to discuss further about two points from the
review comments. Please see my input below.
> -----Original Message-----
> 3. Spinlock held across blocking operations.
>
> priv->reset_ops_lock is rte_spinlock_t and is held across:
>
> - mana_reset_enter: mana_dev_stop, mana_dev_close,
> ibv_close_device,
> mana_mp_req_on_rxtx (5s IPC timeout)
> - mana_reset_exit_delay: ibv_close_device, mana_pci_probe,
> mana_mp_req_on_rxtx, mana_dev_start
>
> IPC with a 5-second timeout and device probe under a spinlock is
> not acceptable. Use a sleeping mutex (pthread_mutex_t initialized
> with PTHREAD_PROCESS_SHARED, since priv is in shared memory), or
> split the lock so the long operations run outside it.
>
The concern about holding rte_spinlock_t across blocking operations
is well noted. However, in the current design this works correctly, for
the following reasons:
1. mana_reset_enter runs on the EAL interrupt thread in response
to IBV_EVENT_DEVICE_FATAL. The device is already in a fatal
error state - the blocking IB verbs calls (ibv_close_device,
ibv_dealloc_pd) and IPC (mana_mp_req_on_rxtx) are unavoidable
teardown steps that must complete before recovery can begin.
2. The ethdev operations (configure, queue_setup, rss_hash_update,
etc.) all use rte_spinlock_trylock and return -EBUSY immediately
if the lock is held. They never spin-wait on the lock during
reset, so there is no starvation or latency impact on the
application threads.
3. dev_stop_lock and dev_close_lock use blocking rte_spinlock_lock,
but they first join the reset thread (which releases the lock
when done), so by the time they attempt to acquire the lock,
the reset path has already released it. IMHO, the actual spin-wait
is negligible.
4. The mana_reset_exit_delay path (re-probe, re-start) also holds
the lock, but this runs on a dedicated control thread where
blocking is acceptable. No data path or application thread is
affected.
The net effect is that the spinlock is held for an extended period
only on the interrupt thread and the reset control thread - neither
of which are latency sensitive. Application threads never block on
the spinlock.
That said, switching to pthread_mutex_t with PTHREAD_PROCESS_SHARED
is a valid improvement that would make the locking more conventional
and avoid any theoretical busy-wait concerns. I can make this change if
it is still desirable. I would like to know your thoughts on this.
> Warnings
> --------
>
> 5. Secondary data path has no synchronization with the doorbell unmap.
>
> MANA_MP_REQ_RESET_ENTER causes the secondary MP handler to
> munmap
> proc_priv->db_page and set it to NULL. The secondary's rx_burst /
> tx_burst protect themselves only with an atomic state check:
>
> rte_rcu_qsbr_thread_online(dstate_qsv, tid);
> if (state != MANA_DEV_ACTIVE || !db_page) { ... return 0; }
>
> But the qsbr only has primary threads registered (registration
> happens in mana_dev_configure, which never runs in secondary), so
> thread_online/offline in secondary do not block the primary's
> qsbr_check. The MP handler in secondary therefore unmaps db_page
> while a peer secondary lcore can still be inside rx/tx_burst with
> a stale pointer. Result: SIGSEGV on the next doorbell write.
>
> The secondary needs its own quiescence mechanism (a per-process
> qsbr or a reader-side rwlock around the data path that the MP
> handler acquires before unmap).
>
The concern is based on the assumption that "the qsbr only has
primary threads registered (registration happens in
mana_dev_configure, which never runs in secondary)". This seems
incorrect.
My understanding is when the primary starts, the total number of queues
for primary and all secondaries is already known, which is stored in
priv->num_queues by the time mana_dev_congiure() is called.
The RCU thread IDs are per-queue, not per-process. In mana_dev_configure,
I am assigning tids used in rte_rcu_qsbr_thread_online/offline as:
- RX path: tid = rxq->rxq_idx (0 to num_queues-1)
- TX path: tid = num_queues + txq_idx (num_queues to 2*num_queues-1)
These queue structures (rxq, txq) live in shared hugepage memory
via dev->data->rx_queues[] and dev->data->tx_queues[]. The QSV
itself (priv->dev_state_qsv) is also in shared hugepage memory
(allocated with rte_zmalloc_socket). So both primary and secondary
processes operate on the same QSV with the same thread IDs.
The rte_rcu_qsbr_thread_register call in mana_dev_configure
registers IDs 0..2*num_queues-1, which covers every queue
regardless of which process polls it. I think in DPDK
a given queue is polled by only one lcore at a time, so whether
that lcore is in the primary or a secondary process, it uses the
same tid for the same queue.
The reset synchronization flow is:
1. mana_reset_enter sets dev_state = MANA_DEV_RESET_ENTER
2. Calls rte_rcu_qsbr_check to wait for all registered threads
3. Secondary lcores polling rx/tx_burst see dev_state != ACTIVE,
call rte_rcu_qsbr_thread_offline, and return 0
4. Only after ALL threads (primary and secondary) are quiescent
does rte_rcu_qsbr_check return success
5. Then mana_reset_enter proceeds to stop queues and send
MANA_MP_REQ_RESET_ENTER to unmap secondary doorbells
At step 5, no secondary lcore can be inside the data path with
a stale doorbell pointer - the RCU quiescence check at step 2-4
guarantees this. The secondary MP handler unmaps db_page only
after all secondary data path threads have exited.
So, in my opinion, this works correctly for secondaries as well.
Thanks,
Wei