Michael Kelly wrote on Mon, 01 Sep 2025 23:07:44 +0100:
> On 31/08/2025 22:47, Samuel Thibault wrote:
> > Michael Kelly wrote on Sat, 30 Aug 2025 21:29:46 +0100:
> > > This sequence of hurd_thread_cancel() calls all occur whilst a single
> > > process wide mutex is held locked (see libports:interrupt_rpcs.c).
> > You mean the _ports_lock mutex?
> Yes.

Note that this mutex protects the current_rpcs list. Going through the
list without the mutex would be unsafe. Another way would be to record
the thread ports in a local array, and call hurd_thread_cancel() in a
loop after releasing the mutex. But then the threads might still die
in between. We could add a reference to keep the port allocated, but
hurd_thread_cancel would then allocate a signal state again for dead
threads...

> > > The same lock is also required to begin or end other RPCs on other
> > > ports and so they must wait until the initial interrupt_operation
> > > completes.
> > ? I don't think ports_interrupt_rpcs actually waits for something to
> > finish? hurd_thread_cancel() should be asynchronous, and
> > _ports_record_interruption clearly is.
>
> I hadn't any evidence to present so today I reran the stress test without my
> code changes to ports_interrupt_rpcs(). I have attached a reduced version of
> the very long set of stack traces from the ext2fs server. I have the
> complete list of all threads saved but it's a bit long for this message. In
> summary the traces show:
>
> 1) One thread (thread: 35) handling an interrupt_operation request. This
> shows it making a secondary interrupt_operation RPC to a storeio task. The
> port in use has a msgcount of 5 preventing immediate delivery of this
> message.

Ah! interrupt_operation calls pile up... So indeed ports_interrupt_rpcs
takes some time. But this is actually useless: one is enough for a
given thread.

I wonder if in glibc's hurd_thread_cancel, we could just add an
if (!ss->cancel) condition around the lines from ss->cancel = 1; to the
call of the cancel hook. That way, if we try to cancel the same thread
several times, we'll just suspend/resume it several times, and not call
interrupt_operation on the server several times.

Samuel
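
To make the if (!ss->cancel) idea concrete, here is a minimal standalone
sketch. It is a model only, not the glibc code: struct model_sigstate,
model_abort_rpcs(), model_thread_cancel() and model_check_cancel() are
all hypothetical names, and the counters stand in for the real
suspend/resume and interrupt_operation traffic. It assumes the cancel
flag is cleared once the target thread has noticed the cancellation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread signal state.  Only the
   `cancel' flag corresponds to something named in this thread
   (ss->cancel); the counters exist purely for illustration.  */
struct model_sigstate
{
  bool cancel;          /* a cancellation is already pending */
  int suspend_count;    /* suspend/resume pairs performed */
  int interrupts_sent;  /* interrupt_operation calls that would be made */
};

/* Stand-in for the expensive part: aborting the RPC in progress, which
   is what ends up sending interrupt_operation to the server.  */
static void
model_abort_rpcs (struct model_sigstate *ss)
{
  ss->interrupts_sent++;
}

/* Model of a cancel request with the proposed guard: the thread is
   still suspended and resumed on every call, but the flag is set and
   the RPC aborted only if no cancellation is pending yet.  */
static void
model_thread_cancel (struct model_sigstate *ss)
{
  assert (ss != NULL);

  ss->suspend_count++;          /* models suspending the thread */
  if (!ss->cancel)
    {
      ss->cancel = true;
      model_abort_rpcs (ss);    /* at most once per pending cancel */
    }
  /* the thread would be resumed here */
}

/* Model of the target thread noticing the cancellation: report the
   flag and clear it, so that a later cancel request is acted on.  */
static bool
model_check_cancel (struct model_sigstate *ss)
{
  bool was_pending = ss->cancel;
  ss->cancel = false;
  return was_pending;
}
```

In this model, five back-to-back model_thread_cancel() calls on the same
state bump suspend_count five times but interrupts_sent only once; after
model_check_cancel() clears the flag, the next cancel request is sent
again. Whether the real flag is always cleared early enough for that is
something to check against the actual code.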

