Michael Kelly, on Fri 12 Sep 2025 at 19:27:00 +0100, wrote:
> On 08/09/2025 21:58, Samuel Thibault wrote:
> > Michael Kelly, on Mon 08 Sep 2025 at 07:05:39 +0100, wrote:
> > > The changes I suggest do not access the list in this way after the
> > > mutex has been released. The next iteration restarts the scan from
> > > the (possibly new) head of the list.
> >
> > Ah, it restarts over on each cancellation. That's really bad:
> > asymptotically quadratic complexity. That will hit us sooner or later.
>
> Well, yes, I must agree with you. I had assumed a very small list
> length, of the order of 10 or so, but clearly the length is potentially
> limited only by memory resources, so the total number of iterations
> grows rapidly as the length increases. Fair point. I did actually
> address this issue by managing additional list pointers within rpc_info
> to maintain a separate 'cancellation list'. I then, however, concluded
> that the whole approach was flawed. For example, suppose two separate
> threads initiate an interrupt_operation on the same remote port. On the
> remote side, one thread would capture the RPCs to cancel, whilst a
> second thread could soon commence (once _ports_lock is released), find
> no RPCs to cancel, and complete before the first group were actually
> cancelled. This doesn't seem right, and in any case is a change in
> behaviour.
>
> I think what is needed is a feature that guarantees the current
> sequential behaviour per port_info but also permits genuine concurrency
> across different port_info. That isn't really true at the moment
> because of the use of the global _ports_lock. This could possibly be a
> port_info-specific mutex, or perhaps an extension of the 'blocking'
> flags. I'm investigating those options currently, but it seems
> difficult to ensure a locking order that prevents deadlock.
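For concreteness, the restart-from-head pattern being discussed looks
roughly like the sketch below. The names are hypothetical stand-ins, not
the actual patch: finding the k-th uncancelled entry re-walks the k-1
entries already handled, so n cancellations cost on the order of n^2
list steps in total.

#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the real libports rpc_info.  */
struct rpc_info
{
  struct rpc_info *next;
  int cancelled;
};

/* Cancel every RPC on the list, restarting the scan from the (possibly
   new) head after each one.  */
static void
cancel_all_restarting (struct rpc_info **head, pthread_mutex_t *lock)
{
  pthread_mutex_lock (lock);
  for (;;)
    {
      struct rpc_info *rpc;

      /* Restart from the head, skipping entries already cancelled.  */
      for (rpc = *head; rpc != NULL; rpc = rpc->next)
        if (!rpc->cancelled)
          break;
      if (rpc == NULL)
        break;  /* Nothing left to cancel.  */

      rpc->cancelled = 1;
      pthread_mutex_unlock (lock);
      /* ... cancel the RPC without holding the lock ...  */
      pthread_mutex_lock (lock);
    }
  pthread_mutex_unlock (lock);
}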
I don't think we need a strict sequential guarantee. Why not record the
port_info* pointers in a local array? This requires extending their
lifetime in case of a concurrent termination of the RPC, but that can be
done by adding a reference counter and some port_info_ref/_unref helpers
that maintain it; ports_end_rpc can then wait for the counter to drop to
zero. (A rough sketch is appended at the end of this message.)

> > > I don't understand the suggestion about not re-cancelling a thread
> > > already in cancellation due to a signal.
> >
> > I'm not talking only about signals, but also about interruption: our
> > issue is that ports_interrupt_rpcs calls hurd_thread_cancel, which
> > cancels the thread and for that calls _hurdsig_abort_rpcs, which
> > might get stuck inside the __interrupt_operation() call. If in
> > hurd_thread_cancel we check ss->cancel and avoid calling
> > _hurdsig_abort_rpcs again, we won't call __interrupt_operation()
> > again and get stuck there.
> >
> > > That occurs within the originating client, but isn't the storm of
> > > interruptions being generated on the server side?
> >
> > On the server side there can be a cascade of interruptions too, yes,
> > but at least it wouldn't pile up cancellations.
>
> I tested your idea of not re-cancelling a thread by wrapping the code,
> from where the 'cancel' state is set to 1 to after the call to the
> cancellation hook, with a guard of if (ss->cancel != 1). To capture a
> record of these 'needless cancellations', I dumped some debug output
> using mach_print (ext2fs doesn't seem to report on stderr). It took 3.5
> hours of my test case before the system locked up, with the scenario
> involving a call to interrupt_operation() on ext2fs, which then calls
> interrupt_operation on another task (as reported in earlier messages).
> This was using released Hurd source code without my alterations.
>
> During the test run, there were 1825 needless cancellations (2 (term),
> 362 (ext2fs), 1439 (proc), 22 (storeio)).

Ok, that's not that much actually.

> I think that this is an above-average 'time to failure', but I'd have
> to make a statistically relevant number of runs for this to mean much.
> It does however seem to show that there is still a need to address the
> issue of holding _ports_lock whilst calling hurd_thread_cancel().

If the traces still show that the threads are stuck in
interrupt_operation, indeed.

Samuel
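Appendix: a rough sketch of the refcounting idea above. The field and
helper layouts here are hypothetical; the real port_info in libports
differs, and the counter could equally live in rpc_info.

#include <pthread.h>

/* The global libports lock; in this sketch it also protects the
   counter.  */
extern pthread_mutex_t _ports_lock;

/* Hypothetical additions to struct port_info.  */
struct port_info
{
  int refcnt;                   /* References from cancellation snapshots. */
  pthread_cond_t refcnt_cond;   /* Broadcast when refcnt reaches zero.  */
  /* ... the real fields ...  */
};

/* Both helpers are called with _ports_lock held.  */
static void
port_info_ref (struct port_info *pi)
{
  pi->refcnt++;
}

static void
port_info_unref (struct port_info *pi)
{
  if (--pi->refcnt == 0)
    pthread_cond_broadcast (&pi->refcnt_cond);
}

/* Called with _ports_lock held before tearing the structure down
   (e.g. from ports_end_rpc): wait until no snapshot still holds a
   reference.  */
static void
port_info_wait_refs (struct port_info *pi)
{
  while (pi->refcnt > 0)
    pthread_cond_wait (&pi->refcnt_cond, &_ports_lock);
}

The canceller would then, under _ports_lock, copy the port_info pointers
into a local array and port_info_ref each of them; drop the lock; call
hurd_thread_cancel for each; and finally retake the lock and
port_info_unref them.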

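And, for reference, the shape of the ss->cancel guard tested above,
reduced to a self-contained sketch: the surrounding hurd_thread_cancel
logic is elided, and 'sigstate' is a stand-in for the real struct
hurd_sigstate.

/* Stand-in for the relevant part of struct hurd_sigstate.  */
struct sigstate
{
  int cancel;
};

/* The guard: skip the abort/hook machinery when the thread is already
   being cancelled, so that __interrupt_operation is not re-entered.  */
static void
cancel_thread_once (struct sigstate *ss)
{
  if (ss->cancel != 1)
    {
      ss->cancel = 1;
      /* ... abort the thread's RPCs (_hurdsig_abort_rpcs) and run the
         cancellation hook here; on re-entry this whole block is
         skipped ...  */
    }
}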
