Hi, On 6/20/16 12:30 PM, Gleb Smirnoff wrote: > On Mon, Jun 20, 2016 at 12:14:18PM +0200, Hans Petter Selasky wrote: > H> On 06/20/16 11:58, Gleb Smirnoff wrote: > H> > The fix I am working on now is doing exactly that. callout_reset must > H> > return 0 if the callout is currently running. > H> > > H> > What are the old paths impacted? > H> > H> I'll dig into the matter aswell and give some comments. Thanks for the > H> analysis, Gleb. > H> > H> FYI: This class of problems wouldn't exist if the callout could be > H> associated with a mutex! > > Exactly! I am convinced that all callouts should be locked, and non-locked > one should simply go away, as well as async drain. > > What does prevent us from converting TCP timeouts to locked? To my > understanding it is the lock order of taking pcbinfo after pcb lock. > > I'm now trying to understand Julien's conversion from pcbinfo lock > to pcbinfo + pcblist locks. It seems to me that we actually can convert > TCP timers to locked callouts. > > What for do we need pcbinfo read lock in a tcp timer? The timer works > only on the pcb and doesn't modify global lists, does it?
tcp_timer_keep() for example can modify global pcb list, see the call stack below: tcp_timer_keep() tcp_drop() tcp_close() sofree() tcp_usr_detach() (via pr->pr_usrreqs->pru_detach() in sofree()) tcp_detach() in_pcbfree() in_pcbremlists() Anyway, a bit of history here: o In stable/10 the TCP locking order is: ipi_lock (before) inpcb lock and in stable/10 ipi_lock is protecting the global pcb list. Then, only in the cases where you were absolutely sure you are _not_ going to modify the global pcb list you are allowed to take the inpcb lock only. For all the other cases, you have to take the write lock on ipi_lock first due to lock order. And in context of short-lived TCP connection of the 5 received packets for a TCP connection, 4 require the write ipi_lock lock, and that's does not scale well. Lesson learned: For scaling perspective, in lock ordering always put the most global lock last. It was improved in a lean way in 11: o In 11 the TCP lock order became: ipi_lock (before) inpcb lock (before) ipi_list And in 11 ipi_list protects global pcb list, and only the code actually modifying the global list is protected by a global write lock, e.g.: https://github.com/freebsd/freebsd/blob/master/sys/netinet/in_pcb.c#L1285 Then why keeping the ipi_lock lock at all now? It is solely for one case: When you need the complete stability of the full pcb list while traversing it. These full pcb list traversals are done in functions like: in_pcbnotifyall(), in_pcbpurgeif0(), inp_apply_all(), tcp_ccalgounload(), tcp_drain(), etc. Thus in 11 ipi_lock write lock is required only when you do this full traversal with list stability requirement, and the ipi_lock read lock in all other cases like tcp_input() that then scales better. Sadly in 11, you cannot use the inpcb lock as is for the TCP timers callout, because like tcp_timer_keep() most TCP timers callout can modify the global pcb list and then you need the read lock ipi_lock in top of inpcb lock... o Future/12: The next natural step is to remove the ipi_lock from the picture to get: inpcb lock (before) ipi_list It /just/ requires a global pcb list that allows addition/deletion while doing a full traversal concurrently. A classical way to achieve that, is to use a RCU-based list. As soon as RCU (or anything equivalent like lwref) are supported in kernel, we will implement this change. Just history here, as I already did a presentation on this exact topic at BSDCan 2015: https://wiki.freebsd.org/201506DevSummit#line-83 It was recorded but I never saw the footage/presentation actually published. :) -- Julien
Description: OpenPGP digital signature