On 6/20/16 12:30 PM, Gleb Smirnoff wrote:
> On Mon, Jun 20, 2016 at 12:14:18PM +0200, Hans Petter Selasky wrote:
> H> On 06/20/16 11:58, Gleb Smirnoff wrote:
> H> > The fix I am working on now is doing exactly that. callout_reset must
> H> > return 0 if the callout is currently running.
> H> >
> H> > What are the old paths impacted?
> H> 
> H> I'll dig into the matter aswell and give some comments. Thanks for the 
> H> analysis, Gleb.
> H> 
> H> FYI: This class of problems wouldn't exist if the callout could be 
> H> associated with a mutex!
> Exactly! I am convinced that all callouts should be locked, and non-locked
> one should simply go away, as well as async drain.
> What does prevent us from converting TCP timeouts to locked? To my
> understanding it is the lock order of taking pcbinfo after pcb lock.
> I'm now trying to understand Julien's conversion from pcbinfo lock
> to pcbinfo + pcblist locks. It seems to me that we actually can convert
> TCP timers to locked callouts.
> What for do we need pcbinfo read lock in a tcp timer? The timer works
> only on the pcb and doesn't modify global lists, does it?

 tcp_timer_keep() for example can modify global pcb list, see the call
stack below:

tcp_usr_detach() (via pr->pr_usrreqs->pru_detach() in sofree())

 Anyway, a bit of history here:

 o In stable/10 the TCP locking order is:

ipi_lock (before) inpcb lock

 and in stable/10 ipi_lock is protecting the global pcb list.  Then,
only in the cases where you were absolutely sure you are _not_ going to
modify the global pcb list you are allowed to take the inpcb lock only.
For all the other cases, you have to take the write lock on ipi_lock
first due to lock order.

 And in context of short-lived TCP connection of the 5 received packets
for a TCP connection, 4 require the write ipi_lock lock, and that's does
not scale well.

 Lesson learned:  For scaling perspective, in lock ordering always put
the most global lock last.

 It was improved in a lean way in 11:

 o In 11 the TCP lock order became:

ipi_lock (before) inpcb lock (before) ipi_list

 And in 11 ipi_list protects global pcb list, and only the code actually
modifying the global list is protected by a global write lock, e.g.:


 Then why keeping the ipi_lock lock at all now?

 It is solely for one case:  When you need the complete stability of the
full pcb list while traversing it.  These full pcb list traversals are
done in functions like:  in_pcbnotifyall(), in_pcbpurgeif0(),
inp_apply_all(), tcp_ccalgounload(), tcp_drain(), etc.

 Thus in 11 ipi_lock write lock is required only when you do this full
traversal with list stability requirement, and the ipi_lock read lock in
all other cases like tcp_input() that then scales better.

 Sadly in 11, you cannot use the inpcb lock as is for the TCP timers
callout, because like tcp_timer_keep() most TCP timers callout can
modify the global pcb list and then you need the read lock ipi_lock in
top of inpcb lock...

 o Future/12:

 The next natural step is to remove the ipi_lock from the picture to get:

inpcb lock (before) ipi_list

 It /just/ requires a global pcb list that allows addition/deletion
while doing a full traversal concurrently.  A classical way to achieve
that, is to use a RCU-based list.  As soon as RCU (or anything
equivalent like lwref) are supported in kernel, we will implement this

 Just history here, as I already did a presentation on this exact topic
at BSDCan 2015:

 It was recorded but I never saw the footage/presentation actually
published. :)


Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to