On 5/28/13 10:08 PM, Konstantin Belousov wrote:
On Tue, May 28, 2013 at 01:35:01PM -0700, Alfred Perlstein wrote:
[[  moved to hackers@ from private mail. ]]

On 5/28/13 1:13 PM, John Baldwin wrote:
On Tuesday, May 28, 2013 3:29:41 pm Alfred Perlstein wrote:
On 5/28/13 9:04 AM, John Baldwin wrote:
On Tuesday, May 28, 2013 2:13:32 am Alfred Perlstein wrote:
Hey folks,

I had a talk with Nathan Whitehorn about the camisr issue.  We mostly
understand the issue we are seeing, but to summarize: we are losing the
camisr signal and the camisr is not being run.

I gave him a summary of what we have been seeing and pointed him to the
code I am concerned about here:
http://pastebin.com/tLKr7mCV  (this is inside of kern_intr.c).

What I think is happening is that the setting of it_need to 0
inside of sys/kern/kern_intr.c:ithread_loop() is not being scheduled
correctly, and it is being delayed until AFTER the call to
ithread_execute_handlers() right below the atomic_store_rel_int().
This seems highly unlikely, to the extent that if this were true all our
locking primitives would be broken.  The store_rel is actually a release
barrier which acts more like a read/write fence.  No memory accesses (read or
write) from before the release can be scheduled after the associated store,
either by the compiler or the CPU.  That is what Konstantin is referring to in
his commit when he says "release semantics".
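To make that ordering contract concrete, here is a minimal sketch using the
atomic(9) primitives; the data/flag names and the two functions are invented
for illustration, they are not from kern_intr.c:

#include <sys/types.h>
#include <machine/atomic.h>

static int data;
static volatile u_int flag;

/*
 * Writer: the plain store to data cannot be reordered after the
 * release store to flag, by either the compiler or the CPU.
 */
static void
publish(void)
{

	data = 42;
	atomic_store_rel_int(&flag, 1);
}

/*
 * Reader: the acquire load of flag keeps the later load of data from
 * being reordered before it, so seeing flag == 1 guarantees seeing
 * data == 42.
 */
static int
consume(void)
{

	if (atomic_load_acq_int(&flag) == 1)
		return (data);
	return (-1);
}

Note that the guarantee only holds because both sides operate on the same
variable (flag): release on the store side, acquire on the load side.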
Yes, that makes sense, but does it specify that the writes *must*
occur at that *point*?  If it only enforces ordering then we may have
an issue, specifically because the setting of it_need to '1' inside of
intr_event_schedule_thread() has no barrier other than the acq semantics
of the thread lock.  I am wondering what is forcing out the '1' there.
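For reference, the scheduler side looks roughly like this, paraphrasing the
intr_event_schedule_thread() of that era (not a verbatim copy; error handling
and tracing elided):

	/*
	 * it_need is set with a plain store, and the only barrier that
	 * follows it is whatever taking the thread lock provides
	 * (acquire semantics).
	 */
	it->it_need = 1;
	thread_lock(td);
	if (TD_AWAITING_INTR(td)) {
		/* The ithread is asleep: put it on the run queue. */
		TD_CLR_IWAIT(td);
		sched_add(td, SRQ_INTR);
	} else {
		/* Already running; it_need should make it loop again. */
	}
	thread_unlock(td);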
Nothing ever forces writes out.  You would have to invalidate the cache to do
that, and that is horribly expensive.  It is always only about ordering and
knowing that if you complete another operation on the same "cookie" variable
with acquire semantics, the earlier writes will be visible.
By cookie, you mean a specific memory address, basically a lock?  This is
starting to reinforce my suspicions, as the setting of it_need is done
with release semantics, but the acquire on the other CPU is done on the
thread lock.  Maybe that is irrelevant.  We will find out shortly.

See below as I think we have proof that this is somehow happening.
Having ih_need of 1 and it_need of 0 is certainly busted.  The simplest fix
is probably to stop using atomics on it_need and just grab the thread lock
in the main ithread loop and hold it while checking and clearing it_need.
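A rough sketch of that approach, assuming the current shape of ithread_loop()
and with everything but the check-and-clear elided (not a tested patch):

	/*
	 * Check and clear it_need while holding the thread lock; the
	 * scheduler side would likewise set it_need with the thread
	 * lock held, so the two cannot race.
	 */
	thread_lock(td);
	while (ithd->it_need) {
		ithd->it_need = 0;
		thread_unlock(td);
		ithread_execute_handlers(p, ie);
		thread_lock(td);
	}
	/*
	 * Fall through to the existing sleep logic, which already runs
	 * with the thread lock held.
	 */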

OK, we have some code that will either prove this, or perturb the memory
ordering enough to make the bug go away, or prove this assertion wrong.

We will update on our findings hopefully in the next few days.
IMO the read of it_need in the 'while (ithd->it_need)' should
have acquire semantics, otherwise the later reads in
ithread_execute_handlers(), in particular of ih_need, could pass
the read of it_need and cause the situation you reported.  I do not
see any acquire barrier between the condition in the while() statement
and the read of ih_need in ithread_execute_handlers().
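Concretely, that would mean turning the plain read into an acquire load,
something like this (sketch only, not a tested patch):

	/* Before: plain load, nothing orders the later ih_need reads. */
	while (ithd->it_need) {
		atomic_store_rel_int(&ithd->it_need, 0);
		ithread_execute_handlers(p, ie);
	}

	/*
	 * After: acquire load, so the reads of ih_need done inside
	 * ithread_execute_handlers() cannot pass the read of it_need.
	 */
	while (atomic_load_acq_int(&ithd->it_need) != 0) {
		atomic_store_rel_int(&ithd->it_need, 0);
		ithread_execute_handlers(p, ie);
	}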

It is probably true that the issue you see was caused by r236456, in the
sense that the implicitly locked xchgl instruction on x86 has full barrier
semantics.  As a result, the store_rel() was actually an acquire too, making
this reordering impossible.  I argue that this is not a bug in r236456,
but an issue in kern_intr.c.
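For reference, my rough understanding of what r236456 changed on x86
(paraphrased, not the literal diff; the function names below are invented to
show both flavors side by side):

#include <sys/types.h>

/*
 * Pre-r236456 flavor: the release store was an implicitly locked
 * xchgl, which is a full barrier, so it had acquire semantics as a
 * side effect.
 */
static __inline void
old_atomic_store_rel_int(volatile u_int *p, u_int v)
{

	__asm __volatile("xchgl %1,%0"
	    : "+m" (*p), "+r" (v)
	    :
	    : "memory");
}

/*
 * Post-r236456 flavor: a compiler barrier plus a plain store is
 * sufficient for release semantics on x86 (TSO), but it is no longer
 * a full barrier.
 */
static __inline void
new_atomic_store_rel_int(volatile u_int *p, u_int v)
{

	__asm __volatile("" : : : "memory");
	*p = v;
}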
If I remember the code correctly, that would probably explain why we see it only on 9.1 systems.

On the other hand, John's suggestion to move the manipulations of
it_need under the lock is probably the best anyway.

I was wondering if it would be lower latency to keep maintaining it_need lock-free, but to add another variable, it_needlocked, protected by the thread lock.  This would result in potential superfluous interrupts, but under load it would allow the ithread to loop some number of times without taking the thread lock.

I am not really sure this optimization is worth it (especially since it can result in superfluous interrupts), but it may reduce latency, and that might be important.
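A rough sketch of the shape of that idea; it_needlocked and ITHREAD_MAX_SPINS
are invented names, and this is only meant to show the idea, not a patch:

	/*
	 * The scheduler sets both flags, the lock-free one first and
	 * the locked one under the thread lock.  The ithread services
	 * the cheap lock-free flag for a bounded number of passes and
	 * only falls back to the locked flag to decide whether to
	 * sleep.  A stale it_need can cost a superfluous pass through
	 * the handlers, but never a lost wakeup, because the sleep
	 * decision is made under the thread lock.
	 */
	for (;;) {
		for (spins = 0; spins < ITHREAD_MAX_SPINS &&
		    atomic_load_acq_int(&ithd->it_need) != 0; spins++) {
			atomic_store_rel_int(&ithd->it_need, 0);
			ithread_execute_handlers(p, ie);
		}

		thread_lock(td);
		if (ithd->it_needlocked) {
			ithd->it_needlocked = 0;
			thread_unlock(td);
			continue;
		}
		/* ... TD_SET_IWAIT(td); mi_switch(); ... as today ... */
		thread_unlock(td);
	}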

Are there people I can pass the patch to for help with performance testing once we confirm that this is the actual bug?  We can do internal testing, but I am worried about regressing the performance of any form of I/O in the kernel.

I'll show the patch soon.

Thank you for the information.  This is promising.

-Alfred


