On Thursday 12 July 2007 22:24:36 Joshua Wise wrote:
> Some time after my first[1] patch to mce.c, I had continued testing using
> the test method described there (inject hundreds of thousands of single-bit
> errors while reading from /dev/mcelog), and I came across a strange issue.
> Once every couple test cycles (i.e., every hundred-thousand injects), the
> system would begin behaving very strangely. The serial port would stop
> responding entirely, and my SSH sessions would respond, but to one keystroke
> behind what I had typed. After two minutes, the watchdog timer would reboot
> the system.

Thanks for the analysis. I guess at some point it would make sense to model
the whole thing in spin or similar and find even the last race. Anyone
interested? @)

>
> At this point, mce_read() and mce_log() are blocking on each other, and will
> be for all eternity. mce_log() is waiting for the on_each_cpu() to complete
> so that the loop can complete, and on_each_cpu() (from mce_read()) is
> waiting for mce_log() to finish.

I suspect the right fix is to get rid of collect_tscs(). One possible way
would be to get the TSCs from the mce_log table before the synchronization
instead of asking the CPUs. Or switch it over to the NMI supporting RCU that
was recently posted.

> -- there may be other edge cases other than
> this one. I'm actually surprised that this wasn't a ring buffer to start
> with -- it certainly seems like it wanted to be one.

The problem with a ring buffer is that it would lose old entries; but for
machine checks you really want the first entries because the later ones
might be just junk.

-Andi
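
To make the ring buffer trade-off concrete, here is a minimal single-threaded sketch in plain C. It is not the mce.c code: the names (first_log, ring_log, LOG_LEN) and the tiny capacity are made up for illustration, and all of the locking and NMI-safety concerns discussed above are left out on purpose. It only shows which entries survive an overflow under each policy.

```c
#include <stddef.h>

#define LOG_LEN 4   /* hypothetical capacity, chosen small for illustration */

struct entry {
    int seq;        /* stand-in for a real machine check record */
};

/* Keep-first log: once full, new entries are dropped and counted. */
struct first_log {
    struct entry e[LOG_LEN];
    size_t next;      /* number of entries stored so far */
    size_t dropped;   /* entries rejected because the log was full */
};

static int first_log_add(struct first_log *l, int seq)
{
    if (l->next >= LOG_LEN) {
        l->dropped++;
        return -1;    /* the earliest entries stay untouched */
    }
    l->e[l->next++].seq = seq;
    return 0;
}

/* Ring log: once full, each new entry overwrites the oldest one. */
struct ring_log {
    struct entry e[LOG_LEN];
    size_t head;      /* total number of entries ever written */
};

static void ring_log_add(struct ring_log *l, int seq)
{
    l->e[l->head % LOG_LEN].seq = seq;
    l->head++;
}

/* Sequence number of the oldest entry still present in the ring. */
static int ring_log_oldest(const struct ring_log *l)
{
    size_t start = l->head > LOG_LEN ? l->head - LOG_LEN : 0;
    return l->e[start % LOG_LEN].seq;
}
```

Feeding six events (numbered 1 to 6) into both logs leaves the keep-first log holding 1 to 4 with two drops counted, while the ring holds 3 to 6, so the first event, the one most likely to describe the original fault, is gone from the ring.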

