On Mon, 2014-04-07 at 13:02 +0300, Faidon Liambotis wrote:
> Hi,
>
> This regression is fairly complicated and it's high impact, as mptsas is
> being used to drive fairly popular controllers, including the
> entry-level ones in several generations of Dell PowerEdge servers.
>
> We've been debugging this for a while now over at Ubuntu's Launchpad[1]
> and the issue has been subsequently been raised on both the
> linux-scsi[2] & systemd mailing lists[3].
>
> In essence, there are four different behaviors/bugs here:
>
> 1) The kthread_create() semantics have changed in 3.13 with 786235ee by
> making kthreads killable. Not a bug on its own, but it's a "breaks
> previously working userspace configuration" kind of bug. Ubuntu has
> reverted this patch for trusty as a workaround.

No, kthread_create() itself is killable.

> 2) mptsas, to probe the SAS bus, spawns a kthread that takes more than
> 30s to complete. The consensus on the list AIUI is that it's a bug and
> it should not take that long.

kthread_create() spawns a thread, it doesn't wait for the thread to
complete (what would be the point of creating a thread, then?). So this
hang in kthread_create() needs to be understood.

> 3) systemd-udev by default sends SIGKILL to kthreads that have been
> running for more than 30s. systemd developers do not consider this a bug
> but an intended behavior and refuse to fix this issue. Adding
> "OPTIONS+="event_timeout=120" to the udev config would probably
> workaround this.

systemd-udev will kill its own child process, which is stuck in
kthread_create().

> 4) Unrelated to the bug at hand, mptsas is buggy in the error handling
> codepath, when the kthread spawning fails. It tries to clean up by
> dereferencing a NULL pointer and hence the kernel oopses, while
> otherwise it'd just continue running, just without any mptsas devices
> present. I've made an analysis of the buggy codepath on comment #27 on
> the LP bug above. This has always been a bug, it's just that that
> codepath was untested until now.

Right.

> The end result is that this regression is somewhere in the limbo land
> between kernel/systemd for the two features (1)/(2) that are valid on
> their own but reveal a regression in combination with (3) and each other.
>
> Issue (2) seems like a real bug and the root cause here, but one that
> probably can't be easily fixed in a point release -- I don't think it
> hasn't even been fixed in master yet.

No sign of it, but SCSI bug fixes never seem to get applied quickly.

Ben.

> Issue (4) is easily fixable but it's orthogonal and not going to solve
> the real problem here. It will just downgrade this from an oops to
> "just" a system with no disk drives but an otherwise working kernel.
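
A rough userspace analogy for the reply to (3), in Python. This is only a
sketch and not udev's actual code: the function name, the 0.5s timeout and
the sleeping stand-in command are all made up for illustration. It shows a
supervisor sending SIGKILL to its own child process once a timeout expires,
which is the role systemd-udev plays towards the child that is stuck in
kthread_create().

```python
import signal
import subprocess
import sys


def run_with_event_timeout(cmd, timeout_s):
    """Run cmd and SIGKILL it if it outlives timeout_s (hypothetical helper).

    Returns the child's exit status; on POSIX a negative value -N means
    the child was terminated by signal N.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()         # SIGKILL on POSIX
        return proc.wait()  # reap the child and collect its status


# Assumption: a sleeping child stands in for the blocked worker process.
slow_worker = [sys.executable, "-c", "import time; time.sleep(5)"]
status = run_with_event_timeout(slow_worker, timeout_s=0.5)
killed_by_sigkill = (status == -signal.SIGKILL)
```

The point of the analogy is that the signal goes to the supervisor's own
child, not to any thread that child asked to have created; a longer timeout
only delays the kill, it does not explain why the child blocks for so long.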

-- 
Ben Hutchings
Sturgeon's Law: Ninety percent of everything is crap.