Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Thu, 8 Apr 2021 10:08:52 -0700, Tony Luck wrote:
> KVM apparently passes a machine check into the guest. Though it seems
> to be missing the MCG_STATUS information to tell the guest whether this
> is an "Action Required" machine check, or an "Action Optional" (i.e.
> whether the poison was found synchronously by execution of the current
> instruction, or asynchronously).

The KVM_X86_SET_MCE ioctl takes a struct kvm_x86_mce parameter, which the hypervisor can fill in with the necessary semantics:

    #ifdef KVM_CAP_MCE
    /* x86 MCE */
    struct kvm_x86_mce {
            __u64 status;
            __u64 addr;
            __u64 misc;
            __u64 mcg_status;
            __u8 bank;
            __u8 pad1[7];
            __u64 pad2[3];
    };
    #endif

> > Are we documenting somewhere: "if your process gets a SIGBUS and this
> > and that, which means your page got offlined, you should do this and
> > that to recover"?

> Essentially it boils down to:
> SIGBUS handler gets additional data giving virtual address that has gone away

> 1) Can the application replace the lost page?
>    Use mmap(addr, MAP_FIXED, ...) to map a fresh page into the gap
>    and fill with replacement data. This case can return from SIGBUS
>    handler to re-execute failed instruction
> 2) Can the application continue in degraded mode w/o the lost page?
>    Hunt down pointers to lost page and update structures to say
>    "this data lost". Use siglongjmp() to go to preset recovery path
> 3) Can the application shut down gracefully?
>    Record details of the lost page. Inform next-of-kin. Exit.
> 4) Default - just exit

Two possible additions to these great points:

5) If for some reason the page cannot be unmapped (e.g., either because that would lose too much memory, as with hugetlbfs 1G pages, or because of a THP split failure for SHMEM THP), the kernel maintains consistent semantics (i.e., MCE SIGBUS with vaddr) for all future accesses from user space, by leaving the hwpoisoned page mapped or in the radix tree.

6) If for some reason the vaddr is not available upon the first MCE recovery and the page is unmapped, the kernel provides the correct semantics (MCE SIGBUS with vaddr) in subsequent page faults from user space accessing the same vaddr.
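As a rough illustration of how the hypervisor side could set the "Action Required" versus "Action Optional" distinction in the struct quoted above, here is a minimal sketch. Everything named here is an assumption for illustration: `struct mce_event` is a local mirror of the quoted layout (a real VMM would use `struct kvm_x86_mce` from `<linux/kvm.h>` and pass it to `ioctl(vcpu_fd, KVM_X86_SET_MCE, &mce)`), `make_mce()` is a hypothetical helper, the bit positions follow the architectural IA32_MCG_STATUS and IA32_MCi_STATUS layouts, and the error codes 0x134 and 0xc0 as well as the EIPV/RIPV mapping are one plausible encoding rather than the only valid one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the struct kvm_x86_mce layout quoted in the thread. */
struct mce_event {
    uint64_t status;
    uint64_t addr;
    uint64_t misc;
    uint64_t mcg_status;
    uint8_t  bank;
    uint8_t  pad1[7];
    uint64_t pad2[3];
};
static_assert(sizeof(struct mce_event) == 64, "layout should match the quoted struct");

/* IA32_MCG_STATUS bits. */
#define MCG_STATUS_RIPV  (1ULL << 0)   /* restart IP valid */
#define MCG_STATUS_EIPV  (1ULL << 1)   /* error IP valid */
#define MCG_STATUS_MCIP  (1ULL << 2)   /* machine check in progress */

/* IA32_MCi_STATUS bits. */
#define MCI_STATUS_VAL   (1ULL << 63)
#define MCI_STATUS_UC    (1ULL << 61)
#define MCI_STATUS_EN    (1ULL << 60)
#define MCI_STATUS_MISCV (1ULL << 59)
#define MCI_STATUS_ADDRV (1ULL << 58)
#define MCI_STATUS_S     (1ULL << 56)
#define MCI_STATUS_AR    (1ULL << 55)

/* Build an uncorrected-error event for a guest physical address.
 * action_required selects the synchronous (AR) or asynchronous (AO) flavour. */
struct mce_event make_mce(uint64_t guest_paddr, uint8_t bank, bool action_required)
{
    struct mce_event m = {0};

    m.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN |
               MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S;
    m.mcg_status = MCG_STATUS_MCIP;

    if (action_required) {
        /* Poison consumed by the current instruction. */
        m.status |= MCI_STATUS_AR | 0x134;   /* assumed MCACOD: data load */
        m.mcg_status |= MCG_STATUS_EIPV;
    } else {
        /* Poison found asynchronously, e.g. by a scrubber. */
        m.status |= 0xc0;                    /* assumed MCACOD: memory scrubbing */
        m.mcg_status |= MCG_STATUS_RIPV;
    }

    m.addr = guest_paddr;
    m.misc = (2ULL << 6) | 12;   /* physical address mode, 4 KiB granularity */
    m.bank = bank;
    return m;
}
```

A caller would build the event with something like `make_mce(gpa, 9, true)` (bank 9 is an arbitrary choice here) and copy the fields into the real ioctl argument.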
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Wed, Apr 14, 2021 at 07:46:49AM -0700, Jue Wang wrote:
> I can see this is useful in other types of domains, e.g., on multi-tenant
> cloud servers where many VMs are collocated on the same host,
> with proper recovery + live migration, a single MCE would only affect a
> single VM at most.
>
> Another type of generic use case may be services that can tolerate
> abrupt crash, i.e., they periodically save checkpoints to persistent
> storage or are stateless services in nature and are managed by some
> process manager to automatically restart and resume from where the work
> was left at when crashed.

Yap, thanks for those.

So I do see a disconnect between us doing those features in the kernel and not really seeing how people use them. So this helps. I guess the VM angle will become important real soon - if not already - so hopefully we'll get more feedback in the future.

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Wed, Apr 14, 2021 at 6:10 AM Borislav Petkov wrote:
>
> On Tue, Apr 13, 2021 at 10:47:21PM -0700, Jue Wang wrote:
> > This path is when EPT #PF finds accesses to a hwpoisoned page and
> > sends SIGBUS to user space (KVM exits into user space) with the same
> > semantic as if regular #PF found access to a hwpoisoned page.
> >
> > The KVM_X86_SET_MCE ioctl actually injects a machine check into the guest.
> >
> > We are in process to launch a product with MCE recovery capability in
> > a KVM based virtualization product and plan to expand the scope of the
> > application of it in the near future.
>
> Any pointers to code or is this all non-public? Any text on what that
> product does with the MCEs?

These are non-public at this point. User-facing docs and a blog post are expected to be released towards the launch (i.e., in 3-4 months from now).

> > The in-memory database and analytical domain are definitely using it.
> > A couple examples:
> > SAP HANA - as we've tested and planned to launch as a strategic
> > enterprise use case with MCE recovery capability in our product
> > SQL server -
> > https://support.microsoft.com/en-us/help/2967651/inf-sql-server-may-display-memory-corruption-and-recovery-errors
>
> Aha, so they register callbacks for the processes to exec on a memory
> error. Good to know, thanks for those.

My other 2 cents:

I can see this is useful in other types of domains, e.g., on multi-tenant cloud servers where many VMs are collocated on the same host, with proper recovery + live migration, a single MCE would only affect a single VM at most.

Another type of generic use case may be services that can tolerate abrupt crash, i.e., they periodically save checkpoints to persistent storage or are stateless services in nature and are managed by some process manager to automatically restart and resume from where the work was left at when crashed.

Thanks,
-Jue

> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Tue, Apr 13, 2021 at 10:47:21PM -0700, Jue Wang wrote:
> This path is when EPT #PF finds accesses to a hwpoisoned page and
> sends SIGBUS to user space (KVM exits into user space) with the same
> semantic as if regular #PF found access to a hwpoisoned page.
>
> The KVM_X86_SET_MCE ioctl actually injects a machine check into the guest.
>
> We are in process to launch a product with MCE recovery capability in
> a KVM based virtualization product and plan to expand the scope of the
> application of it in the near future.

Any pointers to code or is this all non-public? Any text on what that product does with the MCEs?

> The in-memory database and analytical domain are definitely using it.
> A couple examples:
> SAP HANA - as we've tested and planned to launch as a strategic
> enterprise use case with MCE recovery capability in our product
> SQL server -
> https://support.microsoft.com/en-us/help/2967651/inf-sql-server-may-display-memory-corruption-and-recovery-errors

Aha, so they register callbacks for the processes to exec on a memory error. Good to know, thanks for those.

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Tue, Apr 13, 2021 at 04:13:03PM +, Luck, Tony wrote:
> Even if no applications ever do anything with it, it is still useful to avoid
> crashing the whole system and just terminate one application/guest.

True.

> There's one more item on my long term TODO list. Add fixups so that
> copy_to_user() from poison in the page cache doesn't crash, but just
> checks to see if the page was clean ... in which case re-read from the
> filesystem into a different physical page and retire the old page ... the
> read can now succeed. If the page is dirty, then fail the read (and retire
> the page ... need to make sure filesystem knows the data for the page
> was lost so subsequent reads return -EIO or something).

Makes sense.

> Page cache occupies enough memory that it is a big enough
> source of system crashes that could be avoided. I'm not sure
> if there are any other obvious cases after this ... it all gets into
> diminishing returns ... not really worth it to handle a case that
> only occupies 0.2% of memory.

Ack.

> See above. With core counts continuing to increase, the cloud service
> providers really want to see fewer events that crash the whole physical
> machine (taking down dozens, or hundreds, of guest VMs).

Yap.

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Tue, 13 Apr 2021 12:07:22 +0200, Petkov, Borislav wrote:
>> KVM apparently passes a machine check into the guest.

> Ah, there it is:

> static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *tsk)
> {
>         send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, PAGE_SHIFT, tsk);
> }

This path is when EPT #PF finds accesses to a hwpoisoned page and sends SIGBUS to user space (KVM exits into user space) with the same semantic as if regular #PF found access to a hwpoisoned page.

The KVM_X86_SET_MCE ioctl actually injects a machine check into the guest.

We are in process to launch a product with MCE recovery capability in a KVM based virtualization product and plan to expand the scope of the application of it in the near future.

> So what I'm missing with all this fun is, yeah, sure, we have this
> facility out there but who's using it? Is anyone even using it at all?

The in-memory database and analytical domain are definitely using it.
A couple examples:
SAP HANA - as we've tested and planned to launch as a strategic enterprise use case with MCE recovery capability in our product
SQL server - https://support.microsoft.com/en-us/help/2967651/inf-sql-server-may-display-memory-corruption-and-recovery-errors

Cheers,
-Jue
RE: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
> So what I'm missing with all this fun is, yeah, sure, we have this
> facility out there but who's using it? Is anyone even using it at all?

Even if no applications ever do anything with it, it is still useful to avoid crashing the whole system and just terminate one application/guest.

> If so, does it even make sense, does it need improvements, etc?

There's one more item on my long term TODO list. Add fixups so that copy_to_user() from poison in the page cache doesn't crash, but just checks to see if the page was clean ... in which case re-read from the filesystem into a different physical page and retire the old page ... the read can now succeed. If the page is dirty, then fail the read (and retire the page ... need to make sure filesystem knows the data for the page was lost so subsequent reads return -EIO or something).

Page cache occupies enough memory that it is a big enough source of system crashes that could be avoided. I'm not sure if there are any other obvious cases after this ... it all gets into diminishing returns ... not really worth it to handle a case that only occupies 0.2% of memory.

> Because from where I stand it all looks like we do all these fancy
> recovery things but is userspace even paying attention or using them or
> whatever...

See above. With core counts continuing to increase, the cloud service providers really want to see fewer events that crash the whole physical machine (taking down dozens, or hundreds, of guest VMs).

-Tony
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Thu, Apr 08, 2021 at 10:08:52AM -0700, Luck, Tony wrote:
> Also not clear to me either ... but sending a SIGBUS to a kthread isn't
> going to do anything useful. So avoiding doing that is another worthy
> goal.

Ack.

> With these patches nothing gets killed when kernel touches user poison.
> If this is in a regular system call then these patches will return EFAULT
> to the user (but now that I see EHWPOISON exists that looks like a better
> choice - so applications can distinguish the "I just used an invalid address
> in a parameter to a syscall" from "This isn't my fault, the memory broke").

Yap, makes sense.

> KVM apparently passes a machine check into the guest.

Ah, there it is:

static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *tsk)
{
        send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, PAGE_SHIFT, tsk);
}

> Though it seems to be missing the MCG_STATUS information to tell
> the guest whether this is an "Action Required" machine check, or an
> "Action Optional" (i.e. whether the poison was found synchronously by
> execution of the current instruction, or asynchronously).

Yeah, someone needs to deal with that sooner or later, considering how virt is becoming ubiquitous.

> There is the ancient Documentation/vm/hwpoison.rst from 2009 ... nothing
> seems wrong in that, but could use some updates.

Ah yap, that.

So what I'm missing with all this fun is, yeah, sure, we have this facility out there but who's using it? Is anyone even using it at all?

If so, does it even make sense, does it need improvements, etc?

Because from where I stand it all looks like we do all these fancy recovery things but is userspace even paying attention or using them or whatever...

> I don't know how much detail we might want to go into on recovery
> strategies for applications.

Perhaps an example or two how userspace is supposed to use it...

> In terms of production s/w there was one ISV who prototyped recovery
> for their application but last time I checked didn't enable it in the
> production version.

See, one and it hasn't even enabled it. So it all feels like a lot of wasted energy to me to do all those and keep 'em working. But maybe I'm missing the big picture and someone would come and say, no, Boris, we use this and this and that is our feedback...

> Essentially it boils down to:
> SIGBUS handler gets additional data giving virtual address that has gone away
>
> 1) Can the application replace the lost page?
>    Use mmap(addr, MAP_FIXED, ...) to map a fresh page into the gap
>    and fill with replacement data. This case can return from SIGBUS
>    handler to re-execute failed instruction
> 2) Can the application continue in degraded mode w/o the lost page?
>    Hunt down pointers to lost page and update structures to say
>    "this data lost". Use siglongjmp() to go to preset recovery path
> 3) Can the application shut down gracefully?
>    Record details of the lost page. Inform next-of-kin. Exit.
> 4) Default - just exit

Right and this should probably start with "Hey userspace folks, here's what you can do when you get a hwpoison signal and here's how we envision this recovery to work" and then we can all discuss and converge on an agreeable solution which is actually used and there will be selftests and so on and so on...

But what the h*ll do I know.

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
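For readers wondering what option 1 in the quoted list might look like in practice, here is a minimal userspace sketch. It is an illustration under assumptions, not code from the patch series: the names `replace_lost_range()`, `install_sigbus_handler()`, and the `region_*` variables are hypothetical, the application is assumed to hold a backup copy of one private anonymous region, and calling mmap() and memcpy() from a signal handler is assumed to be acceptable here because the list above suggests exactly that.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical application state: one recoverable region plus a backup
 * copy of its contents kept elsewhere. */
static char *region_base;
static size_t region_len;
static const char *region_backup;

/* Map fresh anonymous memory over [addr, addr + len) and refill it from
 * src. addr and len must be page aligned. Returns 0 on success. */
int replace_lost_range(void *addr, size_t len, const void *src)
{
    void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    memcpy(p, src, len);
    return 0;
}

static void on_sigbus(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)ctx;

    /* Only hardware memory errors are handled; anything else exits. */
    if (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO)
        _exit(1);

    /* si_addr_lsb is log2 of the size of the range that was lost. */
    size_t len = (size_t)1 << si->si_addr_lsb;
    uintptr_t start = (uintptr_t)si->si_addr & ~((uintptr_t)len - 1);
    uintptr_t base = (uintptr_t)region_base;

    /* Option 4: the lost range is not one we know how to rebuild. */
    if (start < base || start + len > base + region_len)
        _exit(1);

    if (replace_lost_range((void *)start, len,
                           region_backup + (start - base)) != 0)
        _exit(1);
    /* Returning re-executes the failed instruction on the fresh page. */
}

/* Register the region and its backup, then install the handler. */
int install_sigbus_handler(char *base, size_t len, const char *backup)
{
    struct sigaction sa;

    assert(base != NULL && backup != NULL);
    region_base = base;
    region_len = len;
    region_backup = backup;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

The sketch only covers private anonymous memory. A file-backed or shared mapping would need a different replacement strategy, and a range larger than one page (for example a huge page) makes the backup requirement correspondingly larger.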
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Thu, Apr 08, 2021 at 10:49:58AM +0200, Borislav Petkov wrote:
> On Wed, Apr 07, 2021 at 02:43:10PM -0700, Luck, Tony wrote:
> > On Wed, Apr 07, 2021 at 11:18:16PM +0200, Borislav Petkov wrote:
> > > On Thu, Mar 25, 2021 at 05:02:34PM -0700, Tony Luck wrote:
> > > > Andy Lutomirski pointed out that sending SIGBUS to tasks that
> > > > hit poison in the kernel copying syscall parameters from user
> > > > address space is not the right semantic.
> > >
> > > What does that mean exactly?
> >
> > Andy said that a task could check a memory range for poison by
> > doing:
> >
> >         ret = write(fd, buf, size);
> >         if (ret == size) {
> >                 memory range is all good
> >         }
> >
> > That doesn't work if the kernel sends a SIGBUS.
> >
> > It doesn't seem a likely scenario ... but Andy is correct that
> > the above ought to work.
>
> We need to document properly what this is aiming to fix. He said
> something yesterday along the lines of kthread_use_mm() hitting a SIGBUS
> when a kthread "attaches" to an address space. I'm still unclear as to
> how exactly that happens - there are only a handful of kthread_use_mm()
> users in the tree...

Also not clear to me either ... but sending a SIGBUS to a kthread isn't going to do anything useful. So avoiding doing that is another worthy goal.

> > Yes. This is for kernel reading memory belonging to "current" task.
>
> Provided "current" is really the task to which the poison page belongs.
> That kthread_use_mm() thing sounded like the wrong task gets killed. But that
> needs more details.

With these patches nothing gets killed when kernel touches user poison. If this is in a regular system call then these patches will return EFAULT to the user (but now that I see EHWPOISON exists that looks like a better choice - so applications can distinguish the "I just used an invalid address in a parameter to a syscall" from "This isn't my fault, the memory broke").

> > Same in that the page gets unmapped. Different in that there
> > is no SIGBUS if the kernel did the access for the user.
>
> What is even the actual use case with sending tasks SIGBUS on poison
> consumption? KVM? Others?

KVM apparently passes a machine check into the guest. Though it seems to be missing the MCG_STATUS information to tell the guest whether this is an "Action Required" machine check, or an "Action Optional" (i.e. whether the poison was found synchronously by execution of the current instruction, or asynchronously).

> Are we documenting somewhere: "if your process gets a SIGBUS and this
> and that, which means your page got offlined, you should do this and
> that to recover"?

There is the ancient Documentation/vm/hwpoison.rst from 2009 ... nothing seems wrong in that, but could use some updates.

I don't know how much detail we might want to go into on recovery strategies for applications. In terms of production s/w there was one ISV who prototyped recovery for their application but last time I checked didn't enable it in the production version.

Essentially it boils down to:
SIGBUS handler gets additional data giving virtual address that has gone away

1) Can the application replace the lost page?
   Use mmap(addr, MAP_FIXED, ...) to map a fresh page into the gap
   and fill with replacement data. This case can return from SIGBUS
   handler to re-execute failed instruction
2) Can the application continue in degraded mode w/o the lost page?
   Hunt down pointers to lost page and update structures to say
   "this data lost". Use siglongjmp() to go to preset recovery path
3) Can the application shut down gracefully?
   Record details of the lost page. Inform next-of-kin. Exit.
4) Default - just exit

-Tony
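Option 2 in the list above (continue in degraded mode via siglongjmp()) could be sketched roughly as follows. This is an illustration only: `guarded_access()`, `install_sigbus_recovery()`, `read_first_byte()`, and `simulate_plain_sigbus()` are hypothetical names, the code assumes a single-threaded caller, and because real poison cannot be produced on demand the stand-in accessor raises an ordinary SIGBUS, which the code reports separately from a hardware memory error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

enum access_result { ACCESS_OK = 0, ACCESS_POISONED = 1, ACCESS_OTHER_SIGBUS = 2 };

static sigjmp_buf recovery_point;
static volatile sig_atomic_t recovery_armed;
static volatile sig_atomic_t last_si_code;
static void *volatile last_si_addr;

static void on_sigbus(int sig, siginfo_t *si, void *ctx)
{
    (void)ctx;
    if (!recovery_armed) {
        /* No preset recovery path: fall back to the default action. */
        signal(sig, SIG_DFL);
        raise(sig);
        return;
    }
    last_si_code = si->si_code;
    last_si_addr = si->si_addr;
    siglongjmp(recovery_point, 1);   /* jump to the preset recovery path */
}

int install_sigbus_recovery(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Address reported by the most recent SIGBUS caught by guarded_access(). */
void *last_fault_addr(void)
{
    return last_si_addr;
}

/* Run access(arg) with a recovery path armed. On ACCESS_POISONED the caller
 * is expected to mark the data at last_fault_addr() as lost and carry on. */
enum access_result guarded_access(void (*access)(void *), void *arg)
{
    /* savemask = 1 so SIGBUS is unblocked again after the jump. */
    if (sigsetjmp(recovery_point, 1) != 0) {
        recovery_armed = 0;
        if (last_si_code == BUS_MCEERR_AR || last_si_code == BUS_MCEERR_AO)
            return ACCESS_POISONED;
        return ACCESS_OTHER_SIGBUS;
    }
    recovery_armed = 1;
    access(arg);
    recovery_armed = 0;
    return ACCESS_OK;
}

/* Example accessor: touch one byte of application data. */
void read_first_byte(void *p)
{
    (void)*(volatile unsigned char *)p;
}

/* Stand-in for a fault: raises a plain SIGBUS (not a memory error). */
void simulate_plain_sigbus(void *unused)
{
    (void)unused;
    raise(SIGBUS);
}
```

Compared with option 1, this approach does not need replacement data, but every structure that pointed into the lost page has to be updated by the caller before the data is touched again.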
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Wed, Apr 07, 2021 at 02:43:10PM -0700, Luck, Tony wrote:
> On Wed, Apr 07, 2021 at 11:18:16PM +0200, Borislav Petkov wrote:
> > On Thu, Mar 25, 2021 at 05:02:34PM -0700, Tony Luck wrote:
> > > Andy Lutomirski pointed out that sending SIGBUS to tasks that
> > > hit poison in the kernel copying syscall parameters from user
> > > address space is not the right semantic.
> >
> > What does that mean exactly?
>
> Andy said that a task could check a memory range for poison by
> doing:
>
>         ret = write(fd, buf, size);
>         if (ret == size) {
>                 memory range is all good
>         }
>
> That doesn't work if the kernel sends a SIGBUS.
>
> It doesn't seem a likely scenario ... but Andy is correct that
> the above ought to work.

We need to document properly what this is aiming to fix. He said something yesterday along the lines of kthread_use_mm() hitting a SIGBUS when a kthread "attaches" to an address space. I'm still unclear as to how exactly that happens - there are only a handful of kthread_use_mm() users in the tree...

> Yes. This is for kernel reading memory belonging to "current" task.

Provided "current" is really the task to which the poison page belongs. That kthread_use_mm() thing sounded like the wrong task gets killed. But that needs more details.

> Same in that the page gets unmapped. Different in that there
> is no SIGBUS if the kernel did the access for the user.

What is even the actual use case with sending tasks SIGBUS on poison consumption? KVM? Others?

Are we documenting somewhere: "if your process gets a SIGBUS and this and that, which means your page got offlined, you should do this and that to recover"?

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Wed, Apr 07, 2021 at 11:18:16PM +0200, Borislav Petkov wrote:
> On Thu, Mar 25, 2021 at 05:02:34PM -0700, Tony Luck wrote:
> > Andy Lutomirski pointed out that sending SIGBUS to tasks that
> > hit poison in the kernel copying syscall parameters from user
> > address space is not the right semantic.
>
> What does that mean exactly?

Andy said that a task could check a memory range for poison by doing:

        ret = write(fd, buf, size);
        if (ret == size) {
                memory range is all good
        }

That doesn't work if the kernel sends a SIGBUS.

It doesn't seem a likely scenario ... but Andy is correct that the above ought to work.

> From looking at the code, that is this conditional:
>
>         if (t == EX_HANDLER_UACCESS && regs && is_copy_from_user(regs)) {
>                 m->kflags |= MCE_IN_KERNEL_RECOV;
>                 m->kflags |= MCE_IN_KERNEL_COPYIN;
>
> so what does the above have to do with syscall params?

Most "copy from user" instances are the result of a system call parameter (e.g. "buf" in the write(2) example above).

> If it is about us being in ring 0 and touching user memory and eating
> poison in same *user* memory while doing so, then sure, that makes
> sense.

Yes. This is for kernel reading memory belonging to "current" task.

> > So stop doing that. Add a new kill_me_never() call back that
> > simply unmaps and offlines the poison page.
>
> Right, that's the same as handling poisoned user memory.

Same in that the page gets unmapped. Different in that there is no SIGBUS if the kernel did the access for the user.

-Tony
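The write(2) check described above can be fleshed out into a small helper. This is a sketch under the behaviour discussed in this thread (an error return instead of SIGBUS); `probe_range()` and the `probe_result` values are hypothetical names, and whether the kernel reports EFAULT or EHWPOISON for poison is left open here just as it is in the discussion. One practical assumption: `fd` must refer to something for which the kernel actually copies the data, such as a pipe or a regular file, since a write to /dev/null does not read the buffer at all.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

enum probe_result {
    PROBE_OK,        /* the whole range was copied by the kernel */
    PROBE_SHORT,     /* only part of the range was written */
    PROBE_FAULT,     /* EFAULT: bad address (or poison, per these patches) */
    PROBE_HWPOISON,  /* EHWPOISON, if the kernel reports it */
    PROBE_ERROR      /* some unrelated write error */
};

/* Ask the kernel to copy [buf, buf + size) by writing it to fd, and
 * classify the outcome instead of relying on a signal. */
enum probe_result probe_range(int fd, const void *buf, size_t size)
{
    ssize_t ret;

    assert(fd >= 0);

    /* Retry if the call was interrupted before any data was written. */
    do {
        ret = write(fd, buf, size);
    } while (ret < 0 && errno == EINTR);

    if (ret >= 0)
        return (size_t)ret == size ? PROBE_OK : PROBE_SHORT;
#ifdef EHWPOISON
    if (errno == EHWPOISON)
        return PROBE_HWPOISON;
#endif
    if (errno == EFAULT)
        return PROBE_FAULT;
    return PROBE_ERROR;
}
```

With a plain EFAULT the caller cannot tell an invalid pointer from broken memory, which is the distinction the EHWPOISON suggestion is meant to provide.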
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Thu, Mar 25, 2021 at 05:02:34PM -0700, Tony Luck wrote:
> Andy Lutomirski pointed out that sending SIGBUS to tasks that
> hit poison in the kernel copying syscall parameters from user
> address space is not the right semantic.

What does that mean exactly?

From looking at the code, that is this conditional:

        if (t == EX_HANDLER_UACCESS && regs && is_copy_from_user(regs)) {
                m->kflags |= MCE_IN_KERNEL_RECOV;
                m->kflags |= MCE_IN_KERNEL_COPYIN;

so what does the above have to do with syscall params?

If it is about us being in ring 0 and touching user memory and eating poison in same *user* memory while doing so, then sure, that makes sense.

> So stop doing that. Add a new kill_me_never() call back that
> simply unmaps and offlines the poison page.

Right, that's the same as handling poisoned user memory.

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette