Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-23 Thread Shuai Xue
On 2022/10/22 12:30 AM, Luck, Tony wrote: >>> But maybe it is some RMW instruction ... then, if all the above options >>> didn't happen ... we >>> could get another machine check from the same address. But then we just >>> follow the usual >>> recovery path. > > >> Let assume the instruction

RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-21 Thread Luck, Tony
>> But maybe it is some RMW instruction ... then, if all the above options >> didn't happen ... we >> could get another machine check from the same address. But then we just >> follow the usual >> recovery path. > Let assume the instruction that cause the COW is in the 63/64 case, aka, > it is

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-21 Thread Shuai Xue
On 2022/10/21 12:41 PM, Luck, Tony wrote: >>> When we do return to user mode the task is going to be busy servicing >>> a SIGBUS ... so shouldn't try to touch the poison page before the >>> memory_failure() called by the worker thread cleans things up. >> >> What about an RT process on a busy

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-21 Thread Shuai Xue
On 2022/10/21 12:08 PM, Tony Luck wrote: > On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote: >> >> >> On 2022/10/21 4:05 AM, Tony Luck wrote: >>> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: On 2022/10/20 1:08 AM, Tony Luck wrote: > >>> I'm experimenting with using

RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Luck, Tony
>> When we do return to user mode the task is going to be busy servicing >> a SIGBUS ... so shouldn't try to touch the poison page before the >> memory_failure() called by the worker thread cleans things up. > > What about an RT process on a busy system? > The worker threads are pretty low

RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread David Laight
From: Tony Luck > Sent: 21 October 2022 05:08 > When we do return to user mode the task is going to be busy servicing > a SIGBUS ... so shouldn't try to touch the poison page before the > memory_failure() called by the worker thread cleans things up. What about an RT process on a busy

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Tony Luck
On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote: > > > On 2022/10/21 4:05 AM, Tony Luck wrote: > > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: > >> > >> > >> On 2022/10/20 1:08 AM, Tony Luck wrote: > > I'm experimenting with using sched_work() to handle the call to > >

RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Luck, Tony
>> +INIT_WORK(&p->work, do_sched_memory_failure); >> +p->pfn = pfn; >> +schedule_work(&p->work); > > There is already memory_failure_queue() that can do this. Can we use it > directly? Miaohe Lin, Yes, can use that. A thousand thanks for pointing it out. I just tried it, and it works

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Shuai Xue
On 2022/10/21 4:05 AM, Tony Luck wrote: > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: >> >> >> On 2022/10/20 1:08 AM, Tony Luck wrote: >>> If the kernel is copying a page as the result of a copy-on-write >>> fault and runs into an uncorrectable error, Linux will crash because >>> it does

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Miaohe Lin
On 2022/10/21 4:05, Tony Luck wrote: > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: >> >> >> On 2022/10/20 1:08 AM, Tony Luck wrote: >>> If the kernel is copying a page as the result of a copy-on-write >>> fault and runs into an uncorrectable error, Linux will crash because >>> it does

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Tony Luck
On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: > > > On 2022/10/20 1:08 AM, Tony Luck wrote: > > If the kernel is copying a page as the result of a copy-on-write > > fault and runs into an uncorrectable error, Linux will crash because > > it does not have recovery code for this case where

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Shuai Xue
On 2022/10/20 1:08 AM, Tony Luck wrote: > If the kernel is copying a page as the result of a copy-on-write > fault and runs into an uncorrectable error, Linux will crash because > it does not have recovery code for this case where poison is consumed > by the kernel. > > It is easy to set up a test

RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-19 Thread Luck, Tony
> Given there is no use case for the residue value returned by > copy_mc_to_kernel() perhaps just return EHWPOISON directly from > copyuser_highpage_mc() in the short-copy case? I don't think it hurts to keep the return value as residue count. It isn't making that code any more complex and could

RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-19 Thread Dan Williams
Tony Luck wrote: > If the kernel is copying a page as the result of a copy-on-write > fault and runs into an uncorrectable error, Linux will crash because > it does not have recovery code for this case where poison is consumed > by the kernel. > > It is easy to set up a test case. Just inject an