RE: [PATCH] x86/mce: Clean up TP_printk() output line of the mce_record tracepoint
>> - While at it, don't forget to:
>>
>> s/ADDR/MISC/SYND
>> /addr/misc/synd
>
> These are actually acronyms for Address, Miscellaneous and Syndrome
> registers.

They look like abbreviations, not acronyms, to me.

-Tony
RE: [PATCH v4 2/2] tracing: Include Microcode Revision in mce_record tracepoint
> Export microcode version through the tracepoint to prevent ambiguity over
> the active version on the system when the MCE was received.
>
> Signed-off-by: Avadhut Naik
> Reviewed-by: Sohil Mehta
> Reviewed-by: Steven Rostedt (Google)

Reviewed-by: Tony Luck
RE: [PATCH v4 1/2] tracing: Include PPIN in mce_record tracepoint
> Export PPIN through the tracepoint as it provides a unique identifier for
> the system (or socket in case of multi-socket systems) on which the MCE
> has been received.
>
> Signed-off-by: Avadhut Naik
> Reviewed-by: Sohil Mehta
> Reviewed-by: Steven Rostedt (Google)

Reviewed-by: Tony Luck
RE: [PATCH v2 0/2] Update mce_record tracepoint
> But no, that's not the right question to ask.
>
> It is rather: which bits of information are very relevant to an error
> record and which are transient enough so that they cannot be gathered
> from a system by other means or only gathered in a difficult way, and
> should be part of that record.
>
> The PPIN is not transient but you have to go map ->extcpu to the PPIN so
> adding it to the tracepoint is purely a convenience thing. More or less.
>
> The microcode revision thing I still don't buy but it is already there
> so whateva...
>
> So we'd need a rule hammered out and put there in a prominent place so
> that it is clear what goes into struct mce and what not.

My personal evaluation of the value of these two additions to the trace
record:

PPIN: Nice to have. People that send stuff to me are terrible about
providing surrounding details. The record already includes CPUID(1).EAX
... so I can at least skip the step of asking them which CPU
family/model/stepping they were using. But PPIN can be recovered (so
long as the submitter kept good records about which system generated
the record).

MICROCODE: Must have. Microcode version can be changed at run time.
Going back to the system to check later may not give the correct answer
to what was active at the time of the error. Especially for an error
reported while a microcode update is walking across the CPUs poking the
MSR on each in turn.

-Tony
RE: [PATCH v2 0/2] Update mce_record tracepoint
> > Is it so very different to add this to a trace record so that rasdaemon
> > can have feature parity with mcelog(8)?
>
> I knew you were gonna say that. When someone decides that it is
> a splendid idea to add more stuff to struct mce then said someone would
> want it in the tracepoint too.
>
> And then we're back to my original question:
>
> "And where does it end? Stick full dmesg in the tracepoint too?"
>
> Where do you draw the line in the sand and say, no more, especially
> static, fields bloating the trace record should be added and from then
> on, you should go collect the info from that box. Something which you're
> supposed to do anyway.

Every patch that adds new code or data structures adds to the kernel
memory footprint. Each should be considered on its merits. The basic
question being: "Is the new functionality worth the cost?"

Where does it end? It would end if Linus declared: "Linux is now
complete. Stop sending patches". I.e. it is never going to end. If
somebody posts a patch asking to add the full dmesg to a tracepoint,
I'll stand with you to say: "Not only no, but hell no".

So for Naik's two patches we have:

1) PPIN

   Cost = 8 bytes.
   Benefit: Embeds a system identifier into the trace record so there
   can be no ambiguity about which machine generated this error. Also
   definitively indicates which socket on a multi-socket system.

2) MICROCODE

   Cost = 4 bytes.
   Benefit: Certainty about the microcode version active on the core at
   the time the error was detected.

RAS = Reliability, Availability, Serviceability. These changes fall
into the serviceability bucket. They make it easier to diagnose what
went wrong.

-Tony
RE: [PATCH v2 0/2] Update mce_record tracepoint
> > You've spent enough time with Ashok and Thomas tweaking the Linux
> > microcode driver to know that going back to the machine the next day
> > to ask about microcode version has a bunch of ways to get a wrong
> > answer.
>
> Huh, what does that have to do with this?

If deployment of a microcode update across a fleet always went
flawlessly, life would be simpler. But things can fail. And maybe the
failure wasn't noticed. Perhaps a node was rebooting when the sysadmin
pushed the update to the fleet and so missed the deployment. Perhaps
one core was already acting weird and the microcode update didn't get
applied to that core.

> IIUC, if someone changes something on the system, whether that is
> updating microcode or swapping a harddrive or swapping memory or
> whatever, that invalidates the errors reported, pretty much.
>
> You can't put it all in the trace record, you just can't.

Swapping a hard drive, or hot plugging a NIC, isn't very likely to
correlate with an error reported by the CPU in machine check banks. But
microcode can be (and has been) the issue in enough cases that knowing
the version at the time of the error matters.

You seemed to agree with this argument when the microcode field was
added to "struct mce" back in 2018:

  fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check records")

Is it so very different to add this to a trace record so that rasdaemon
can have feature parity with mcelog(8)?

-Tony
RE: [PATCH v2 0/2] Update mce_record tracepoint
> > 8 bytes for PPIN, 4 more for microcode.
>
> I know, nothing leads to bloat like 0.01% here, 0.001% there...

12 extra bytes divided by (say) 64GB (a very small server these days,
my laptop has that much) = 0.0000000175%

We will need 57000 changes like this one before we get to 0.001% :-)

> > Number of recoverable machine checks per system I hope the
> > monthly rate should be countable on my fingers...
>
> That's not the point. Rather, when you look at MCE reports, you pretty
> much almost always go and collect additional information from the target
> machine because you want to figure out what exactly is going on.
>
> So what's stopping you from collecting all that static information
> instead of parroting it through the tracepoint with each error?

PPIN is static. So with good tracking to keep source platform
information attached to the error record as it gets passed around among
people trying to triage the problem, you could say it can be retrieved
later (or earlier, when setting up a database of attributes of each
machine in the fleet). But the key there is keeping the details of the
source machine attached to the error record. My first contact with
machine check debugging is always just the raw error record (from
mcelog, rasdaemon, or console log).

> > PPIN is useful when talking to the CPU vendor about patterns of
> > similar errors seen across a cluster.
>
> I guess that is perhaps the only thing of the two that makes some sense
> at least - the identifier uniquely describes which CPU the error comes
> from...
>
> > MICROCODE - gives a fast path to root cause problems that have already
> > been fixed in a microcode update.
>
> But that, nah. See above.

Knowing which microcode version was loaded on a core *at the time of
the error* is critical. You've spent enough time with Ashok and Thomas
tweaking the Linux microcode driver to know that going back to the
machine the next day to ask about microcode version has a bunch of ways
to get a wrong answer.

-Tony
RE: [PATCH v2 0/2] Update mce_record tracepoint
> > The first patch adds PPIN (Protected Processor Inventory Number) field to
> > the tracepoint.
> >
> > The second patch adds the microcode field (Microcode Revision) to the
> > tracepoint.
>
> This is a lot of static information to add to *every* MCE.

8 bytes for PPIN, 4 more for microcode.

The number of recoverable machine checks per system per month should, I
hope, be countable on my fingers. If a system is getting more than
that, then people should be looking at fixing the underlying problem.

Corrected errors are much more common, though Linux takes action to
limit the rate when storms occur. So maybe hundreds or small numbers of
thousands of error trace records? The increase in trace buffer
consumption is still measured in Kbytes, not Mbytes. Server systems
that do machine check reporting now start at tens of GBytes of memory.

> And where does it end? Stick full dmesg in the tracepoint too?

Seems like overkill.

> What is the real-life use case here?

Systems using rasdaemon to track errors will be able to track both of
these (I assume that Naik has plans to update rasdaemon to capture and
save these new fields).

PPIN is useful when talking to the CPU vendor about patterns of similar
errors seen across a cluster.

MICROCODE gives a fast path to root cause problems that have already
been fixed in a microcode update.

-Tony
RE: [PATCH] printk: add cpu id information to printk() output
> + return in_task() ? task_pid_nr(current) | (smp_processor_id() << CPU_ID_SHIFT) :

There are contexts and CONFIG options around preemption where
smp_processor_id() will throw a warning. Use raw_smp_processor_id().

-Tony
RE: [PATCH] EDAC/mc_sysfs: refactor deprecated strncpy
> `strncpy` is deprecated for use on NUL-terminated destination strings [1].
>
> We should prefer more robust and less ambiguous string interfaces.
>
> A suitable replacement is `strscpy_pad` [2] due to the fact that it
> guarantees NUL-termination on the destination buffer whilst maintaining
> the NUL-padding behavior that `strncpy` provides. This may not be
> strictly necessary but as I couldn't understand what this code does I
> wanted to ensure that the functionality is the same.
>
> Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
> Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
> Link: https://github.com/KSPP/linux/issues/90
> Cc: linux-harden...@vger.kernel.org
> Signed-off-by: Justin Stitt
> ---
> Note: build-tested only.
> ---
>  drivers/edac/edac_mc_sysfs.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/edac/edac_mc_sysfs.c b/drivers/edac/edac_mc_sysfs.c
> index 15f63452a9be..b303309a63cf 100644
> --- a/drivers/edac/edac_mc_sysfs.c
> +++ b/drivers/edac/edac_mc_sysfs.c
> @@ -229,8 +229,7 @@ static ssize_t channel_dimm_label_store(struct device *dev,
>  	if (copy_count == 0 || copy_count >= sizeof(rank->dimm->label))
>  		return -EINVAL;
>
> -	strncpy(rank->dimm->label, data, copy_count);
> -	rank->dimm->label[copy_count] = '\0';
> +	strscpy_pad(rank->dimm->label, data, copy_count);

That doc page says the problem with strncpy() is that it doesn't
guarantee to NUL terminate the target string. But this code is aware of
that limitation and zaps a '\0' at the end to be sure. So this code
doesn't suffer from the potential problems.

If it is going to be fixed, then some further analysis of the original
code would be wise. Just replacing with strscpy_pad() means the code
probably still suffers from the "needless performance penalty" also
mentioned in the deprecation document.

-Tony
RE: [PATCH v7] x86/mce: retrieve poison range from hardware
> What I'm missing from this text here is, what *is* the mce->misc LSB
> field in human speak? What does that field denote?

The SDM says:

  Recoverable Address LSB (bits 5:0): The lowest valid recoverable
  address bit. Indicates the position of the least significant bit
  (LSB) of the recoverable error address. For example, if the processor
  logs bits [43:9] of the address, the LSB sub-field in IA32_MCi_MISC
  is 01001b (9 decimal). For this example, bits [8:0] of the
  recoverable error address in IA32_MCi_ADDR should be ignored.

So in human speak: "how much data did you lose". "6" is a common value
saying a cache line (1 << 6 == 64 bytes) was lost. Sometimes you see
"12" (1 << 12 == 4096) for a whole page lost.

-Tony
RE: [PATCH v5] x86/mce: retrieve poison range from hardware
> 	struct mce m;
> +	int lsb = PAGE_SHIFT;

Some maintainers like to order local declaration lines from longest to
shortest.

> +	/*
> +	 * Even if the ->validation_bits are set for address mask,
> +	 * to be extra safe, check and reject an error radius '0',
> +	 * and fallback to the default page size.
> +	 */
> +	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) {
> +		lsb = __ffs64(mem_err->physical_addr_mask);
> +		if (lsb == 1)
> +			lsb = PAGE_SHIFT;
> +	}

The comment above __ffs64() says:

  * The result is not defined if no bits are set, so check that @word
  * is non-zero before calling this.

So if the intent is "extra safe", it should check for that:

	if ((mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) &&
	    mem_err->physical_addr_mask) {
		lsb = __ffs64(mem_err->physical_addr_mask);
		if (lsb == 1)
			lsb = PAGE_SHIFT;
	}

-Tony
Re: [PATCH v3] x86/mce: retrieve poison range from hardware
On Mon, Jul 18, 2022 at 09:11:33PM +, Jane Chu wrote:
> On 7/18/2022 12:22 PM, Luck, Tony wrote:
> >> It appears the kernel is trusting that ->physical_addr_mask is non-zero
> >> in other paths. So this is at least equally broken in the presence of a
> >> broken BIOS. The impact is potentially larger though with this change,
> >> so it might be a good follow-on patch to make sure that
> >> ->physical_addr_mask gets fixed up to a minimum mask value.
> >
> > Agreed. Separate patch to sanitize early, so other kernel code can
> > just use it.
>
> Is it possible that with
>   if (mem->validation_bits & CPER_MEM_VALID_PA_MASK)
> the ->physical_addr_mask is still untrustworthy?

The validation_bits just show which fields the BIOS *says* it filled
in. If a validation bit isn't set, then Linux should certainly ignore
that field. But if it is set, then Linux needs to decide whether to use
the value, or do a sanity check first.

-Tony
RE: [PATCH v3] x86/mce: retrieve poison range from hardware
> It appears the kernel is trusting that ->physical_addr_mask is non-zero
> in other paths. So this is at least equally broken in the presence of a
> broken BIOS. The impact is potentially larger though with this change,
> so it might be a good follow-on patch to make sure that
> ->physical_addr_mask gets fixed up to a minimum mask value.

Agreed. Separate patch to sanitize early, so other kernel code can just
use it.

-Tony
RE: [PATCH v3] x86/mce: retrieve poison range from hardware
> +	m.misc = (MCI_MISC_ADDR_PHYS << 6) | __ffs64(mem_err->physical_addr_mask);

Do we want to unconditionally trust the sanity of the BIOS provided
physical_addr_mask? There's a warning comment on the kernel __ffs64()
function:

  * The result is not defined if no bits are set, so check that @word
  * is non-zero before calling this.

Otherwise, this looks like a good idea.

-Tony
Re: Phantom PMEM poison issue
On Sat, Jan 22, 2022 at 12:40:18AM +0000, Jane Chu wrote:
> On 1/21/2022 4:31 PM, Jane Chu wrote:
> > On baremetal Intel platform with DCPMEM installed and configured to
> > provision daxfs, say a poison was consumed by a load from a user thread,
> > and then daxfs takes action and clears the poison, confirmed by "ndctl
> > -NM".
> >
> > Now, depending on the luck, after some time (from a few seconds to 5+
> > hours) the ghost of the previous poison will surface, and it takes
> > unloading/reloading the libnvdimm drivers to drive the phantom poison
> > away, confirmed by ARS.
> >
> > Turns out, the issue is quite reproducible with the latest stable Linux.
> >
> > Here is the relevant console message after injecting 8 poisons in one
> > page via
> > # ndctl inject-error namespace0.0 -n 2 -B 8210
>
> There is a cut-n-paste error, the above line should be
> "# ndctl inject-error namespace0.0 -n 8 -B 8210"

You say "in one page" here. What is the page size?

> > -jane
> >
> > then, cleared them all, and waited for 5+ hours, notice the time stamp.
> > BTW, the system is idle otherwise.
> > [ 2439.742296] mce: Uncorrected hardware memory error in user-access at 1850602400
> > [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to fsdax_poison_v1:8457 due to hardware memory corruption
> > [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page: Recovered
> > [ 2439.769949] mce: [Hardware Error]: Machine check events logged
> > -1850603000 uncached-minus<->write-back
> > [ 2439.769984] x86/PAT: memtype_reserve failed [mem 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
> > [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
> > [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype [mem 0x1850602000-0x1850602fff]

This error is reported in PFN=1850602 (at offset 0x400 = 1K).

> > At this point,
> > # ndctl list -NMu -r 0
> > {
> >   "dev":"namespace0.0",
> >   "mode":"fsdax",
> >   "map":"dev",
> >   "size":"15.75 GiB (16.91 GB)",
> >   "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
> >   "sector_size":4096,
> >   "align":2097152,
> >   "blockdev":"pmem0"
> > }
> >
> > [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > [21352.001528] {2}[Hardware Error]: event severity: recoverable
> > [21352.007838] {2}[Hardware Error]: Error 0, type: recoverable
> > [21352.014156] {2}[Hardware Error]: section_type: memory error
> > [21352.020572] {2}[Hardware Error]: physical_address: 0x001850603200

This error is in the following page: PFN=1850603 (at offset 0x200 =
512b). Is that what you mean by "phantom error" ... from a different
address from those that were injected?

-Tony
RE: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
> I meant in this case (racing to access the same poisoned pages), the
> page mapping should have been removed and the hwpoison swap pte
> installed by the winner thread?

My "mutex" patch that Horiguchi-san has included in his summary series
should make this happen in most cases. The only problem is if the first
thread runs into some error and does not complete unmapping the poison
page from all other tasks. So the backup plan is to scan the page
tables.

>> Well, I did try a patch that removed *all* user mappings (switched CR3 to
>> swapper_pgdir) and returned to user. Then have the resulting page fault
>> report the address. But that didn't work very well.
>
> Curious what didn't work well in this case? :-)

Andy Lutomirski wasn't happy with the approach. It was specifically to
cover the "kernel accesses poison more than once from get_user()" case.
It doesn't generalize to the case where the user accessed the poison
(because you'll just take the #PF on the instruction fetch ...
everything is unmapped ... instead of on the original data access).

-Tony
Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
On Mon, Apr 19, 2021 at 07:03:01PM -0700, Jue Wang wrote:
> On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote:
>
> > This patch suggests to do page table walk to find the error virtual
> > address. If we find multiple virtual addresses in walking, we now can't
> > determine which one is correct, so we fall back to sending SIGBUS in
> > kill_me_maybe() without error info as we do now. This corner case needs
> > to be solved in the future.
>
> Instead of walking the page tables, I wonder what about the following idea:
>
> When failing to get vaddr, memory_failure just ensures the mapping is
> removed and an hwpoisoned swap pte is put in place; or the original page
> is flagged with PG_HWPOISONED and kept in the radix tree (e.g., for
> SHMEM THP).

To remove the mapping, you need to know the virtual address :-)

Well, I did try a patch that removed *all* user mappings (switched CR3
to swapper_pgdir) and returned to user. Then have the resulting page
fault report the address. But that didn't work very well.

> NOTE: no SIGBUS is sent to user space.
>
> Then do_machine_check just returns to user space to resume execution,
> the re-execution will result in a #PF and should land to the exact page
> fault handling code that generates a SIGBUS with the precise vaddr info:

That's how SRAO (and other races) are supposed to work.

-Tony
Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
On Mon, Apr 19, 2021 at 06:49:15PM -0700, Jue Wang wrote:
> On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote:
> ...
> > + * This function is intended to handle "Action Required" MCEs on already
> > + * hardware poisoned pages. They could happen, for example, when
> > + * memory_failure() failed to unmap the error page at the first call, or
> > + * when multiple Action Optional MCE events races on different CPUs with
> > + * Local MCE enabled.
>
> +Tony Luck
>
> Hey Tony, I thought SRAO MCEs are broadcasted to all cores in the system
> as they come without an execution context, is it correct?
>
> If Yes, Naoya, I think we might want to remove the comments about the
> "multiple Action Optional MCE racing" part.

Jue,

Correct. SRAO machine checks are broadcast. But rather than remove the
second part, just replace it with "multiple local machine checks on
different CPUs".

-Tony
RE: [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery
>> But there are places in the kernel where the code assumes that this
>> EFAULT return was simply because of a page fault. The code takes some
>> action to fix that, and then retries the access. This results in a second
>> machine check.
>
> What about return EHWPOISON instead of EFAULT and update the callers
> to handle EHWPOISON explicitly: i.e., not retry but give up on the page?

That seems like a good idea to me. But I got some pushback when I
started on this path earlier with some patches to the futex code.
Though back then I wasn't using an error return of EHWPOISON ...
possibly the code would look less hacky with that explicitly called
out.

The futex case was specifically for code using pagefault_disable().
Likely all the other callers would need to be audited (but there are
only a few dozen places, so not too big of a deal).

> My main concern is that the strong assumptions that the kernel can't hit more
> than a fixed number of poisoned cache lines before turning to user space
> may simply not be true.

Agreed.

> When DIMM goes bad, it can easily affect an entire bank or entire ram device
> chip. Even with memory interleaving, it's possible that a kernel control path
> touches lots of poisoned cache lines in the buffer it is working through.

These larger failures have other problems ... dozens of unrelated pages
may be affected. In a perfect world Linux would be told on the first
error that this is just one of many errors ... and be given a list. But
in the real world that isn't likely to happen :-(

-Tony
RE: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
> So what I'm missing with all this fun is, yeah, sure, we have this
> facility out there but who's using it? Is anyone even using it at all?

Even if no applications ever do anything with it, it is still useful to
avoid crashing the whole system and just terminate one
application/guest.

> If so, does it even make sense, does it need improvements, etc?

There's one more item on my long term TODO list. Add fixups so that
copy_to_user() from poison in the page cache doesn't crash, but just
checks to see if the page was clean ... in which case re-read from the
filesystem into a different physical page and retire the old page ...
the read can now succeed. If the page is dirty, then fail the read (and
retire the page ... need to make sure the filesystem knows the data for
the page was lost so subsequent reads return -EIO or something).

The page cache occupies enough memory that it is a big enough source of
system crashes that could be avoided. I'm not sure if there are any
other obvious cases after this ... it all gets into diminishing returns
... not really worth it to handle a case that only occupies 0.2% of
memory.

> Because from where I stand it all looks like we do all these fancy
> recovery things but is userspace even paying attention or using them or
> whatever...

See above. With core counts continuing to increase, the cloud service
providers really want to see fewer events that crash the whole physical
machine (taking down dozens, or hundreds, of guest VMs).

-Tony
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Thu, Apr 08, 2021 at 10:49:58AM +0200, Borislav Petkov wrote:
> On Wed, Apr 07, 2021 at 02:43:10PM -0700, Luck, Tony wrote:
> > On Wed, Apr 07, 2021 at 11:18:16PM +0200, Borislav Petkov wrote:
> > > On Thu, Mar 25, 2021 at 05:02:34PM -0700, Tony Luck wrote:
> > > > Andy Lutomirski pointed out that sending SIGBUS to tasks that
> > > > hit poison in the kernel copying syscall parameters from user
> > > > address space is not the right semantic.
> > >
> > > What does that mean exactly?
> >
> > Andy said that a task could check a memory range for poison by
> > doing:
> >
> > 	ret = write(fd, buf, size);
> > 	if (ret == size) {
> > 		memory range is all good
> > 	}
> >
> > That doesn't work if the kernel sends a SIGBUS.
> >
> > It doesn't seem a likely scenario ... but Andy is correct that
> > the above ought to work.
>
> We need to document properly what this is aiming to fix. He said
> something yesterday along the lines of kthread_use_mm() hitting a SIGBUS
> when a kthread "attaches" to an address space. I'm still unclear as to
> how exactly that happens - there are only a handful of kthread_use_mm()
> users in the tree...

Not clear to me either ... but sending a SIGBUS to a kthread isn't
going to do anything useful. So avoiding doing that is another worthy
goal.

> > Yes. This is for the kernel reading memory belonging to the "current"
> > task.
>
> Provided "current" is really the task to which the poison page belongs.
> That kthread_use_mm() thing sounded like the wrong task gets killed.
> But that needs more details.

With these patches nothing gets killed when the kernel touches user
poison. If this is in a regular system call then these patches will
return EFAULT to the user (but now that I see EHWPOISON exists that
looks like a better choice - so applications can distinguish "I just
used an invalid address in a parameter to a syscall" from "This isn't
my fault, the memory broke").

> > Same in that the page gets unmapped.
> > Different in that there is no SIGBUS if the kernel did the access for
> > the user.
>
> What is even the actual use case with sending tasks SIGBUS on poison
> consumption? KVM? Others?

KVM apparently passes a machine check into the guest. Though it seems
to be missing the MCG_STATUS information to tell the guest whether this
is an "Action Required" machine check, or an "Action Optional" one
(i.e. whether the poison was found synchronously by execution of the
current instruction, or asynchronously).

> Are we documenting somewhere: "if your process gets a SIGBUS and this
> and that, which means your page got offlined, you should do this and
> that to recover"?

There is the ancient Documentation/vm/hwpoison.rst from 2009 ...
nothing seems wrong in that, but it could use some updates.

I don't know how much detail we might want to go into on recovery
strategies for applications. In terms of production s/w there was one
ISV who prototyped recovery for their application but last time I
checked didn't enable it in the production version.

Essentially it boils down to: the SIGBUS handler gets additional data
giving the virtual address that has gone away.

1) Can the application replace the lost page?

   Use mmap(addr, MAP_FIXED, ...) to map a fresh page into the gap and
   fill it with replacement data. This case can return from the SIGBUS
   handler to re-execute the failed instruction.

2) Can the application continue in degraded mode w/o the lost page?

   Hunt down pointers to the lost page and update structures to say
   "this data lost". Use siglongjmp() to go to a preset recovery path.

3) Can the application shut down gracefully?

   Record details of the lost page. Inform next-of-kin. Exit.

4) Default - just exit.

-Tony
RE: [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery
> What I'm still unclear on, does this new version address that
> "mysterious" hang or panic which the validation team triggered or you
> haven't checked yet?

No :-(

They are triggering some case where multiple threads in a process hit
the same poison, and somehow memory_failure() fails to complete
offlining the page. At this point any other threads that hit that page
get the early return from memory_failure() (because the page flags say
it is poisoned) ... and so we loop.

But the "recover from cases where multiple machine checks happen
simultaneously" case is orthogonal to the "do the right thing to
recover when the kernel touches poison at a user address" case. So I
think we can tackle them separately.

-Tony
RE: [RFC 0/4] Fix machine check recovery for copy_from_user
> I have one scenario, can you take it into account:
>
> If one copyin case occurs, write() returns per your patch, the user
> process may check the return values, and for errors, it may exit the
> process. Then the error page will be freed, and then the page may be
> allocated to another process or to the kernel itself, then code will
> initialize it and this will trigger one SRAO. If it's used by the
> kernel, we may do nothing for this, and the kernel may still touch it,
> and lead to one panic.

In this case kill_me_never() calls memory_failure() with flags == 0. I
think (hope!) that means that it will unmap the page from the task, but
will not send a signal. When the task exits, the PTE for this page has
the swap/poison signature, so the page is not freed for re-use.

-Tony
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Wed, Apr 07, 2021 at 11:18:16PM +0200, Borislav Petkov wrote:
> On Thu, Mar 25, 2021 at 05:02:34PM -0700, Tony Luck wrote:
> > Andy Lutomirski pointed out that sending SIGBUS to tasks that
> > hit poison in the kernel copying syscall parameters from user
> > address space is not the right semantic.
>
> What does that mean exactly?

Andy said that a task could check a memory range for poison by doing:

	ret = write(fd, buf, size);
	if (ret == size) {
		memory range is all good
	}

That doesn't work if the kernel sends a SIGBUS.

It doesn't seem a likely scenario ... but Andy is correct that the
above ought to work.

> From looking at the code, that is this conditional:
>
> 	if (t == EX_HANDLER_UACCESS && regs && is_copy_from_user(regs)) {
> 		m->kflags |= MCE_IN_KERNEL_RECOV;
> 		m->kflags |= MCE_IN_KERNEL_COPYIN;
>
> so what does the above have to do with syscall params?

Most "copy from user" instances are the result of a system call
parameter (e.g. "buf" in the write(2) example above).

> If it is about us being in ring 0 and touching user memory and eating
> poison in same *user* memory while doing so, then sure, that makes
> sense.

Yes. This is for the kernel reading memory belonging to the "current"
task.

> > So stop doing that. Add a new kill_me_never() call back that
> > simply unmaps and offlines the poison page.
>
> Right, that's the same as handling poisoned user memory.

Same in that the page gets unmapped. Different in that there is no
SIGBUS if the kernel did the access for the user.

-Tony
RE: [PATCH v3] mm,hwpoison: return -EHWPOISON when page already poisoned
>> Combined with my "mutex" patch (to get rid of races where the 2nd
>> process returns early, but the first process is still looking for
>> mappings to unmap and tasks to signal) this patch moves forward a bit.
>> But I think it needs an additional change here in kill_me_maybe() to
>> just "return" if there is an EHWPOISON return from memory_failure().
>
> Got this, thanks for your reply!
> I will dig into this!

One problem with this approach is when the first task to find poison
fails to complete its actions. Then the poison pages are not unmapped,
and just returning from kill_me_maybe() gets into a loop :-(

-Tony
Re: [PATCH v3] mm,hwpoison: return -EHWPOISON when page already poisoned
On Wed, Mar 31, 2021 at 07:25:40PM +0800, Aili Yao wrote: > When the page is already poisoned, another memory_failure() call in the > same page now return 0, meaning OK. For nested memory mce handling, this > behavior may lead to one mce looping, Example: > > 1.When LCME is enabled, and there are two processes A && B running on > different core X && Y separately, which will access one same page, then > the page corrupted when process A access it, a MCE will be rasied to > core X and the error process is just underway. > > 2.Then B access the page and trigger another MCE to core Y, it will also > do error process, it will see TestSetPageHWPoison be true, and 0 is > returned. > > 3.The kill_me_maybe will check the return: > > 1244 static void kill_me_maybe(struct callback_head *cb) > 1245 { > > 1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) && > 1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { > 1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT, > p->mce_whole_page); > 1257 sync_core(); > 1258 return; > 1259 } > > 1267 } With your change memory_failure() will return -EHWPOISON for the second task that consumes poison ... so that "if" statement won't be true and so we fall into the following code: 1273 if (p->mce_vaddr != (void __user *)-1l) { 1274 force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); 1275 } else { 1276 pr_err("Memory error not recovered"); 1277 kill_me_now(cb); 1278 } If this was a copy_from_user() machine check, p->mce_vaddr is set and the task gets a BUS_MCEERR_AR SIGBUS, otherwise we print that "Memory error not recovered" message and send a generic SIGBUS. I don't think either of those options is right. Combined with my "mutex" patch (to get rid of races where 2nd process returns early, but first process is still looking for mappings to unmap and tasks to signal) this patch moves forward a bit. But I think it needs an additional change here in kill_me_maybe() to just "return" if there is a EHWPOISON return from memory_failure() -Tony
RE: [PATCH] x86/mce/dev-mcelog: Fix potential memory access error
- set_bit(MCE_OVERFLOW, (unsigned long *)&mcelog->flags); + mcelog->flags |= BIT(MCE_OVERFLOW); set_bit() needs to be an atomic operation because it can race with the code that gets and clears this bit: do { flags = mcelog->flags; } while (cmpxchg(&mcelog->flags, flags, 0) != flags); Originally this was needed because mcelog could be called from #MC context. That's all changed now and the machine check notifier chain routines are called from a kernel thread. So some other locking (mutex?) could be used to protect access to mcelog->flags. -Tony
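The get-and-clear pattern that cmpxchg() loop implements can be shown with C11 atomics. This is a userspace sketch with invented names, not the kernel code: atomic_fetch_or() plays the set_bit() role and atomic_exchange() plays the cmpxchg() loop role, so no flag update is ever lost between the reader's load and its store.

```c
#include <stdatomic.h>

/* Userspace sketch of the mcelog->flags handling (names hypothetical).
 * Writers set bits with an atomic RMW; the reader snapshots and clears
 * the whole word in one atomic step. */
static atomic_ulong mcelog_flags;

static void set_overflow_bit(unsigned int bit)
{
	atomic_fetch_or(&mcelog_flags, 1UL << bit);	/* set_bit() role */
}

static unsigned long get_and_clear_flags(void)
{
	return atomic_exchange(&mcelog_flags, 0);	/* cmpxchg() loop role */
}
```

If the writer used a plain `|=` instead, a bit set between the reader's load and its store could be wiped out by the clear, which is exactly why the non-atomic version in the patch under review needs some other lock (the mutex Tony suggests).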
RE: [PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors
> Yeah, into > > fd258dc4442c ("x86/mce: Add Skylake quirk for patrol scrub reported errors") Thanks ... my memory is failing, and I forgot that the patch had been improved and moved from core.c (where I looked for it) into severity.c -Tony
RE: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
> What is the justification for making this rate limit per UID and not > per task, per process or systemwide? The concern is that a malicious user is running a workload that loops obtaining the buslock. This brings the whole system to its knees. Limiting per task doesn't help. The user can just fork(2) a whole bunch of tasks for a distributed buslock attack. Systemwide might be an interesting alternative. The downside would be accidental rate limiting of non-malicious tasks that happen to grab a bus lock periodically but in the same window with other buslocks from other users. Do you think that's a risk worth taking to make the code simpler? -Tony
Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
On Fri, Mar 12, 2021 at 11:48:31PM +, Luck, Tony wrote: > Thanks to the decode of the instruction we do have the virtual address. So we > just need > a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with > a write > of a "not-present" value. Maybe a different poison type from the one we get > from > memory_failure() so that the #PF code can recognize this as a special case > and do any > other work that we avoided because we were in #MC context. There is no such thing as the safe walk I describe above. In a multi-threaded application other threads might munmap(2) bits of the address space which could free some of the p4d->pud->pmd->pte tables. But the pgd table is guaranteed to exist as long as any of the threads in the process exist. Which gave rise to a new approach (partial credit to Dave Hansen who helped me understand why a more extreme patch that I wanted to use wasn't working ... and he pointed out the pgd persistence). N.B. The code below DOES NOT WORK. My test application tried to do a write(2) syscall with some poison in the buffer. Goal is that write should return a short count (with only the bytes from the buffer leading up to the poison written to the file). Currently the process gets a surprise SIGSEGV in some glibc code :-( The code would also need a bunch more work to handle the fact that in a multi-threaded application multiple threads might consume the poison, and some innocent threads not consuming the poison may also end up drawn into the melee in the page fault handler. The big picture: --- This approach passes the buck for the recovery from the #MC handler (which has very limited options to do anything useful) to the #PF handler (which is a much friendlier environment where locks and mutexes can be obtained as needed). 
The mechanism for this buck passing is for the #MC handler to set a reserved bit in the PGD entry for the user address where the machine check happened, flush the TLB for that address, and then return from the #MC handler to the kernel get_user()/copy_from_user() code which will re-execute the instruction to access the poison address, but this time take a #PF because of the reserved bit in the PGD. There are a couple of changes bundled in here to help my debugging: 1) Turn off UCNA calls to memory_failure() ... just avoids races and makes tests more deterministic for now 2) Disable "fast string" support ... avoid "REP MOVS" copy routines since I'm testing on an old system that treats poison consumed by REP MOVS as fatal. Posted here for general comments on the approach. On the plus side it avoids taking multiple machine checks in the kernel when it is accessing user data. I think it can also meet the goal set by Andy Lutomirski of avoiding SIGBUS from syscalls that happen to touch user poison. They should see either short counts for calls like write(2) that may partially succeed or -EFAULT if the system call totally failed. 
-Tony --- diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index f24d7ef8fffa..e533eaf20834 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -23,6 +23,7 @@ #define _PAGE_BIT_SOFTW2 10 /* " */ #define _PAGE_BIT_SOFTW3 11 /* " */ #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ +#define _PAGE_BIT_RESERVED1 51 /* Reserved bit */ #define _PAGE_BIT_SOFTW4 58 /* available for programmer */ #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ @@ -56,6 +57,7 @@ #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE) #define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL) #define _PAGE_CPA_TEST (_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST) +#define _PAGE_RESERVED1 (_AT(pteval_t, 1) << _PAGE_BIT_RESERVED1) #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS #define _PAGE_PKEY_BIT0 (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0) #define _PAGE_PKEY_BIT1 (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1) diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 7f7200021bd1..269e8ee04c11 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -45,4 +45,24 @@ void __noreturn handle_stack_overflow(const char *message, unsigned long fault_address); #endif +static inline void pgd_set_reserved(pgd_t *pgdp, unsigned long addr) +{ + pgd_t pgd; + + pgdp += pgd_index(addr); + pgd = *pgdp; + pgd.pgd |= _PAGE_RESERVED1; + WRITE_ONCE(*pgdp, pgd); +} + +static inline void pgd_clr_reserved(pgd_t *pgdp, unsigned long addr) +{ + pgd_t pgd; + + pgdp += pgd_index(addr); + pgd = *pgdp; + pgd.pgd &=
RE: EDAC list as Trojan Horse distribution ??
>> Nothing new - just the next spammer attempt. > But this was a new class of Spam. So far i got only mass mailing... This was > personalized based on my previous e-Mail (did not include this part in my > mail) Somewhat new - combining trawling of public mailing lists for addresses with a phishing attack trying to get you to open a (presumably) malicious payload. I haven't personally seen that ... probably my company's e-mail filters blocked it. -Tony
RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
>> will memory_failure() find it and unmap it? if succeed, then the current >> will be >> signaled with correct vaddr and shift? > > That's a very good question. I didn't see a SIGBUS when I first wrote this > code, > hence all the p->mce_vaddr. But now I'm > a) not sure why there wasn't a signal > b) if we are to fix the problems noted by AndyL, need to make sure that there > isn't a SIGBUS Tests on upstream kernel today show that memory_failure() is both unmapping the page and sending a SIGBUS. My biggest issue with the KERNEL_COPYIN recovery path is that we don't have code to mark the page not present while we are still in do_machine_check(). That's resulted in recovery working for simple cases where there is a single get_user() call followed by an error return if that failed. But more complex cases require more machine checks and a touching faith that the kernel will eventually give up trying (spoiler: it sometimes doesn't). Thanks to the decode of the instruction we do have the virtual address. So we just need a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with a write of a "not-present" value. Maybe a different poison type from the one we get from memory_failure() so that the #PF code can recognize this as a special case and do any other work that we avoided because we were in #MC context. -Tony
RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
> Sorry to interrupt as I am really confused here: > If it's a copyin case, has the page been mapped for the current process? Yes. The kernel has the current process mapped to the low address range and code like get_user(), put_user() "simply" reaches down to those addresses (maybe not so simply, as the access needs to be within a STAC/CLAC section of code on modern CPUs, and the access instruction needs an entry in EXTABLE so that the page fault handler can fix it if there isn't a page mapped at the user address). > will memory_failure() find it and unmap it? if succeed, then the current will > be > signaled with correct vaddr and shift? That's a very good question. I didn't see a SIGBUS when I first wrote this code, hence all the p->mce_vaddr. But now I'm a) not sure why there wasn't a signal b) if we are to fix the problems noted by AndyL, need to make sure that there isn't a SIGBUS > Maybe the mce_vaddr is set correctly, but we may lost the correct page shift? Yes. There is a hard-coded PAGE_SHIFT for this case - which may not match the actual page size. > And for copyin case, we don't need to call set_mce_nospec()? Yes. We should. Thanks for your good questions. -Tony
RE: [PATCH V2 20/25] perf/x86/intel: Add Alder Lake Hybrid support
>> I think the "sapphire_rapids" is the code name for the server platform. > > If that's really the case, then: > > #define INTEL_FAM6_SAPPHIRERAPIDS_X 0x8F > > is wrong, those things should be uarch name, not platform name. Tony? 0x8F is the model number of the CPU that is named Sapphire Rapids that goes into the Eagle Stream platform If you want a uarch name, that might be the name of the core ( something-cove ... I can't keep track of the names of all the coves). But that would likely lead to different confusion later if that *cove core is used in some other CPU model number that doesn't neatly fit into our _X, _L etc. suffix scheme) -Tony
RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
> I guess that p->mce_vaddr stores the virtual address of the error here. > If so, sending SIGBUS with the address looks enough as we do now, so why > do you walk page table to find the error virtual address? p->mce_vaddr only has the virtual address for the COPYIN case. In that code path we decode the kernel instruction that hit the fault in order to find the virtual address. That's easy because: 1) The kernel RIP is known to be good (can't page fault etc. on kernel address). 2) There are only a half dozen instructions used by the kernel for get_user() or copy_from_user(). When the machine check happens during user execution accessing poison data we only have the physical address (from MCi_ADDR). -Tony
RE: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.
> I think we need to at least fix the existing bug before we add more > signals. AFAICS the MCE_IN_KERNEL_COPYIN code is busted for kernel > threads. Can a kernel thread do get_user() or copy_from_user()? Do we have kernel threads that have an associated user address space to copy from? -Tony
Re: [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned
On Tue, Mar 09, 2021 at 08:28:24AM +, HORIGUCHI NAOYA(堀口 直也) wrote: > On Tue, Mar 09, 2021 at 02:35:34PM +0800, Aili Yao wrote: > > When the page is already poisoned, another memory_failure() call in the > > same page now return 0, meaning OK. For nested memory mce handling, this > > behavior may lead to mce looping, Example: > > > > 1.When LCME is enabled, and there are two processes A && B running on > > different core X && Y separately, which will access one same page, then > > the page corrupted when process A access it, a MCE will be rasied to > > core X and the error process is just underway. > > > > 2.Then B access the page and trigger another MCE to core Y, it will also > > do error process, it will see TestSetPageHWPoison be true, and 0 is > > returned. > > > > 3.The kill_me_maybe will check the return: > > > > 1244 static void kill_me_maybe(struct callback_head *cb) > > 1245 { > > > > 1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) && > > 1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { > > 1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT, > > p->mce_whole_page); > > 1257 sync_core(); > > 1258 return; > > 1259 } > > > > 1267 } > > > > 4. The error process for B will end, and may nothing happened if > > kill-early is not set, The process B will re-excute instruction and get > > into mce again and then loop happens. And also the set_mce_nospec() > > here is not proper, may refer to commit fd0e786d9d09 ("x86/mm, > > mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages"). > > > > For other cases which care the return value of memory_failure() should > > check why they want to process a memory error which have already been > > processed. This behavior seems reasonable. > > Other reviewers shared ideas about the returned value, but actually > I'm not sure which the best one is (EBUSY is not that bad). > What we need to fix the reported issue is to return non-zero value > for "already poisoned" case (the value itself is not so important). 
> > Other callers of memory_failure() (mostly test programs) could see > the change of return value, but they can already see EBUSY now and > anyway they should check dmesg for more detail about why failed, > so the impact of the change is not so big. > > > > > Signed-off-by: Aili Yao > > Reviewed-by: Naoya Horiguchi I think that both this and my "add a mutex" patch are both too simplistic for this complex problem :-( When multiple CPUs race to call memory_failure() for the same page we need the following results: 1) Poison page should be marked not-present in all tasks I think the mutex patch achieves this as long as memory_failure() doesn't hit an error[1]. 2) All tasks that were executing an instruction that was accessing the poison location should see a SIGBUS with virtual address and BUS_MCEERR_AR signature in siginfo. Neither solution achieves this. The -EBUSY return ensures that there is a SIGBUS for the tasks that get the -EBUSY return, but no siginfo details. Just the mutex patch *might* have BUS_MCEERR_AO signature to the race losing tasks, but only if they have PF_MCE_EARLY set (so says the comment in kill_proc() ... but I don't see the code checking for that bit). #2 seems hard to achieve ... there are inherent races that mean the AO SIGBUS could have been queued to the task before it even hits the poison. Maybe should include a non-action: 3) A task should only see one SIGBUS per poison? Not sure if this is achievable either ... what if the task has the same page mapped multiple times? -Tony [1] still looking at why my futex injection test ends with a "reserved kernel page still referenced by 1 users"
[PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races
There can be races when multiple CPUs consume poison from the same page. The first into memory_failure() atomically sets the HWPoison page flag and begins hunting for tasks that map this page. Eventually it invalidates those mappings and may send a SIGBUS to the affected tasks. But while all that work is going on, other CPUs see a "success" return code from memory_failure() and so they believe the error has been handled and continue executing. Fix by wrapping most of the internal parts of memory_failure() in a mutex. Signed-off-by: Tony Luck --- mm/memory-failure.c | 19 +-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 24210c9bd843..c1509f4b565e 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, return rc; } +static DEFINE_MUTEX(mf_mutex); + /** * memory_failure - Handle memory failure of a page. * @pfn: Page Number of the corrupted page @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags) return -ENXIO; } + mutex_lock(&mf_mutex); + try_again: - if (PageHuge(p)) - return memory_failure_hugetlb(pfn, flags); + if (PageHuge(p)) { + res = memory_failure_hugetlb(pfn, flags); + goto out2; + } + if (TestSetPageHWPoison(p)) { pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn); + mutex_unlock(&mf_mutex); return 0; } @@ -1463,9 +1471,11 @@ int memory_failure(unsigned long pfn, int flags) res = MF_FAILED; } action_result(pfn, MF_MSG_BUDDY, res); + mutex_unlock(&mf_mutex); return res == MF_RECOVERED ? 0 : -EBUSY; } else { action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED); + mutex_unlock(&mf_mutex); return -EBUSY; } } @@ -1473,6 +1483,7 @@ int memory_failure(unsigned long pfn, int flags) if (PageTransHuge(hpage)) { if (try_to_split_thp_page(p, "Memory Failure") < 0) { action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED); + mutex_unlock(&mf_mutex); return -EBUSY; } VM_BUG_ON_PAGE(!page_count(p), p); @@ -1517,6 +1528,7 @@ int memory_failure(unsigned long pfn, int flags) num_poisoned_pages_dec(); unlock_page(p); put_page(p); + mutex_unlock(&mf_mutex); return 0; } if (hwpoison_filter(p)) { @@ -1524,6 +1536,7 @@ int memory_failure(unsigned long pfn, int flags) num_poisoned_pages_dec(); unlock_page(p); put_page(p); + mutex_unlock(&mf_mutex); return 0; } @@ -1559,6 +1572,8 @@ int memory_failure(unsigned long pfn, int flags) res = identify_page_state(pfn, p, page_flags); out: unlock_page(p); +out2: + mutex_unlock(&mf_mutex); return res; } EXPORT_SYMBOL_GPL(memory_failure); -- 2.29.2
RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
>> So it should be safe to grab and hold a mutex. See patch below. > > The mutex approach looks simpler and safer, so I'm fine with it. Thanks. Is that an "Acked-by:"? >> /** >> * memory_failure - Handle memory failure of a page. >> * @pfn: Page Number of the corrupted page >> @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags) >> return -ENXIO; >> } >> >> + mutex_lock(&mf_mutex); > > Is it better to take the mutex before the memory_failure_dev_pagemap() block? > Or do we not have to protect against races for device memory? No races (recovery is only attempted for errors in normal memory). -Tony
RE: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.
> Can you point me at that SIGBUS code in a current kernel? It is in kill_me_maybe(). mce_vaddr is setup when we disassemble whatever get_user() or copy from user variant was in use in the kernel when the poison memory was consumed. if (p->mce_vaddr != (void __user *)-1l) { force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); Would it be any better if we used the BUS_MCEERR_AO code that goes into siginfo? That would make it match up better with what happens when poison is found asynchronously by the patrol scrubber. I.e. the semantics are: AR: You just touched poison at this address and need to do something about that. AO: Just letting you know that you have some poison at the address in siginfo. -Tony
Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
This whole page table walking patch is trying to work around the races caused by multiple calls to memory_failure() for the same page. Maybe better to just avoid the races. The comment right above memory_failure says: * Must run in process context (e.g. a work queue) with interrupts * enabled and no spinlocks hold. So it should be safe to grab and hold a mutex. See patch below. -Tony commit 8dd0dbe7d595e02647e9c2c76c03341a9f6bd7b9 Author: Tony Luck Date: Fri Mar 5 10:40:48 2021 -0800 Use a mutex to avoid memory_failure() races diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 24210c9bd843..c1509f4b565e 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, return rc; } +static DEFINE_MUTEX(mf_mutex); + /** * memory_failure - Handle memory failure of a page. * @pfn: Page Number of the corrupted page @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags) return -ENXIO; } + mutex_lock(&mf_mutex); + try_again: - if (PageHuge(p)) - return memory_failure_hugetlb(pfn, flags); + if (PageHuge(p)) { + res = memory_failure_hugetlb(pfn, flags); + goto out2; + } + if (TestSetPageHWPoison(p)) { pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn); + mutex_unlock(&mf_mutex); return 0; } @@ -1463,9 +1471,11 @@ int memory_failure(unsigned long pfn, int flags) res = MF_FAILED; } action_result(pfn, MF_MSG_BUDDY, res); + mutex_unlock(&mf_mutex); return res == MF_RECOVERED ? 0 : -EBUSY; } else { action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED); + mutex_unlock(&mf_mutex); return -EBUSY; } } @@ -1473,6 +1483,7 @@ int memory_failure(unsigned long pfn, int flags) if (PageTransHuge(hpage)) { if (try_to_split_thp_page(p, "Memory Failure") < 0) { action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED); + mutex_unlock(&mf_mutex); return -EBUSY; } VM_BUG_ON_PAGE(!page_count(p), p); @@ -1517,6 +1528,7 @@ int memory_failure(unsigned long pfn, int flags) num_poisoned_pages_dec(); unlock_page(p); put_page(p); + mutex_unlock(&mf_mutex); return 0; } if (hwpoison_filter(p)) { @@ -1524,6 +1536,7 @@ int memory_failure(unsigned long pfn, int flags) num_poisoned_pages_dec(); unlock_page(p); put_page(p); + mutex_unlock(&mf_mutex); return 0; } @@ -1559,6 +1572,8 @@ int memory_failure(unsigned long pfn, int flags) res = identify_page_state(pfn, p, page_flags); out: unlock_page(p); +out2: + mutex_unlock(&mf_mutex); return res; } EXPORT_SYMBOL_GPL(memory_failure);
RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
> From the walk, it seems we have got the virtual address, can we just send a > SIGBUS with it? If the walk wins the race and the pte for the poisoned page is still valid, then yes. But we could have: CPU1CPU2 memory_failure sets poison bit for struct page rmap finds page in task on CPU2 and sets PTE to not-valid-poison memory_failure returns early because struct page already marked as poison walk page tables looking for mapping - don't find it -Tony
Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
On Thu, Mar 04, 2021 at 02:45:24PM +0800, Aili Yao wrote: > > > if your methods works, should it be like this? > > > > > > 1582 pteval = > > > swp_entry_to_pte(make_hwpoison_entry(subpage)); > > > 1583 if (PageHuge(page)) { > > > 1584 hugetlb_count_sub(compound_nr(page), > > > mm); > > > 1585 set_huge_swap_pte_at(mm, address, > > > 1586 pvmw.pte, > > > pteval, > > > 1587 > > > vma_mmu_pagesize(vma)); > > > 1588 } else { > > > 1589 dec_mm_counter(mm, mm_counter(page)); > > > 1590 set_pte_at(mm, address, pvmw.pte, > > > pteval); > > > 1591 } > > > > > > the page fault check if it's a poison page using is_hwpoison_entry(), > > > > > > > And if it works, does we need some locking mechanism before we call > > walk_page_range(); > > if we lock, does we need to process the blocking interrupted error as other > > places will do? > > > > And another thing: > Do we need a call to flush_tlb_page(vma, address) to make the pte changes > into effect? Thanks for all the pointers. I added them to the patch (see below). [The pmd/pud cases may need some tweaking ... but just trying to get the 4K page case working first] I tried testing by skipping the call to memory_failure() and just using this new code to search the page tables for current page and marking it hwpoison (to simulate the case where 2nd process gets the early return from memory_failure(). 
Something is still missing because I get: [ 481.911298] mce: pte_entry: matched pfn - mark poison & zap pte [ 481.917935] MCE: Killing einj_mem_uc: due to hardware memory corruption fault at 7fe64b33b400 [ 482.933775] BUG: Bad page cache in process einj_mem_uc pfn:408b6d6 [ 482.940777] page:13ea6e96 refcount:3 mapcount:1 mapping:e3a069d9 index:0x0 pfn:0x408b6d6 [ 482.951355] memcg:94a809834000 [ 482.955153] aops:shmem_aops ino:3c04 [ 482.959142] flags: 0x97c0880015(locked|uptodate|lru|swapbacked|hwpoison) [ 482.967018] raw: 0097c0880015 94c80e93ec00 94c80e93ec00 94c80a9b25a8 [ 482.975666] raw: 0003 94a809834000 [ 482.984310] page dumped because: still mapped when deleted -Tony commit e5de44560b33e2d407704243566253a70f858a59 Author: Tony Luck Date: Tue Mar 2 15:06:33 2021 -0800 x86/mce: Handle races between machine checks When multiple CPUs hit the same poison memory there is a race. The first CPU into memory_failure() atomically marks the page as poison and continues processing to hunt down all the tasks that map this page so that the virtual addresses can be marked not-present and SIGBUS sent to the task that did the access. Later CPUs get an early return from memory_failure() and may return to user mode and access the poison again. Add a new argument to memory_failure() so that it can indicate when the race has been lost. Fix kill_me_maybe() to scan page tables in this case to unmap pages. 
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 7962355436da..a52c6a772de2 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -28,8 +28,12 @@ #include #include #include +#include +#include +#include #include #include +#include #include #include #include @@ -637,6 +641,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, { struct mce *mce = (struct mce *)data; unsigned long pfn; + int already = 0; if (!mce || !mce_usable_address(mce)) return NOTIFY_DONE; @@ -646,8 +651,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, return NOTIFY_DONE; pfn = mce->addr >> PAGE_SHIFT; - if (!memory_failure(pfn, 0)) { - set_mce_nospec(pfn, whole_page(mce)); + if (!memory_failure(pfn, 0, &already)) { + if (!already) + set_mce_nospec(pfn, whole_page(mce)); mce->kflags |= MCE_HANDLED_UC; } @@ -1248,6 +1254,79 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin *m = *final; } +static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk) +{ + u64 pfn = (u64)walk->private; + struct page *page; + pte_t pteval; + + if (pte_pfn(*pte) == pfn) { +pr_info("pte_entry: matched pfn - mark poison & zap pte\n"); + page = pfn_to_page(pfn); + lock_page(page); +SetPageHWPoison(page); + pteval = swp_entry_to_pte(make_hwpoison_entry(page)); +
RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
> For error address with sigbus, i think this is not an issue resulted by the > patch i post, before my patch, the issue is already there. > I don't find a realizable way to get the correct address for same reason --- > we don't know whether the page mapping is there or not when > we got to kill_me_maybe(), in some case, we may get it, but there are a lot > of parallel issue need to consider, and if failed we have to fallback > to the error brach again, remaining current code may be an easy option; My RFC patch from yesterday removes the uncertainty about whether the page is there or not. After it walks the page tables we know that the poison page isn't mapped (note that patch is RFC for a reason ... I'm 90% sure that it should do a bit more that just clear the PRESENT bit). So perhaps memory_failure() has queued a SIGBUS for this task, if so, we take it when we return from kill_me_maybe() If not, we will return to user mode and re-execute the failing instruction ... but because the page is unmapped we will take a #PF The x86 page fault handler will see that the page for this physical address is marked HWPOISON, and it will send the SIGBUS (just like it does if the page had been removed by an earlier UCNA/SRAO error). At least that's the theory. -Tony
Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
On Fri, Feb 26, 2021 at 10:59:15AM +0800, Aili Yao wrote: > Hi naoya, tony: > > > > > > Idea for what we should do next ... Now that x86 is calling > > > memory_failure() > > > from user context ... maybe parallel calls for the same page should > > > be blocked until the first caller completes so we can: > > > a) know that pages are unmapped (if that happens) > > > b) all get the same success/fail status > > > > One memory_failure() call changes the target page's status and > > affects all mappings to all affected processes, so I think that > > (ideally) we don't have to block other threads (letting them > > early return seems fine). Sometimes memory_failure() fails, > > but even in such case, PG_hwpoison is set on the page and other > > threads properly get SIGBUSs with this patch, so I think that > > we can avoid the worst scenario (like system stall by MCE loop). > > > I agree with naoya's point, if we block for this issue, Does this change the > result > that the process should be killed? Or is there something other still need to > be considered? Ok ... no blocking ... I think someone in this thread suggested scanning the page tables to make sure the poisoned page had been unmapped. There's a walk_page_range() function that does all the work for that. Just need to supply some callback routines that check whether a mapping includes the bad PFN and clear the PRESENT bit. RFC patch below against v5.12-rc1 -Tony >From 8de23b7f1be00ad38e129690dfe0b1558fad5ff8 Mon Sep 17 00:00:00 2001 From: Tony Luck Date: Tue, 2 Mar 2021 15:06:33 -0800 Subject: [PATCH] x86/mce: Handle races between machine checks When multiple CPUs hit the same poison memory there is a race. The first CPU into memory_failure() atomically marks the page as poison and continues processing to hunt down all the tasks that map this page so that the virtual addresses can be marked not-present and SIGBUS sent to the task that did the access. 
Later CPUs get an early return from memory_failure() and may return to user mode and access the poison again. Add a new argument to memory_failure() so that it can indicate when the race has been lost. Fix kill_me_maybe() to scan page tables in this case to unmap pages. --- arch/x86/kernel/cpu/mce/core.c | 61 +++--- drivers/base/memory.c | 2 +- include/linux/mm.h | 2 +- mm/hwpoison-inject.c | 2 +- mm/madvise.c | 2 +- mm/memory-failure.c | 6 ++-- 6 files changed, 64 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 7962355436da..2c6c560f3f92 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #include #include @@ -637,6 +638,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, { struct mce *mce = (struct mce *)data; unsigned long pfn; + int already = 0; if (!mce || !mce_usable_address(mce)) return NOTIFY_DONE; @@ -646,8 +648,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, return NOTIFY_DONE; pfn = mce->addr >> PAGE_SHIFT; - if (!memory_failure(pfn, 0)) { - set_mce_nospec(pfn, whole_page(mce)); + if (!memory_failure(pfn, 0, &already)) { + if (!already) + set_mce_nospec(pfn, whole_page(mce)); mce->kflags |= MCE_HANDLED_UC; } @@ -1248,6 +1251,50 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin *m = *final; } +static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk) +{ + u64 pfn = (u64)walk->private; + + if (pte_pfn(*pte) == pfn) + pte->pte = pte->pte & ~_PAGE_PRESENT; + + return 0; +} + +static int pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk) +{ + int shift = PMD_SHIFT - PAGE_SHIFT; + u64 pfn = (u64)walk->private; + + if (!pmd_large(*pmd)) + return 0; + + if (pmd_pfn(*pmd) >> shift == pfn >> shift) + pmd->pmd = pmd->pmd & ~_PAGE_PRESENT; + + return 0; +} 
+ +static int pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk) +{ + int shift = PUD_SHIFT - PAGE_SHIFT; + u64 pfn = (u64)walk->private; + + if (!pud_large(*pud)) + return 0; + + if (pud_pfn(*pud) >> shift == pfn >> shift) + pud->pud = pud->pud & ~_PAGE_PRESENT; + + return 0; +} + +static struct mm_walk_ops walk = { + .pte_entry = pte_entry, + .pmd_entry = pmd_entry, + .pud_entry = pud_entry +}; + static void kill_me_now(struct callback_head *ch) { force_sig(SIGBUS); @@ -1257,15 +1304,19 @@ static void kill_me_maybe(struct
RE: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.
> Some programs may use read(2), write(2), etc as ways to check if
> memory is valid without getting a signal. They might not want
> signals, which means that this feature might need to be configurable.

That sounds like an appalling hack. If users need such a mechanism we
should create some better way to do it.

An aeon ago ACPI created the RASF table as a way for the OS to ask the
platform to scan a block of physical memory using the patrol scrubber in
the memory controller. I never did anything with it in Linux because it
was just too complex and I didn't know of any use cases. Users would
want to check virtual addresses anyway.

Perhaps some new option MADV_CHECKFORPOISON for madvise(2)?

-Tony
RE: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.
> Programs that get a signal might expect that the RIP that the signal > frame points to is the instruction that caused the signal and that the > instruction faulted without side effects. For SIGSEGV, I would be > especially nervous about this. Maybe SIGBUS is safer. For SIGSEGV, > it's entirely valid to look at CR2 / si_fault_addr, fix it up, and > return. This would be completely *invalid* with your patch. I'm not > sure what to do about this. The original plan was that s/w like databases would be able to write their own application specific recovery code. E.g. they hit poison while reading some "table". The process gets a SIGBUS with siginfo telling the handler the virtual address range that has been lost. The code uses mmap(MAP_FIXED) to map a new page into the lost address and fills it with suitable data (either reconstructing lost data by replaying transactions, or filling the table with some "data unknown" indicator). Then the SIGBUS handler returns to re-execute the instruction that failed. As far as I know nobody has been that creative in production s/w. But I think there are folks with a siglongjmp() to a "this whole transaction just failed" safe point. -Tony
Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
On Fri, Feb 26, 2021 at 10:52:50AM +0800, Aili Yao wrote:
> Hi Naoya, Oscar, David:
>
> > > We could use some negative value (error code) to report the reported case,
> > > then as you mentioned above, some callers need change to handle the
> > > new case, and the same is true if you use some positive value.
> > > My preference is -EHWPOISON, but other options are fine if justified
> > > well.
> >
> > -EHWPOISON seems like a good fit.
>
> I am OK with the -EHWPOISON error code, but I have one doubt here:
> when we return this -EHWPOISON error code, does this mean we have to
> add a new error code to error-base.h or errno.h? Is this easily done?

The page already being poisoned isn't really an error though. Just the
result of a race condition. What if we added an extra argument to
memory_failure() so it can tell the caller that the specific reason for
the early successful return is that the page was already poisoned?

Something like this (untested - patch against v5.11):

-Tony

---
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e133ce1e562b..0e32c4d879fb 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -637,6 +637,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 {
 	struct mce *mce = (struct mce *)data;
 	unsigned long pfn;
+	int already = 0;

 	if (!mce || !mce_usable_address(mce))
 		return NOTIFY_DONE;
@@ -646,8 +647,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 		return NOTIFY_DONE;

 	pfn = mce->addr >> PAGE_SHIFT;
-	if (!memory_failure(pfn, 0)) {
-		set_mce_nospec(pfn, whole_page(mce));
+	if (!memory_failure(pfn, 0, &already)) {
+		if (!already)
+			set_mce_nospec(pfn, whole_page(mce));
 		mce->kflags |= MCE_HANDLED_UC;
 	}

@@ -1245,15 +1247,19 @@ static void kill_me_maybe(struct callback_head *cb)
 {
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
+	int already = 0;

 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;

-	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
+	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
 	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
-		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+		if (already)
+			force_sig(SIGBUS); // BETTER CODE NEEDED HERE!!!
+		else
+			set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
 		sync_core();
 		return;
 	}
@@ -1440,7 +1446,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 EXPORT_SYMBOL_GPL(do_machine_check);

 #ifndef CONFIG_MEMORY_FAILURE
-int memory_failure(unsigned long pfn, int flags)
+int memory_failure(unsigned long pfn, int flags, int *already)
 {
 	/* mce_severity() should not hand us an ACTION_REQUIRED error */
 	BUG_ON(flags & MF_ACTION_REQUIRED);
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index eef4ffb6122c..24c36623e492 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev,
 	if (kstrtoull(buf, 0, &pfn) < 0)
 		return -EINVAL;
 	pfn >>= PAGE_SHIFT;
-	ret = memory_failure(pfn, 0);
+	ret = memory_failure(pfn, 0, NULL);
 	return ret ? ret : count;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..88b92820465c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3045,7 +3045,7 @@ enum mf_flags {
 	MF_MUST_KILL = 1 << 2,
 	MF_SOFT_OFFLINE = 1 << 3,
 };
-extern int memory_failure(unsigned long pfn, int flags);
+extern int memory_failure(unsigned long pfn, int flags, int *already);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 1ae1ebc2b9b1..bfd5151dcd3f 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val)

 inject:
 	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
-	return memory_failure(pfn, 0);
+	return memory_failure(pfn, 0, NULL);
 }

 static int hwpoison_unpoison(void *data, u64 val)
diff --git a/mm/madvise.c b/mm/madvise.c
index 6a660858784b..ade1956632aa 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -907,7 +907,7 @@ static int madvise_inject_error(int behavior,
 	} else {
 		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned
On Thu, Feb 25, 2021 at 12:38:06PM +, HORIGUCHI NAOYA(堀口 直也) wrote:
> Thank you for shedding light on this, this race looks worrisome to me.
> We call try_to_unmap() inside memory_failure(), where we find affected
> ptes by page_vma_mapped_walk() and convert into hwpoison entries in
> try_to_unmap_one(). So there seem to be two racy cases:
>
> 1)
>      CPU 0                          CPU 1
>      page_vma_mapped_walk
>                                     clear _PAGE_PRESENT bit
>      // skipped the entry
>
> 2)
>      CPU 0                          CPU 1
>      page_vma_mapped_walk
>        try_to_unmap_one
>                                     clear _PAGE_PRESENT bit
>        convert the entry
>        set_pte_at
>
> In case 1, the affected processes get signals on later access,
> so although the info in SIGBUS could be different, that's OK.
> And we have no impact in case 2.

I've been debugging a similar issue for a few days and finally got
enough traces to partially understand what happened.

The test case is a multi-threaded pointer-chasing micro-benchmark
running on all logical CPUs. We then inject poison into the address
space of the process.

All works fine if one thread consumes poison and completes all Linux
machine check processing before any other threads read the poison. The
page is unmapped, and a SIGBUS is sent (which kills all threads).

But in the problem case I see:

CPU1 reads poison, takes a machine check. It gets to the kill_me_maybe()
task work, which calls memory_failure(). This CPU sets the page poison
flag, but is still executing the rest of the flow to hunt down
tasks/mappings to invalidate pages and send SIGBUS if required.

CPU2 reads the poison. When it gets to memory_failure() there's an early
return because the poison flag is already set. So in current code it
returns and takes the machine check again.

CPU3 reads the poison and starts along the same path that CPU2 did.

Meanwhile CPU1 gets far enough along in memory_failure() and hits a
problem.
It prints:

[ 1867.409837] Memory failure: 0x42a9ff6: reserved kernel page still referenced by 1 users
[ 1867.409850] Memory failure: 0x42a9ff6: recovery action for reserved kernel page: Failed

and doesn't complete unmapping the page that CPU2 and CPU3 are touching.
Other CPUs gradually reach the poison and join in the fun of repeatedly
taking machine checks.

I have not yet tracked down why this user access is reported as a
"reserved kernel page". Some traces showed that the futex(2) syscall was
in use by this benchmark, so maybe the kernel locked a user page that
was a contended futex???

Idea for what we should do next ... now that x86 is calling
memory_failure() from user context ... maybe parallel calls for the same
page should be blocked until the first caller completes so we can:

a) know that pages are unmapped (if that happens)
b) all get the same success/fail status

-Tony
Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.
On Tue, Feb 23, 2021 at 07:33:46AM -0800, Andy Lutomirski wrote:
>
> > On Feb 23, 2021, at 4:44 AM, Aili Yao wrote:
> >
> > On Fri, 5 Feb 2021 17:01:35 +0800
> > Aili Yao wrote:
> >
> >> When one page is already hwpoisoned by MCE AO action, processes may not
> >> be killed, processes mapping this page may make a syscall include this
> >> page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
> >> mode it may be fixed by fixup_exception, current code will just return
> >> error code to user code.
> >>
> >> This is not sufficient, we should send a SIGBUS to the process and log
> >> the info to console, as we can't trust the process will handle the error
> >> correctly.
> >>
> >> Suggested-by: Feng Yang
> >> Signed-off-by: Aili Yao
> >> ---
> >> arch/x86/mm/fault.c | 62 +
> >> 1 file changed, 40 insertions(+), 22 deletions(-)
> >>
> > Hi luto;
> > Is there any feedback?
>
> At the very least, this needs a clear explanation of why your proposed
> behavior is better than the existing behavior.

The explanation is buried in that "can't trust the process" line. E.g.
user space isn't good about checking for failed write(2) syscalls. So if
the poison was in a user buffer passed to write(fd, buffer, count):
SIGBUS would be the action if the process read the poison directly, so
it seems reasonable to send the same signal when the kernel read the
poison on its behalf. It would avoid users that didn't check the return
value merrily proceeding as if everything was OK.

-Tony
RE: [PATCH v2] x86/mce: fix wrong no-return-ip logic in do_machine_check()
> What I think is qemu has not an easy way to get the MCE signature from
> the host, or currently no method for this. So qemu treats all AR as no
> RIPV. Do more is better than do less.

RIPV would be important in the guest in the case where the guest can fix
the problem that caused the machine check and return to the failed
instruction to continue.

I think the only case where this happens is a fault in a read-only page
mapped from a file (typically a code page, but could be a data page). In
this case memory_failure() unmaps the page with the poison, but Linux
can recover by reading data from the file into a new page.

Other cases we send SIGBUS (so the process goes to its signal handler
instead of back to the faulting instruction).

So it would be good if the state of RIPV could be added to the signal
state sent to qemu. If that isn't possible, then this full recovery case
turns into another SIGBUS case.

-Tony
checkpatch warnings for references to earlier commits
Would it be possible to teach checkpatch not to warn about canonical
references to earlier commits? E.g.

WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#7:
commit e80634a75aba ("EDAC, skx: Retrieve and print retry_rd_err_log registers")

Thanks

-Tony
RE: [PATCH] x86, sched: Allow NUMA nodes to share an LLC on Intel platforms
> +#define X86_BUG_NUMA_SHARES_LLC X86_BUG(25) /* CPU may > enumerate an LLC shared by multiple NUMA nodes */ During internal review I wondered why this is a "BUG" rather than a "FEATURE" bit. Apparently, the suggestion for "BUG" came from earlier community discussions. Historically it may have seemed reasonable to say that a cache cannot span NUMA domains. But with more and more things moving off the motherboard and into the socket, this doesn't seem too weird now. -Tony
RE: Pstore : Query on using ramoops driver for DDR
> Can we use existing backend pstore ram driver (fs/pstore/ram.c) for DDR
> instead of SRAM?

The expectation for pstore is that the system will go through a reset
when it crashes. Most systems do not preserve DDR contents across reset.

> Was the current driver written only to support persistent RAM like SRAM
> or it can accept further change

Linux is in a constant state of change :-)

> to support DDR, If we have a mechanism to copy stored data from DDR to
> external device after the crash.

See above about DDR contents. But if you have a platform that does
preserve DDR contents until your "mechanism" can copy the pstore buffer,
then post a patch.

-Tony
Re: [PATCH 02/49] x86/cpu: Describe hybrid CPUs in cpuinfo_x86
On Mon, Feb 08, 2021 at 02:04:24PM -0500, Liang, Kan wrote:
> On 2/8/2021 12:56 PM, Borislav Petkov wrote:
>
> I think it's good enough for perf, but I'm not sure whether other code
> needs the CPU type information.
>
> Ricardo, do you know?
>
> Maybe we should implement a generic function as below for this?
> (Not tested. Just use as an example.)
>
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index a66c1fd..679f5fe 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -2056,3 +2056,11 @@ void arch_smt_update(void)
>  	/* Check whether IPI broadcasting can be enabled */
>  	apic_smt_update();
>  }
> +
> +u32 x86_read_hybrid_type(void)
> +{
> +	if (cpu_feature_enabled(X86_FEATURE_HYBRID_CPU))
> +		return cpuid_eax(0x001a);
> +
> +	return 0;
> +}

Machine check logging will want to include this in "struct mce". But OK
to pick it up with a function like you describe above.

-Tony
Re: [PATCH v3 1/7] seqnum_ops: Introduce Sequence Number Ops
On Fri, Feb 05, 2021 at 01:03:18PM -0700, Shuah Khan wrote:
> On 2/5/21 2:58 AM, Greg KH wrote:
> > On Wed, Feb 03, 2021 at 11:11:57AM -0700, Shuah Khan wrote:
> > > +static inline u32 seqnum32_inc(struct seqnum32 *seq)
> > > +{
> > > +	atomic_t val = ATOMIC_INIT(seq->seqnum);
> > > +
> > > +	seq->seqnum = (u32) atomic_inc_return(&val);
> > > +	if (seq->seqnum >= UINT_MAX)
> > > +		pr_info("Sequence Number overflow %u detected\n",
> > > +			seq->seqnum);
> > > +	return seq->seqnum;
> >
> > As Peter points out, this is not doing what you think it is doing :(
> >
> > Why do you not just have seq->seqnum be a real atomic variable? Trying
> > to switch to/from one like this does not work as there is no
> > "atomic-ness" happening here at all.
>
> Yes. This is sloppy on my part. As Peter and Rafael also pointed out. I
> have to start paying more attention to my inner voice.

What's the end goal with this sequence number type?

Apart from the functional issues with this code already pointed out, it
doesn't seem overly useful to just print the *value* of the sequence
number when it hits the max value. It might be more helpful if each
instance had a seq->name field that is used here to tell the user
*which* user of sequence numbers had seen the wraparound.

But that just begs the question of what users of sequence numbers should
do when wraparound occurs. Any user that can run through sequence
numbers at greater than 10 Hz is going to cause a problem for systems
that stay running for years.

Maybe you should force users to code for wraparound by initializing to
something like 0xff00 so that they all get a wraparound event quickly,
rather than slowly (same as was done with jiffies)?
RE: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
> So that "system hang or panic" which the validation folks triggered, > that cannot be reproduced anymore? Did they run the latest version of > the patch? I will get the validation folks to run the latest version (and play around with hyperthreading if they see problems). -Tony
RE: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
> And the much more important question is, what is the code supposed to > do when that overflow *actually* happens in real life? Because IINM, > an overflow condition on the same page would mean killing the task to > contain the error and not killing the machine... Correct. The cases I've actually hit, the second machine check is on the same address as the first. But from a recovery perspective Linux is going to take away the whole page anyway ... so not complaining if the second (or subsequent) access is within the same page makes sense (and that's what the patch does). The code can't handle it if a subsequent #MC is to a different page (because we only have a single spot in the task structure to store the physical page address). But that looks adequate. If the code is wildly accessing different pages *and* getting machine checks from those different pages ... then something is very seriously wrong with the system. -Tony
RE: [PATCH v2] x86/fault: Send a SIGBUS to user process always for hwpoison page access.
> In any case, this patch needs rebasing on top of my big fault series Is that series in some GIT tree? Or can you give a lore.kernel.org link? Thanks -Tony
Re: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
On Thu, Jan 28, 2021 at 06:57:35PM +0100, Borislav Petkov wrote:
> Crazy idea: if you still can reproduce on -rc3, you could bisect: i.e.,
> if you apply the patch on -rc3 and it explodes and if you apply the same
> patch on -rc5 and it works, then that could be a start... Yeah, don't
> have a better idea here. :-\

I tried reproducing (applied the original patch I posted back to -rc3)
and the same issue stubbornly refused to show up again.

But I did hit something with the same signature (overflow bit set in
bank 1) while running my futex test (which has two processes mapping the
poison page). This time I *do* understand what happened. The test failed
when the two processes were running on the two hyperthreads of the same
core. Seeing overflow in this case is understandable because bank 1 MSRs
on my test machine are shared between the HT threads.

When I run the test again using taskset(1) to allow running only on
thread 0 of each core, it keeps going for hundreds of iterations.

I'm not sure I can stitch together how this overflow also happened for
my single process test. Maybe a migration from one HT thread to the
other at an awkward moment?

-Tony
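The taskset(1) setup described above (one HT thread per core) can be derived from the standard sysfs topology files; a rough sketch, assuming the comma-separated `thread_siblings_list` format (some kernels use range syntax like `0-1`, which this does not handle) and that the actual injection test needs EINJ and root:

```shell
#!/bin/sh
# Build a CPU list containing only the lowest-numbered sibling of each
# physical core, so two test processes never share one core's MCA bank 1.
first_siblings() {
	# Input: one "thread_siblings_list" line per CPU, e.g. "0,32".
	# Output: unique first siblings, comma separated.
	cut -d, -f1 | sort -nu | paste -sd, -
}

cpus=$(cat /sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list \
	2>/dev/null | first_siblings)
echo "running on CPUs: ${cpus:-0}"
# taskset -c "$cpus" ./einj_mem_uc   # real injection run needs EINJ + root
```

On a 2-socket 32-core machine where CPU n and CPU n+32 are siblings, this yields `0,1,...,31`.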
Re: [PATCH] x86/fault: Send SIGBUS to user process always for hwpoison page access.
Thanks for the explanation and test code. I think I see better what is
going on here.

[I took your idea of using madvise(...MADV_HWPOISON) and added a new
"-S" option to my einj_mem_uc test program to use madvise instead of
ACPI/EINJ for injections. Update pushed here:
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git ]

There have been some small changes to arch/x86/mm/fault.c since you
wrote the patch. Can you rebase to v5.11-rc5?

Also maybe this might be a case to use IS_ENABLED() instead of #ifdef to
make the code a little less ugly. At least for the 2nd hunk in your
patch this would work well:

	if (IS_ENABLED(CONFIG_MEMORY_FAILURE) &&
	    (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)))
		no_context(regs, error_code, address, SIGBUS, BUS_MCEERR_AR);
	else
		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);

The first hunk might need a bit more thought.

-Tony
RE: linux-next: removal of the ia64 tree
Stephen,

Yes. Most stuff I do these days goes through the RAS tree I share with
Boris.

-Tony

-----Original Message-----
From: Stephen Rothwell
Sent: Thursday, January 28, 2021 11:53 AM
To: Luck, Tony
Cc: Arnd Bergmann; Linux Kernel Mailing List; Linux Next Mailing List
Subject: linux-next: removal of the ia64 tree

Hi Tony,

I noticed commit 96ec72a3425d ("ia64: Mark architecture as orphaned")
just went into Linus' tree, so I assume I should drop the ia64 tree from
linux-next?

--
Cheers,
Stephen Rothwell
Re: [PATCH] x86/fault: Send SIGBUS to user process always for hwpoison page access.
On Thu, Jan 28, 2021 at 07:43:26PM +0800, Aili Yao wrote:
> when one page is already hwpoisoned by AO action, process may not be
> killed, the process mapping this page may make a syscall include this
> page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
> mode it may be fixed by fixup_exception, current code will just return
> error code to user process.

Shouldn't the AO action that poisoned the page have also unmapped it?

> This is not suffient, we should send a SIGBUS to the process and log the
> info to console, as we can't trust the process will handle the error
> correctly.

I agree with this part ... few apps check for -EFAULT and do the right
thing. But I'm not sure how this happens. Can you provide a bit more
detail on the steps?

-Tony

P.S. Typo: s/suffient/sufficient/

> Suggested-by: Feng Yang
> Signed-off-by: Aili Yao
> ---
>  arch/x86/mm/fault.c | 16
>  1 file changed, 16 insertions(+)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f1f1b5a0956a..36d1e385512b 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -662,7 +662,16 @@ no_context(struct pt_regs *regs, unsigned long error_code,
>  	 * In this case we need to make sure we're not recursively
>  	 * faulting through the emulate_vsyscall() logic.
>  	 */
> +#ifdef CONFIG_MEMORY_FAILURE
> +	if (si_code == BUS_MCEERR_AR && signal == SIGBUS)
> +		pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> +			current->comm, current->pid, address);
> +
> +	if ((current->thread.sig_on_uaccess_err && signal) ||
> +	    (si_code == BUS_MCEERR_AR && signal == SIGBUS)) {
> +#else
>  	if (current->thread.sig_on_uaccess_err && signal) {
> +#endif
>  		sanitize_error_code(address, &error_code);
>
>  		set_signal_archinfo(address, error_code);
> @@ -927,7 +936,14 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
>  {
>  	/* Kernel mode? Handle exceptions or die: */
>  	if (!(error_code & X86_PF_USER)) {
> +#ifdef CONFIG_MEMORY_FAILURE
> +		if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE))
> +			no_context(regs, error_code, address, SIGBUS, BUS_MCEERR_AR);
> +		else
> +			no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
> +#else
>  		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
> +#endif
>  		return;
>  	}
>
> --
> 2.25.1
>
Re: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
On Tue, Jan 26, 2021 at 12:03:14PM +0100, Borislav Petkov wrote:
> On Mon, Jan 25, 2021 at 02:55:09PM -0800, Luck, Tony wrote:
> > And now I've changed it back to non-atomic (but keeping the
> > slightly cleaner looking code style that I used for the atomic
> > version). This one also works for thousands of injections and
> > recoveries. Maybe take it now before it stops working again :-)
>
> Hmm, so the only differences I see between your v4 and this are:
>
> -@@ -1238,6 +1238,7 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
> +@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
>
>   static void kill_me_now(struct callback_head *ch)
>   {
> ++	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
> ++
> +	p->mce_count = 0;
>  	force_sig(SIGBUS);
>   }
>
> Could the container_of() macro have changed something?

That change was to fix my brown paper bag moment (does not compile
without a variable named "p" in scope to be used on the next line.)

> Because we don't know yet (right?) why would it fail? Would it read
> stale ->mce_count data? If so, then a barrier is missing somewhere.

I don't see how a barrier would make a difference. In the common case
all this code is executed on the same logical CPU. Return from
do_machine_check() tries to return to user mode and finds that there is
some task_work to execute first. In some cases Linux might context
switch to something else. Perhaps this task even gets picked up by
another CPU to run the queued task_work functions. But I imagine that
the context switch should act as a barrier ... shouldn't it?

> Or what is the failure exactly?

After a few cycles of the test injection to user mode, I saw an overflow
in the machine check bank. As if it hadn't been cleared from the
previous iteration ... but all the banks are cleared as soon as we find
that the machine check is recoverable, a while before getting to the
code I changed.

When the tests were failing, code was on top of v5.11-rc3. Latest
experiments moved to -rc5. There's just a tracing fix from PeterZ
between rc3 and rc5 to mce/core.c:

  737495361d44 ("x86/mce: Remove explicit/superfluous tracing")

which doesn't appear to be a candidate for the problems I saw.

> Because if I take it now without us knowing what the issue is, it will
> start failing somewhere - Murphy's our friend - and then we'll have to
> deal with breaking people's boxes. Not fun.

Fair point.

> The other difference is:
>
> @@ -76,8 +71,10 @@ index 13d3f1cbda17..5460c146edb5 100644
> -	current->mce_kflags = m->kflags;
> -	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> -	current->mce_whole_page = whole_page(m);
> ++	int count = ++current->mce_count;
> ++
> +	/* First call, save all the details */
> -+	if (current->mce_count++ == 0) {
> ++	if (count == 1) {
> +		current->mce_addr = m->addr;
> +		current->mce_kflags = m->kflags;
> +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
>
> Hmm, a local variable and a pre-increment. Can that have an effect somehow?

This is the bit that changed during my detour using atomic_t mce_count.
I added the local variable to capture the value from
atomic_inc_return(), then used it later, instead of a bunch of
atomic_read() calls. I kept it this way because "if (count == 1)" is
marginally easier to read than "if (current->mce_count++ == 0)".

> > +	/* Ten is likley overkill. Don't expect more than two faults before task_work() */
>
> Typo: likely.

Oops. Fixed.

-Tony
[PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery
On Thu, Jan 21, 2021 at 01:09:59PM -0800, Luck, Tony wrote:
> On Wed, Jan 20, 2021 at 01:18:12PM +0100, Borislav Petkov wrote:
> But, on a whim, I changed the type of mce_count from "int" to "atomic_t" and
> fixed up the increment & clear to use atomic_inc_return() and atomic_set().
> See updated patch below.
>
> I can't see why this should have made any difference. But the "int" version
> crashes in a few dozen machine checks. The "atomic_t" version has run many
> thousands of machine checks (10213 in the current boot according to
> /proc/interrupts) with no issues.

And now I've changed it back to non-atomic (but keeping the slightly
cleaner looking code style that I used for the atomic version). This one
also works for thousands of injections and recoveries. Maybe take it now
before it stops working again :-)

-Tony

From: Tony Luck

Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT. Also queue_task_work() sets up
so that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a
second machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that was
used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run() is
called it loops forever in this code:

	do {
		next = work->next;
		work->func(work);
		work = next;
		cond_resched();
	} while (work);

Add a counter (current->mce_count) to keep track of repeated machine
checks before task_work() is called. First machine check saves the
address information and calls task_work_add(). Subsequent machine checks
before that task_work callback is executed check that the address is in
the same page as the first machine check (since the callback will
offline exactly one page).

Expected worst case is two machine checks before moving on (e.g. one
user access with page faults disabled, then a repeat to the same address
with page faults enabled). Just in case there is some code that loops
forever, enforce a limit of 10.

Signed-off-by: Tony Luck
---
 arch/x86/kernel/cpu/mce/core.c | 35 +++---
 include/linux/sched.h          |  1 +
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e133ce1e562b..30dedffd6f88 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin

 static void kill_me_now(struct callback_head *ch)
 {
+	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
+
+	p->mce_count = 0;
 	force_sig(SIGBUS);
 }

@@ -1246,6 +1249,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;

+	p->mce_count = 0;
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);

 	if (!p->mce_ripv)
@@ -1266,12 +1270,29 @@ static void kill_me_maybe(struct callback_head *cb)
 	}
 }

-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
+	int count = ++current->mce_count;
+
+	/* First call, save all the details */
+	if (count == 1) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
+	}
+
+	/* Ten is likley overkill. Don't expect more than two faults before task_work() */
+	if (count > 10)
+		mce_panic("Too many machine checks while accessing user data", m, msg);
+
+	/* Second or later call, make sure page address matches the one from first call */
+	if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+		mce_panic("Machine checks to different user pages", m, msg);
+
+	/* Do n
Re: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery
On Wed, Jan 20, 2021 at 01:18:12PM +0100, Borislav Petkov wrote: > On Tue, Jan 19, 2021 at 03:57:59PM -0800, Luck, Tony wrote: > > But the real validation folks took my patch and found that it has > > destabilized cases 1 & 2 (and case 3 also chokes if you repeat a few > > more times). System either hangs or panics. Generally before 100 > > injection/conumption cycles. > > Hmm, that mce_count increment and check stuff might turn out to be > fragile in the face of a race condition. But I don't see it... I can't see what the race is either. As far as I can see the flow looks like this (for a machine check in user mode) user-mode do_machine_check() [in atomic context on IST etc.] queue_task_work() increments current->mce_count first (only) call saves address etc. and calls task_work_add() return from machine check Plausibly we might context switch to another process here. We could even resume on a different CPU. do_task_work() [in process context] kill_me_maybe() sets mce_count back to zero, ready for another #MC memory_failure() send SIGBUS The kernel get_user() case is similar, but may take extra machine checks before before calling code decides get_user() really is going to keep saying -EFAULT. But, on a whim, I changed the type of mce_count from "int" to "atomic_t" and fixeed up the increment & clear to use atomic_inc_return() and atomic_set(). See updated patch below. I can't see why this should have made any difference. But the "int" version crashes in a few dozen machine checks. The "atomic_t" version has run many thousands of machine checks (10213 in the current boot according to /proc/interrupts) with no issues. -Tony >From d1b003f1de92f87c8dc53dd710fd79ad6e277364 Mon Sep 17 00:00:00 2001 From: Tony Luck Date: Wed, 20 Jan 2021 12:40:50 -0800 Subject: [PATCH] x86/mce: Avoid infinite loop for copy from user recovery Recovery action when get_user() triggers a machine check uses the fixup path to make get_user() return -EFAULT. 
Also queue_task_work() sets up so that kill_me_maybe() will be called on return to user mode to send a SIGBUS to the current process.

But there are places in the kernel where the code assumes that this EFAULT return was simply because of a page fault. The code takes some action to fix that, and then retries the access. This results in a second machine check.

While processing this second machine check queue_task_work() is called again. But since this uses the same callback_head structure that was used in the first call, the net result is an entry on the current->task_works list that points to itself. When task_work_run() is called it loops forever in this code:

	do {
		next = work->next;
		work->func(work);
		work = next;
		cond_resched();
	} while (work);

Add a counter (current->mce_count) to keep track of repeated machine checks before task_work() is called. First machine check saves the address information and calls task_work_add(). Subsequent machine checks before that task_work callback is executed check that the address is in the same page as the first machine check (since the callback will offline exactly one page). Expected worst case is two machine checks before moving on (e.g. one user access with page faults disabled, then a repeat to the same address with page faults enabled). Just in case there is some code that loops forever enforce a limit of 10.
Signed-off-by: Tony Luck
---
 arch/x86/kernel/cpu/mce/core.c | 35 +++---
 include/linux/sched.h | 1 +
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1cbda17..4a8660058b67 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 static void kill_me_now(struct callback_head *ch)
 {
+	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
+
+	atomic_set(&p->mce_count, 0);
 	force_sig(SIGBUS);
 }
@@ -1246,6 +1249,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
+	atomic_set(&p->mce_count, 0);
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 	if (!p->mce_ripv)
@@ -1266,12 +1270,29 @@ static void kill_me_maybe(struct callb
RE: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery
> Yah, some printk sprinkling might be a good start. But debugging in that
> atomic context is always nasty. ;-\

Some very light printk sprinkling (one message in queue_task_work() in atomic context, one each in kill_me_now() and kill_me_maybe() to check when task_work actually called them).

Cases 1 & 2 (user & normal copyin) now work just fine (hundreds of iterations). Case 3 (my futex test) just hangs with only two characters making it to the serial port "[ ". Deeply strange.

-Tony
Re: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery
On Tue, Jan 19, 2021 at 11:56:32AM +0100, Borislav Petkov wrote:
> On Fri, Jan 15, 2021 at 03:23:46PM -0800, Luck, Tony wrote:
> > On Fri, Jan 15, 2021 at 12:51:03PM -0800, Luck, Tony wrote:
> > > static void kill_me_now(struct callback_head *ch)
> > > {
> > > +	p->mce_count = 0;
> > > 	force_sig(SIGBUS);
> > > }
> >
> > Brown paper bag time ... I just pasted that line from kill_me_maybe()
> > and I thought I did a re-compile ... but obviously not since it gives
> >
> > error: ‘p’ undeclared (first use in this function)
> >
> > Option a) (just like kill_me_maybe)
> >
> > struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
> >
> > Option b) (simpler ... not sure why PeterZ did the container_of thing)
> >
> > current->mce_count = 0;
>
> Right, he says it is the canonical way to get it out of callback_head.
> I don't think current will change while the #MC handler runs but we can
> adhere to the design pattern here and do container_of() ...

Ok ... I'll use the canonical way.

But now I've run into a weird issue. I'd run some basic tests with a dozen machine checks in each of:

1) user access
2) kernel copyin
3) futex (multiple accesses from kernel before task_work())

and it passed my tests before I posted. But the real validation folks took my patch and found that it has destabilized cases 1 & 2 (and case 3 also chokes if you repeat a few more times). System either hangs or panics. Generally before 100 injection/consumption cycles.

Their tests are still just doing one at a time (i.e. complete recovery of one machine check before injecting the next error). So there aren't any complicated race conditions.

So if you see anything obviously broken, let me know. Otherwise I'll be poking around at the patch to figure out what is wrong.

-Tony
Re: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery
On Fri, Jan 15, 2021 at 12:51:03PM -0800, Luck, Tony wrote:
> static void kill_me_now(struct callback_head *ch)
> {
> +	p->mce_count = 0;
> 	force_sig(SIGBUS);
> }

Brown paper bag time ... I just pasted that line from kill_me_maybe() and I thought I did a re-compile ... but obviously not since it gives

error: ‘p’ undeclared (first use in this function)

Option a) (just like kill_me_maybe)

struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);

Option b) (simpler ... not sure why PeterZ did the container_of thing)

current->mce_count = 0;

-Tony
[PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery
Recovery action when get_user() triggers a machine check uses the fixup path to make get_user() return -EFAULT. Also queue_task_work() sets up so that kill_me_maybe() will be called on return to user mode to send a SIGBUS to the current process.

But there are places in the kernel where the code assumes that this EFAULT return was simply because of a page fault. The code takes some action to fix that, and then retries the access. This results in a second machine check.

While processing this second machine check queue_task_work() is called again. But since this uses the same callback_head structure that was used in the first call, the net result is an entry on the current->task_works list that points to itself. When task_work_run() is called it loops forever in this code:

	do {
		next = work->next;
		work->func(work);
		work = next;
		cond_resched();
	} while (work);

Add a counter (current->mce_count) to keep track of repeated machine checks before task_work() is called. First machine check saves the address information and calls task_work_add(). Subsequent machine checks before that task_work callback is executed check that the address is in the same page as the first machine check (since the callback will offline exactly one page). Expected worst case is two machine checks before moving on (e.g. one user access with page faults disabled, then a repeat to the same address with page faults enabled). Just in case there is some code that loops forever enforce a limit of 10.

Signed-off-by: Tony Luck
---
V4: Fix bug with "||" where I meant "&&"
    Update stale commit comment referring to mce_busy field in task structure (now called mce_count).
    Add some comments to queue_task_work()

 arch/x86/kernel/cpu/mce/core.c | 31 ---
 include/linux/sched.h | 1 +
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1cbda17..5460c146edb5 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1238,6 +1238,7 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 static void kill_me_now(struct callback_head *ch)
 {
+	p->mce_count = 0;
 	force_sig(SIGBUS);
 }
@@ -1246,6 +1247,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
+	p->mce_count = 0;
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 	if (!p->mce_ripv)
@@ -1266,12 +1268,27 @@ static void kill_me_maybe(struct callback_head *cb)
 	}
 }
-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
+	/* First call, save all the details */
+	if (current->mce_count++ == 0) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
+	}
+
+	/* Ten is likely overkill. Don't expect more than two faults before task_work() */
+	if (current->mce_count > 10)
+		mce_panic("Too many machine checks while accessing user data", m, msg);
+
+	/* Second or later call, make sure page address matches the one from first call */
+	if (current->mce_count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+		mce_panic("Machine checks to different user pages", m, msg);
+
+	/* Do not call task_work_add() more than once */
+	if (current->mce_count > 1)
+		return;

 	if (kill_current_task)
 		current->mce_kill_me.func = kill_me_now;
@@ -1414,7 +1431,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
-		queue_task_work(&m, kill_current_task);
+		queue_task_work(&m, msg, kill_current_task);
 	} else {
 		/*
@@ -1432,7 +1449,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 	}
 	if (m.kflags & MCE_IN_KERNEL_COPYIN)
-		queue_task_work(&m, kill_current_task);
+		queue_task_work(&m, msg, kill_current_task);
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index
Re: [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery
On Fri, Jan 15, 2021 at 04:27:54PM +0100, Borislav Petkov wrote: > On Thu, Jan 14, 2021 at 04:38:17PM -0800, Tony Luck wrote: > > Add a "mce_busy" counter so that task_work_add() is only called once > > per faulty page in this task. > > Yeah, that sentence can be removed now too. I will update with new name "mce_count" and some details. > > -static void queue_task_work(struct mce *m, int kill_current_task) > > +static void queue_task_work(struct mce *m, char *msg, int > > kill_current_task) > > So this function gets called in the user mode MCE case too: > > if ((m.cs & 3) == 3) { > > queue_task_work(, msg, kill_current_task); > } > > Do we want to panic for multiple MCEs to different addresses in user > mode? In the user mode case we should only bump mce_count to "1" and before task_work() gets called. It shouldn't hurt to do the same checks. Maybe it will catch something weird - like an NMI handler on return from the machine check doing a get_user() that hits another machine check during the return from this machine check. AndyL has made me extra paranoid. :-) > > - current->mce_addr = m->addr; > > - current->mce_kflags = m->kflags; > > - current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); > > - current->mce_whole_page = whole_page(m); > > + if (current->mce_count++ == 0) { > > + current->mce_addr = m->addr; > > + current->mce_kflags = m->kflags; > > + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); > > + current->mce_whole_page = whole_page(m); > > + } > > + > > /* Magic number should be large enough */ > > > + if (current->mce_count > 10) Will add similar comment here ... 
and to other tests in this function since it may not be obvious to me next year what I was thinking now :-) > > + if (current->mce_count > 10) > > + mce_panic("Too many machine checks while accessing user data", > > m, msg); > > + > > + if (current->mce_count > 1 || (current->mce_addr >> PAGE_SHIFT) != > > (m->addr >> PAGE_SHIFT)) > > + mce_panic("Machine checks to different user pages", m, msg); > > Will this second part of the test expression, after the "||" ever hit? No :-( This code is wrong. Should be "&&" not "||". Then it makes more sense. Will fix for v4. > In any case, what are you trying to catch with this? Two get_user() to > different pages both catching MCEs? Yes. Trying to catch two accesses to different pages. Need to do this because kill_me_maybe() is only going to offline one page. I'm not expecting that this would ever hit. It means that calling code took a machine check on one page and get_user() said -EFAULT. Then the code decided to access a different page *and* that other page also triggered a machine check. > > + /* Do not call task_work_add() more than once */ > > + if (current->mce_count > 1) > > + return; > > That won't happen either, AFAICT. It'll panic above. With the s/||/&&/ above, we can get here. > > Regardless, I like how this is all confined to the MCE code and there's > no need to touch stuff outside... Thanks for the review. -Tony
Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
On Thu, Jan 14, 2021 at 09:22:13PM +0100, Borislav Petkov wrote: > On Mon, Jan 11, 2021 at 01:44:50PM -0800, Tony Luck wrote: > > @@ -1431,8 +1433,11 @@ noinstr void do_machine_check(struct pt_regs *regs) > > mce_panic("Failed kernel mode recovery", &m, > > msg); > > } > > > > - if (m.kflags & MCE_IN_KERNEL_COPYIN) > > + if (m.kflags & MCE_IN_KERNEL_COPYIN) { > > + if (current->mce_busy) > > + mce_panic("Multiple copyin", &m, msg); > So this: we're currently busy handling the first MCE, why do we must > panic? > > Can we simply ignore all follow-up MCEs to that page? If we s/all/some/ you are saying the same as Andy: > So I tend to think that the machine check code should arrange to > survive some reasonable number of duplicate machine checks. > I.e., the page will get poisoned eventually and that poisoning is > currently executing so all following MCEs are simply nothing new and we > can ignore them. > > It's not like we're going to corrupt more data - we already are > "corrupting" whole 4K. > > Am I making sense? > > Because if we do this, we won't have to pay attention to any get_user() > callers and whatnot - we simply ignore and the solution is simple and > you won't have to touch any get_user() callers... Changing get_user() is a can of worms. I don't think it's a very big can. Perhaps two or three dozen places where code needs to change to account for the -ENXIO return ... but touching a bunch of different subsystems it is likely to take a while to get everyone in agreement. I'll try out this new approach, and if it works, I'll post a v3 patch. Thanks -Tony
RE: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
>> Maybe the other difference in approach between Andy and me is whether to >> go for a solution that covers all the corner cases, or just make an >> incremental >> improvement that allows for recovery in some useful subset of remaining fatal >> cases, but still dies in other cases. > > Does that mean more core code surgery? Yes. I need to look at other user access inside pagefault_disable/enable() as likely spots where the code may continue after a machine check and retry the access. So expect some more "if (ret == -ENXIO) { do something to give up gracefully }" >> I'm happy to replace error messages with ones that are more descriptive and >> helpful to humans. > > Yap, that: "Multiple copyin" with something more understandable to users... I'll work on it. We tend not to have essay length messages as panic() strings. But I can add a comment in the code there so that people who grep whatever panic message we choose can get more details on what happened and what to do. -Tony
RE: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
> I think you guys have veered off into the weeds with this. The current > solution - modulo error messages not destined for humans :) - is not soo > bad, considering the whole MCA trainwreck. > > Or am I missing something completely unacceptable? Maybe the other difference in approach between Andy and me is whether to go for a solution that covers all the corner cases, or just make an incremental improvement that allows for recovery in some useful subset of remaining fatal cases, but still dies in other cases. I'm happy to replace error messages with ones that are more descriptive and helpful to humans. -Tony
Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
On Tue, Jan 12, 2021 at 02:04:55PM -0800, Andy Lutomirski wrote: > > But we know that the fault happened in a get_user() or copy_from_user() call > > (i.e. an RIP with an extable recovery address). Does context switch > > access user memory? > > No, but NMI can. > > The case that would be very very hard to deal with is if we get an NMI just > before IRET/SYSRET and get #MC inside that NMI. > > What we should probably do is have a percpu list of pending memory failure > cleanups and just accept that we’re going to sometimes get a second MCE (or > third or fourth) before we can get to it. > > Can we do the cleanup from an interrupt? IPI-to-self might be a credible > approach, if so. You seem to be looking for a solution that is entirely contained within the machine check handling code. Willing to allow for repeated machine checks from the same poison address in order to achieve that. I'm opposed to multiple machine checks. Willing to make some changes in core code to avoid repeated access to the same poison location. We need a tie-breaker. -Tony
Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
On Tue, Jan 12, 2021 at 10:57:07AM -0800, Andy Lutomirski wrote: > On Tue, Jan 12, 2021 at 10:24 AM Luck, Tony wrote: > > > > On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote: > > > Well, we need to do *something* when the first __get_user() trips the > > > #MC. It would be nice if we could actually fix up the page tables > > > inside the #MC handler, but, if we're in a pagefault_disable() context > > > we might have locks held. Heck, we could have the pagetable lock > > > held, be inside NMI, etc. Skipping the task_work_add() might actually > > > make sense if we get a second one. > > > > > > We won't actually infinite loop in pagefault_disable() context -- if > > > we would, then we would also infinite loop just from a regular page > > > fault, too. > > > > Fixing the page tables inside the #MC handler to unmap the poison > > page would indeed be a good solution. But, as you point out, not possible > > because of locks. > > > > Could we take a more drastic approach? We know that in this case the kernel > > is accessing a user address for the current process. Could the machine > > check handler just re-write %cr3 to point to a kernel-only page table[1]? > > I.e. unmap the entire current user process. > > That seems scary, especially if we're in the middle of a context > switch when this happens. We *could* make it work, but I'm not at all > convinced it's wise. Scary? It's terrifying! But we know that the fault happened in a get_user() or copy_from_user() call (i.e. an RIP with an extable recovery address). Does context switch access user memory? -Tony
Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote: > Well, we need to do *something* when the first __get_user() trips the > #MC. It would be nice if we could actually fix up the page tables > inside the #MC handler, but, if we're in a pagefault_disable() context > we might have locks held. Heck, we could have the pagetable lock > held, be inside NMI, etc. Skipping the task_work_add() might actually > make sense if we get a second one. > > We won't actually infinite loop in pagefault_disable() context -- if > we would, then we would also infinite loop just from a regular page > fault, too. Fixing the page tables inside the #MC handler to unmap the poison page would indeed be a good solution. But, as you point out, not possible because of locks. Could we take a more drastic approach? We know that in this case the kernel is accessing a user address for the current process. Could the machine check handler just re-write %cr3 to point to a kernel-only page table[1]? I.e. unmap the entire current user process. Then any subsequent access to user space will page fault. Maybe have a flag in the task structure to help the #PF handler understand what just happened. The code we execute in the task_work handler can restore %cr3. -Tony [1] Does such a thing already exist? If not, I'd need some help/pointers to create it.
Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
On Tue, Jan 12, 2021 at 09:00:14AM -0800, Andy Lutomirski wrote: > > On Jan 11, 2021, at 2:21 PM, Luck, Tony wrote: > > > > On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote: > >> > >>>> On Jan 11, 2021, at 1:45 PM, Tony Luck wrote: > >>> > >>> Recovery action when get_user() triggers a machine check uses the fixup > >>> path to make get_user() return -EFAULT. Also queue_task_work() sets up > >>> so that kill_me_maybe() will be called on return to user mode to send a > >>> SIGBUS to the current process. > >>> > >>> But there are places in the kernel where the code assumes that this > >>> EFAULT return was simply because of a page fault. The code takes some > >>> action to fix that, and then retries the access. This results in a second > >>> machine check. > >>> > >>> While processing this second machine check queue_task_work() is called > >>> again. But since this uses the same callback_head structure that > >>> was used in the first call, the net result is an entry on the > >>> current->task_works list that points to itself. > >> > >> Is this happening in pagefault_disable context or normal sleepable fault > >> context? If the latter, maybe we should reconsider finding a way for the > >> machine check code to do its work inline instead of deferring it. > > > > The first machine check is in pagefault_disable() context. > > > > static int get_futex_value_locked(u32 *dest, u32 __user *from) > > { > >int ret; > > > >pagefault_disable(); > >ret = __get_user(*dest, from); > > I have very mixed feelings as to whether we should even try to recover > from the first failure like this. If we actually want to try to > recover, perhaps we should instead arrange for the second MCE to > recover successfully instead of panicking. Well we obviously have to "recover" from the first machine check in order to get to the second. Are you saying you don't like the different return value from get_user()? 
In my initial playing around with this I just had the second machine check simply skip the task_work_add(). This worked for this case, but only because there wasn't a third, fourth, etc. access to the poisoned data. If the caller keeps peeking, then we'll keep taking more machine checks - possibly forever. Even if we do recover with just one extra machine check ... that's one more than was necessary. -Tony
Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote: > > > On Jan 11, 2021, at 1:45 PM, Tony Luck wrote: > > > > Recovery action when get_user() triggers a machine check uses the fixup > > path to make get_user() return -EFAULT. Also queue_task_work() sets up > > so that kill_me_maybe() will be called on return to user mode to send a > > SIGBUS to the current process. > > > > But there are places in the kernel where the code assumes that this > > EFAULT return was simply because of a page fault. The code takes some > > action to fix that, and then retries the access. This results in a second > > machine check. > > > > While processing this second machine check queue_task_work() is called > > again. But since this uses the same callback_head structure that > > was used in the first call, the net result is an entry on the > > current->task_works list that points to itself. > > Is this happening in pagefault_disable context or normal sleepable fault > context? If the latter, maybe we should reconsider finding a way for the > machine check code to do its work inline instead of deferring it. The first machine check is in pagefault_disable() context. static int get_futex_value_locked(u32 *dest, u32 __user *from) { int ret; pagefault_disable(); ret = __get_user(*dest, from); pagefault_enable(); return (ret == -ENXIO) ? ret : ret ? -EFAULT : 0; } -Tony
RE: [PATCH 2/2] futex, x86/mce: Avoid double machine checks
> Yeah, saw that, but that should be trivially fixable I'm thinking. Trivial, maybe ... but then follows the audit of other get_user() calls: git grep get_user\( | wc -l 2003 :-( -Tony
RE: [PATCH 2/2] futex, x86/mce: Avoid double machine checks
> I think this is horrid; why can't we have it return something different > then -EFAULT instead? I did consider this ... but it appears that architectures aren't unified in the return value from get_user() Here's another function involved in the futex call chain leading to this: static int get_futex_value_locked(u32 *dest, u32 __user *from) { int ret; pagefault_disable(); ret = __get_user(*dest, from); pagefault_enable(); return ret ? -EFAULT : 0; } It seems like the expectation here is just "zero or not" and we don't care what the "not" value is ... just turn it into -EFAULT. -Tony
RE: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs
> Please see below for an updated patch. Yes. That worked: [ 78.946069] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143 [ 78.946151] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143 [ 78.946153] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler I guess that more than one CPU hit the timeout and so your new message was printed twice before the panic code took over? Once again, the whole of socket 1 is MIA rather than just the pair of threads on one of the cores there. But that's a useful improvement (eliminating the other three sockets on this system). Tested-by: Tony Luck -Tony
Re: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs
On Wed, Jan 06, 2021 at 11:17:08AM -0800, Paul E. McKenney wrote: > On Wed, Jan 06, 2021 at 06:39:30PM +0000, Luck, Tony wrote: > > > The "Timeout: Not all CPUs entered broadcast exception handler" message > > > will appear from time to time given enough systems, but this message does > > > not identify which CPUs failed to enter the broadcast exception handler. > > > This information would be valuable if available, for example, in order to > > > correlated with other hardware-oriented error messages. This commit > > > therefore maintains a cpumask_t of CPUs that have entered this handler, > > > and prints out which ones failed to enter in the event of a timeout. > > > > I tried doing this a while back, but found that in my test case where I > > forced > > an error that would cause both threads from one core to be "missing", the > > output was highly unpredictable. Some random number of extra CPUs were > > reported as missing. After I added some extra breadcrumbs it became clear > > that pretty much all the CPUs (except the missing pair) entered > > do_machine_check(), > > but some got hung up at various points beyond the entry point. My only > > theory > > was that they were trying to snoop caches from the dead core (or access some > > other resource held by the dead core) and so they hung too. > > > > Your code is much neater than mine ... and perhaps works in other cases, but > > maybe the message needs to allow for the fact that some of the cores that > > are reported missing may just be collateral damage from the initial problem. > > Understood. The system is probably not in the best shape if this code > is ever executed, after all. ;-) > > So how about like this? > > pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n", That looks fine. > > Easy enough if so! > > > If I get time in the next day or two, I'll run my old test against your > > code to > > see what happens. 
I got time today (plenty of meetings in which to run experiments in background). This code:

-	if (mca_cfg.tolerant <= 1)
+	if (mca_cfg.tolerant <= 1) {
+		if (!cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
+			pr_info("%s: MCE holdout CPUs: %*pbl\n",
+				__func__, cpumask_pr_args(&mce_missing_cpus));
 		mce_panic(msg, NULL, NULL);

didn't trigger ... so maybe that cpumask_andnot() didn't return the value you expected? I added a:

+	pr_info("%s: MCE present CPUs: %*pbl\n", __func__, cpumask_pr_args(&mce_present_cpus));

to check that the mask was being set correctly, and saw:

[ 219.329767] mce: mce_timed_out: MCE present CPUs: 0-23,48-119,144-191

So every core of socket 1 failed to show up for this test.

> For my own testing, is this still the right thing to use?
>
> https://github.com/andikleen/mce-inject

That fakes up errors (by hooking into the mce_rdmsr() code to return arbitrary user supplied values). The plus side of this is that you can fake any error signature without needing special h/w or f/w. The downside is that it is all fake and you can't create situations where some CPUs don't show up in the machine check handler.

-Tony
RE: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs
> The "Timeout: Not all CPUs entered broadcast exception handler" message > will appear from time to time given enough systems, but this message does > not identify which CPUs failed to enter the broadcast exception handler. > This information would be valuable if available, for example, in order to > correlated with other hardware-oriented error messages. This commit > therefore maintains a cpumask_t of CPUs that have entered this handler, > and prints out which ones failed to enter in the event of a timeout. I tried doing this a while back, but found that in my test case where I forced an error that would cause both threads from one core to be "missing", the output was highly unpredictable. Some random number of extra CPUs were reported as missing. After I added some extra breadcrumbs it became clear that pretty much all the CPUs (except the missing pair) entered do_machine_check(), but some got hung up at various points beyond the entry point. My only theory was that they were trying to snoop caches from the dead core (or access some other resource held by the dead core) and so they hung too. Your code is much neater than mine ... and perhaps works in other cases, but maybe the message needs to allow for the fact that some of the cores that are reported missing may just be collateral damage from the initial problem. If I get time in the next day or two, I'll run my old test against your code to see what happens. -Tony
Re: [PATCH] ia64: fix xchg() warning
On Tue, Jan 05, 2021 at 02:17:41PM +0100, Arnd Bergmann wrote: > On Mon, Jan 4, 2021 at 5:00 PM Luck, Tony wrote: > > > > > I have not received any reply from the ia64 maintainers, I assume they > > > were > > > both out of office for Christmas. > > > > I'm back in the office ... but have no working ia64 machines, nor time to > > look at patches :-( > > > > Should drop me from the MAINTAINTERS file. > > If you like, I can apply the patch below and take that through my > asm-generic tree along with the two bug fixes I sent: > > diff --git a/MAINTAINERS b/MAINTAINERS > index 0f2e55faaf7f..b74093803154 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -8432,11 +8432,8 @@ F: drivers/i3c/ > F: include/linux/i3c/ > > IA64 (Itanium) PLATFORM > -M: Tony Luck > -M: Fenghua Yu > L: linux-i...@vger.kernel.org > -S: Odd Fixes > -T: git git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git > +S: Orphan > F: Documentation/ia64/ > F: arch/ia64/ > > Is that what you had in mind? I see that Fenghua Yu has not been > actively involved for a long time. If you are both out, that would > make the port unmaintained, but that may actually help find someone > else to either volunteer as a maintainer or pay someone if they > have a commercial interest. Yes. Fenghua has moved to working on other things, so that looks good. Acked-by: Tony Luck -Tony
RE: [PATCH] ia64: fix xchg() warning
> I have not received any reply from the ia64 maintainers, I assume they were
> both out of office for Christmas.

I'm back in the office ... but have no working ia64 machines, nor time to
look at patches :-(

Should drop me from the MAINTAINERS file.

-Tony
RE: linux-next: Tree for Nov 19 (drivers/edac/igen6_edac.c)
> ../drivers/edac/igen6_edac.c: In function 'ecclog_nmi_handler':
> ../drivers/edac/igen6_edac.c:525:10: error: 'NMI_DONE' undeclared (first use
> in this function); did you mean 'DMI_NONE'?
>    return NMI_DONE;

This driver has a #include <linux/nmi.h>

But inside that file it says:

	#if defined(CONFIG_HAVE_NMI_WATCHDOG)
	#include <asm/nmi.h>
	#endif

and the randconfig used doesn't set CONFIG_HAVE_NMI_WATCHDOG.

Some options:

1) Drop that #ifdef from <linux/nmi.h>. It was introduced as part of this
   commit:

     f2e0cff85ed1 ("kernel/watchdog: introduce arch_touch_nmi_watchdog()")

   presumably for some good reason.

2) Make this edac driver select CONFIG_HAVE_NMI_WATCHDOG

   Yuck!

3) Make this driver #include <asm/nmi.h> instead of <linux/nmi.h>

   This fixes this build error, but I thought that general policy was to
   use the <linux/*.h> header if it exists rather than the <asm/*.h> one.
   Maybe that's ok here because this is an x86 specific driver?

I'm leaning toward option #3.

-Tony
RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
> Of course is this not only an x86 problem. Every architecture which
> supports virtualization has the same issue. ARM(64) has no way to tell
> for sure whether the machine runs bare metal either. No idea about the
> other architectures.

Sounds like a hypervisor problem. If the VMM provides perfect emulation
of every weird quirk of h/w, then it is OK to let the guest believe that
it is running on bare metal. If it isn't perfect, then it should make
sure the guest knows *for sure*, so that the guest can take appropriate
actions to avoid the sharp edges.

-Tony
[PATCH v2] x86/mce: Use "safe" MSR functions when enabling additional error logging
Booting as a guest under KVM results in error messages about unchecked
MSR access:

[    6.814328][    T0] unchecked MSR access error: RDMSR from 0x17f at rIP:
0x84483f16 (mce_intel_feature_init+0x156/0x270)

because KVM doesn't provide emulation for random model specific registers.

Switch to using rdmsrl_safe()/wrmsrl_safe() to avoid the message.

Reported-by: Qian Cai
Fixes: 68299a42f842 ("x86/mce: Enable additional error logging on certain Intel CPUs")
Signed-off-by: Tony Luck
---
 arch/x86/kernel/cpu/mce/intel.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index b47883e364b4..42e60ef16c3a 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -521,9 +521,10 @@ static void intel_imc_init(struct cpuinfo_x86 *c)
 	case INTEL_FAM6_SANDYBRIDGE_X:
 	case INTEL_FAM6_IVYBRIDGE_X:
 	case INTEL_FAM6_HASWELL_X:
-		rdmsrl(MSR_ERROR_CONTROL, error_control);
+		if (rdmsrl_safe(MSR_ERROR_CONTROL, &error_control))
+			return;
 		error_control |= 2;
-		wrmsrl(MSR_ERROR_CONTROL, error_control);
+		wrmsrl_safe(MSR_ERROR_CONTROL, error_control);
 		break;
 	}
 }
-- 
2.21.1
RE: [PATCH] x86/mce: Check for hypervisor before enabling additional error logging
I still think there is a reasonable case to be made that checking
whether it is running as a guest is the right thing to do here.

Look at what it is trying to do ... change the behavior of the platform
w.r.t. logging of memory errors. How does that make any sense for a
guest ... one that doesn't even know what memory is present on the
platform, or have guarantees that what it sees as memory address
0x12345678 maps to the same set of cells in a DRAM from one second to
the next?

-Tony
RE: [tip: ras/core] x86/mce: Enable additional error logging on certain Intel CPUs
> I wouldn't expect all hypervisors to necessarily set CPUID.01H:ECX[bit
> 31]. Architecturally, on Intel CPUs, that bit is simply defined as
> "not used." There is no documented contract between Intel and
> hypervisor vendors regarding the use of that bit. (AMD, on the other
> hand, *does* document that bit as "reserved for use by hypervisor to
> indicate guest status.")

Maybe no contract ... but a bunch of places (many of them in Intel
specific code) check for it. Perhaps I should poke the owners of the
Intel SDM to accept the inevitable.

$ git grep "boot_cpu_has(X86_FEATURE_HYPERVISOR"
arch/x86/events/core.c:         if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
arch/x86/events/intel/core.c:   if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/events/intel/core.c:   int assume = 3 * !boot_cpu_has(X86_FEATURE_HYPERVISOR);
arch/x86/events/intel/cstate.c: if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/events/intel/uncore.c: if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/apic/apic.c:    if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/cpu/bugs.c:     if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/cpu/bugs.c:     else if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/cpu/bugs.c:     if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
arch/x86/kernel/cpu/bugs.c:     if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
arch/x86/kernel/cpu/intel.c:    if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/cpu/mshyperv.c: if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/cpu/vmware.c:    * If !boot_cpu_has(X86_FEATURE_HYPERVISOR), vmware_hypercall_mode
arch/x86/kernel/cpu/vmware.c:   if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
arch/x86/kernel/jailhouse.c:    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/kvm.c:          if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/paravirt.c:     if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/tsc.c:          if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
arch/x86/mm/init_64.c:          if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
drivers/acpi/processor_idle.c:  if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h:       return boot_cpu_has(X86_FEATURE_HYPERVISOR);
drivers/gpu/drm/i915/i915_memcpy.c:     !boot_cpu_has(X86_FEATURE_HYPERVISOR))
drivers/gpu/drm/radeon/radeon_device.c: return boot_cpu_has(X86_FEATURE_HYPERVISOR);
drivers/visorbus/visorchipset.c:        if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {

-Tony
[PATCH] x86/mce: Check for hypervisor before enabling additional error logging
Booting as a guest under KVM results in error messages about unchecked
MSR access:

[    6.814328][    T0] unchecked MSR access error: RDMSR from 0x17f at rIP:
0x84483f16 (mce_intel_feature_init+0x156/0x270)

because KVM doesn't provide emulation for random model specific registers.

Check for X86_FEATURE_HYPERVISOR and skip trying to enable the mode (a
guest shouldn't be concerned with corrected errors anyway).

Reported-by: Qian Cai
Fixes: 68299a42f842 ("x86/mce: Enable additional error logging on certain Intel CPUs")
Signed-off-by: Tony Luck
---
 arch/x86/kernel/cpu/mce/intel.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index b47883e364b4..7f7d863400b7 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -517,6 +517,9 @@ static void intel_imc_init(struct cpuinfo_x86 *c)
 {
 	u64 error_control;
 
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+		return;
+
 	switch (c->x86_model) {
 	case INTEL_FAM6_SANDYBRIDGE_X:
 	case INTEL_FAM6_IVYBRIDGE_X:
-- 
2.21.1
RE: [tip: ras/core] x86/mce: Enable additional error logging on certain Intel CPUs
> I thought Linux had long ago gone the route of turning rdmsr/wrmsr
> into rdmsr_safe/wrmsr_safe, so that the guest would ignore the #GPs on
> writes and return zero to the caller for #GPs on reads.

Linux just switched that around for the machine check banks ... if they
#GP fault, then something is seriously wrong. Maybe that isn't a general
change of direction though.

Perhaps I should use rdmsrl_safe() in this code. Or (better?) add

	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		return;

to the start of intel_imc_init().

-Tony
RE: [tip: ras/core] x86/mce: Enable additional error logging on certain Intel CPUs
What does KVM do with model specific MSRs?

Looks like you let the guest believe it was running on one of Sandy
Bridge, Ivy Bridge, Haswell (Xeon). So the core MCE code tried to enable
extended error reporting.

If there is a mode to have KVM let the guest think that it read/wrote
MSR 0x17F, but actually doesn't do it ... that would seem to be a
reasonable thing to do here.

-Tony
Re: [PATCH] x86/mce: Enable additional error logging on certain Intel CPUs
On Fri, Oct 30, 2020 at 12:04:03PM -0700, Luck, Tony wrote:

Bah, didn't notice this conversation didn't include LKML.

> The Xeon versions of Sandy Bridge, Ivy Bridge and Haswell support an
> optional additional error logging mode which is enabled by an MSR.
>
> Previously this mode was enabled from the mcelog(8) tool via /dev/cpu,
> but the kernel is now very picky about which MSRs may be written. So
> move the enabling into the kernel.
>
> Suggested-by: Boris Petkov
> Signed-off-by: Tony Luck
> ---
>
> N.B. I don't have any of these old systems in my lab any more. So
> this is untested :-(
>
>  arch/x86/include/asm/msr-index.h |  1 +
>  arch/x86/kernel/cpu/mce/intel.c  | 20 +
>  2 files changed, 21 insertions(+)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 972a34d93505..b2dd2648c0e2 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -139,6 +139,7 @@
>  #define MSR_IA32_MCG_CAP		0x0179
>  #define MSR_IA32_MCG_STATUS		0x017a
>  #define MSR_IA32_MCG_CTL		0x017b
> +#define MSR_ERROR_CONTROL		0x017f
>  #define MSR_IA32_MCG_EXT_CTL		0x04d0
>
>  #define MSR_OFFCORE_RSP_0		0x01a6
> diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
> index abe9fe0fb851..b47883e364b4 100644
> --- a/arch/x86/kernel/cpu/mce/intel.c
> +++ b/arch/x86/kernel/cpu/mce/intel.c
> @@ -509,12 +509,32 @@ static void intel_ppin_init(struct cpuinfo_x86 *c)
>  	}
>  }
>
> +/*
> + * Enable additional error logs from the integrated
> + * memory controller on processors that support this.
> + */
> +static void intel_imc_init(struct cpuinfo_x86 *c)
> +{
> +	u64 error_control;
> +
> +	switch (c->x86_model) {
> +	case INTEL_FAM6_SANDYBRIDGE_X:
> +	case INTEL_FAM6_IVYBRIDGE_X:
> +	case INTEL_FAM6_HASWELL_X:
> +		rdmsrl(MSR_ERROR_CONTROL, error_control);
> +		error_control |= 2;
> +		wrmsrl(MSR_ERROR_CONTROL, error_control);
> +		break;
> +	}
> +}
> +
>  void mce_intel_feature_init(struct cpuinfo_x86 *c)
>  {
>  	intel_init_thermal(c);
>  	intel_init_cmci();
>  	intel_init_lmce();
>  	intel_ppin_init(c);
> +	intel_imc_init(c);
>  }
>
>  void mce_intel_feature_clear(struct cpuinfo_x86 *c)
> --
> 2.21.1
>
[PLEASE PULL] ia64 for v5.10
The following changes since commit f4d51dffc6c01a9e94650d95ce0104964f8ae822:

  Linux 5.9-rc4 (2020-09-06 17:11:40 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git tags/ia64_for_5.10

for you to fetch changes up to c331649e637152788b0ca1c857d0c2eaf34fcbc3:

  ia64: Use libata instead of the legacy ide driver in defconfigs (2020-09-11 09:38:00 -0700)

Cleanups by Christoph:

1) Switch to libata instead of legacy ide driver
2) Drop ia64 perfmon (it's been broken for a while)

Christoph Hellwig (2):
      ia64: Remove perfmon
      ia64: Use libata instead of the legacy ide driver in defconfigs

 arch/ia64/Kconfig                     |    9 -
 arch/ia64/configs/bigsur_defconfig    |    8 +-
 arch/ia64/configs/generic_defconfig   |   10 +-
 arch/ia64/configs/gensparse_defconfig |   10 +-
 arch/ia64/configs/tiger_defconfig     |   10 +-
 arch/ia64/configs/zx1_defconfig       |    8 +-
 arch/ia64/include/asm/processor.h     |   10 -
 arch/ia64/include/asm/switch_to.h     |   10 +-
 arch/ia64/kernel/Makefile             |    3 +-
 arch/ia64/kernel/irq_ia64.c           |    7 -
 arch/ia64/kernel/perfmon.c            | 6703 -
 arch/ia64/kernel/process.c            |   53 -
 arch/ia64/kernel/ptrace.c             |   24 -
 arch/ia64/kernel/smpboot.c            |    8 -
 arch/ia64/kernel/syscalls/syscall.tbl |    2 +-
 arch/ia64/lib/Makefile                |    1 -
 arch/ia64/lib/carta_random.S          |   55 -
 arch/ia64/oprofile/Makefile           |    1 -
 arch/ia64/oprofile/init.c             |   12 +-
 arch/ia64/oprofile/perfmon.c          |   99 -
 20 files changed, 22 insertions(+), 7021 deletions(-)
 delete mode 100644 arch/ia64/kernel/perfmon.c
 delete mode 100644 arch/ia64/lib/carta_random.S
 delete mode 100644 arch/ia64/oprofile/perfmon.c
RE: [PATCH v3 4/6] x86/mce: Avoid tail copy when machine check terminated a copy from user
>> Machine checks are more serious. Just give up at the point where the
>> main copy loop triggered the #MC and return from the copy code as if
>> the copy succeeded. The machine check handler will use task_work_add() to
>> make sure that the task is sent a SIGBUS.
>
> Isn't that just plain wrong?

It isn't pretty. I'm not sure how wrong it is.

> If copy is reported as succeeding the kernel code will use the 'old'
> data that is in the buffer as if it had been read from userspace.
> This could end up with kernel stack data being written to a file.

I ran a test with:

	write(fd, buf, 512)

with poison injected into buf[256] to force a machine check mid-copy.

The size of the file did get incremented by 512 rather than 256. Which
isn't good. The data in the file up to the 256 byte mark was the user
data from buf[0 ... 255]. The data in the file past offset 256 was all
zeroes.

I suspect that isn't by chance. The kernel has to defend against a user
writing a partial page and using mmap(2) on the same file to peek at
data past EOF and up to the next PAGE_SIZE boundary. So I think it must
zero new pages allocated in page cache as they are allocated to a file.

> Even zeroing the rest of the kernel buffer is wrong.

It wouldn't help/change anything.

> IIRC the code to try to maximise the copy has been removed.
> So the 'slow' retry wont happen any more.

Which code has been removed (and when ... TIP, and my testing, is based
on 5.9-rc1)?

-Tony
Re: [PATCH v2 6/7] x86/mce: Recover from poison found while copying from user space
On Mon, Oct 05, 2020 at 06:32:47PM +0200, Borislav Petkov wrote:
> On Wed, Sep 30, 2020 at 04:26:10PM -0700, Tony Luck wrote:
> >  arch/x86/kernel/cpu/mce/core.c | 33 +
> >  include/linux/sched.h          |  2 ++
> >  2 files changed, 23 insertions(+), 12 deletions(-)
>
> Isn't that just simpler?

Yes. A helper function avoids making the code a mess of if/else with
subtle fall through cases.

> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 4d2cf08820af..dc6c83aa2ec1 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1261,6 +1261,21 @@ static void kill_me_maybe(struct callback_head *cb)
>  	kill_me_now(cb);
>  }
>
> +static inline void queue_task_work(struct mce *m, int kill_it)

Does it need to be "inline" though? I hope machine check processing is
never in the critical code path for anyone!

> +{
> +	current->mce_addr = m->addr;
> +	current->mce_kflags = m->kflags;
> +	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> +	current->mce_whole_page = whole_page(m);
> +
> +	if (kill_it)
> +		current->mce_kill_me.func = kill_me_now;
> +	else
> +		current->mce_kill_me.func = kill_me_maybe;
> +
> +	task_work_add(current, &current->mce_kill_me, true);
> +}
> +
>  /*
>   * The actual machine check handler. This only handles real
>   * exceptions when something got corrupted coming in through int 18.
> @@ -1402,13 +1417,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  		/* If this triggers there is no way to recover. Die hard. */
>  		BUG_ON(!on_thread_stack() || !user_mode(regs));
>
> -		current->mce_addr = m.addr;
> -		current->mce_ripv = !!(m.mcgstatus & MCG_STATUS_RIPV);
> -		current->mce_whole_page = whole_page(&m);
> -		current->mce_kill_me.func = kill_me_maybe;
> -		if (kill_it)
> -			current->mce_kill_me.func = kill_me_now;
> -		task_work_add(current, &current->mce_kill_me, true);
> +		queue_task_work(&m, kill_it);
> +
>  	} else {
>  		/*
>  		 * Handle an MCE which has happened in kernel space but from
> @@ -1423,6 +1433,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  			if (!fixup_exception(regs, X86_TRAP_MC, 0, 0))
>  				mce_panic("Failed kernel mode recovery", &m, msg);
>  		}
> +
> +		if (m.kflags & MCE_IN_KERNEL_COPYIN)
> +			queue_task_work(&m, kill_it);
>  	}
> out:
>  	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);

-Tony
Re: [PATCH 0/3] x86: Add initial support to discover Intel hybrid CPUs
On Sat, Oct 03, 2020 at 03:39:29AM +0200, Thomas Gleixner wrote:
> On Fri, Oct 02 2020 at 13:19, Ricardo Neri wrote:
> > Add support to discover and enumerate CPUs in Intel hybrid parts. A hybrid
> > part has CPUs with more than one type of micro-architecture. Thus, certain
> > features may only be present in a specific CPU type.
> >
> > It is useful to know the type of CPUs present in a system. For instance,
> > perf may need to handle CPUs differently depending on the type of micro-
> > architecture. Decoding machine check error logs may need the additional
> > micro-architecture type information, so include that in the log.
>
> 'It is useful' as justification just makes me barf.

This isn't "hetero" ... all of the cores are architecturally the same.
If CPUID says that some feature is supported, then it will be supported
on all of the cores.

There might be some model specific performance counter events that only
apply to some cores. Or a machine check error code that is logged in the
model specific MSCOD field of IA32_MCi_STATUS. But any and all code can
run on any core.

Sure there will be some different power/performance tradeoffs on some
cores. But we already have that with some cores able to achieve higher
turbo frequencies than others.

-Tony