RE: [PATCH] x86/mce: Clean up TP_printk() output line of the mce_record tracepoint

2024-03-29 Thread Luck, Tony
>> - While at it, don't forget to:
>> 
>>    s/ADDR/addr/, s/MISC/misc/, s/SYND/synd/
>>
> These are actually acronyms for Address, Miscellaneous and Syndrome registers.

They look like abbreviations, not acronyms to me.

-Tony


RE: [PATCH v4 2/2] tracing: Include Microcode Revision in mce_record tracepoint

2024-03-27 Thread Luck, Tony
> Export microcode version through the tracepoint to prevent ambiguity over
> the active version on the system when the MCE was received.
>
> Signed-off-by: Avadhut Naik 
> Reviewed-by: Sohil Mehta 
> Reviewed-by: Steven Rostedt (Google) 

Reviewed-by: Tony Luck 



RE: [PATCH v4 1/2] tracing: Include PPIN in mce_record tracepoint

2024-03-27 Thread Luck, Tony
> Export PPIN through the tracepoint as it provides a unique identifier for
> the system (or socket in case of multi-socket systems) on which the MCE
> has been received.
>
> Signed-off-by: Avadhut Naik 
> Reviewed-by: Sohil Mehta 
> Reviewed-by: Steven Rostedt (Google) 

Reviewed-by: Tony Luck 



RE: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Luck, Tony
> But no, that's not the right question to ask.
>
> It is rather: which bits of information are very relevant to an error
> record and which are transient enough so that they cannot be gathered
> from a system by other means or only gathered in a difficult way, and
> should be part of that record.
>
> The PPIN is not transient but you have to go map ->extcpu to the PPIN so
> adding it to the tracepoint is purely a convenience thing. More or less.
>
> The microcode revision thing I still don't buy but it is already there
> so whateva...
>
> So we'd need a rule hammered out and put there in a prominent place so
> that it is clear what goes into struct mce and what not.

My personal evaluation of the value of these two additions to the trace record:

PPIN: Nice to have. People that send stuff to me are terrible about providing
surrounding details. The record already includes CPUID(1).EAX ... so I can at
least skip the step of asking them which CPU family/model/stepping they were
using. But PPIN can be recovered (so long as the submitter kept good records
about which system generated the record).

MICROCODE: Must have. Microcode version can be changed at run time. Going
back to the system to check later may not give the correct answer to what was
active at the time of the error. Especially for an error reported while a
microcode update is walking across the CPUs poking the MSR on each in turn.

-Tony


RE: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Luck, Tony
> > Is it so very different to add this to a trace record so that rasdaemon
> > can have feature parity with mcelog(8)?
>
> I knew you were gonna say that. When someone decides that it is
> a splendid idea to add more stuff to struct mce then said someone would
> want it in the tracepoint too.
>
> And then we're back to my original question:
>
> "And where does it end? Stick full dmesg in the tracepoint too?"
>
> Where do you draw the line in the sand and say, no more, especially
> static, fields bloating the trace record should be added and from then
> on, you should go collect the info from that box. Something which you're
> supposed to do anyway.

Every patch that adds new code or data structures adds to the kernel
memory footprint. Each should be considered on its merits. The basic
question being:

   "Is the new functionality worth the cost?"

Where does it end? It would end if Linus declared:

  "Linux is now complete. Stop sending patches".

I.e. it is never going to end.

If somebody posts a patch asking to add the full dmesg to a
tracepoint, I'll stand with you to say: "Not only no, but hell no".

So for Naik's two patches we have:

1) PPIN
Cost = 8 bytes.
Benefit: Embeds a system identifier into the trace record so there
can be no ambiguity about which machine generated this error.
Also definitively indicates which socket on a multi-socket system.

2) MICROCODE
Cost = 4 bytes
Benefit: Certainty about the microcode version active on the core
at the time the error was detected.

RAS = Reliability, Availability, Serviceability

These changes fall into the serviceability bucket. They make it
easier to diagnose what went wrong.


-Tony


RE: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Luck, Tony
> > You've spent enough time with Ashok and Thomas tweaking the Linux
> > microcode driver to know that going back to the machine the next day
> > to ask about microcode version has a bunch of ways to get a wrong
> > answer.
>
> Huh, what does that have to do with this?

If deployment of a microcode update across a fleet always went
flawlessly, life would be simpler. But things can fail. And maybe the
failure wasn't noticed. Perhaps a node was rebooting when the
sysadmin pushed the update to the fleet and so missed the
deployment. Perhaps one core was already acting weird and
the microcode update didn't get applied to that core.

> IIUC, if someone changes something on the system, whether that is
> updating microcode or swapping a harddrive or swapping memory or
> whatever, that invalidates the errors reported, pretty much.
>
> You can't put it all in the trace record, you just can't.

Swapping a hard drive, or hot plugging a NIC isn't very likely
to correlate with an error reported by the CPU in machine
check banks. But microcode can be (and has been) the issue
in enough cases that knowing the version at the time of the
error matters.

You seemed to agree with this argument when the microcode
field was added to "struct mce" back in 2018

fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check records")

Is it so very different to add this to a trace record so that rasdaemon
can have feature parity with mcelog(8)?

-Tony


RE: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Luck, Tony
> > 8 bytes for PPIN, 4 more for microcode.
>
> I know, nothing leads to bloat like 0.01% here, 0.001% there...

12 extra bytes divided by (say) 64GB (a very small server these days, my
laptop has that much)
   = 12 / (64 * 2^30) ≈ 0.00000001746%

We will need 57000 changes like this one before we get to 0.001% :-)

> > Number of recoverable machine checks per system ... I hope the
> > monthly rate should be countable on my fingers...
>
> That's not the point. Rather, when you look at MCE reports, you pretty
> much almost always go and collect additional information from the target
> machine because you want to figure out what exactly is going on.
>
> So what's stopping you from collecting all that static information
> instead of parrotting it through the tracepoint with each error?

PPIN is static. So with good tracking to keep source platform information
attached to the error record as it gets passed around people trying to triage
the problem, you could say it can be retrieved later (or earlier, when setting
up a database of attributes of each machine in the fleet).

But the key there is keeping the details of the source machine attached to
the error record. My first contact with machine check debugging is always
just the raw error record (from mcelog, rasdaemon, or console log).

> > PPIN is useful when talking to the CPU vendor about patterns of
> > similar errors seen across a cluster.
>
> I guess that is perhaps the only thing of the two that makes some sense
> at least - the identifier uniquely describes which CPU the error comes
> from...
>
> > MICROCODE - gives a fast path to root cause problems that have already
> > been fixed in a microcode update.
>
> But that, nah. See above.

Knowing which microcode version was loaded on a core *at the time of the error*
is critical. You've spent enough time with Ashok and Thomas tweaking the Linux
microcode driver to know that going back to the machine the next day to ask
about microcode version has a bunch of ways to get a wrong answer.

-Tony


RE: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-25 Thread Luck, Tony
> > The first patch adds PPIN (Protected Processor Inventory Number) field to
> > the tracepoint.
> >
> > The second patch adds the microcode field (Microcode Revision) to the
> > tracepoint.
>
> This is a lot of static information to add to *every* MCE.

8 bytes for PPIN, 4 more for microcode.

Number of recoverable machine checks per system ... I hope the monthly rate
should be countable on my fingers. If a system is getting more than that, then
people should be looking at fixing the underlying problem.

Corrected errors are much more common. Though Linux takes action to limit the
rate when storms occur. So maybe hundreds or small numbers of thousands of
error trace records? Increase in trace buffer consumption still measured in
Kbytes not Mbytes. Server systems that do machine check reporting now start at
tens of GBytes memory.

> And where does it end? Stick full dmesg in the tracepoint too?

Seems like overkill.

> What is the real-life use case here?

Systems using rasdaemon to track errors will be able to track both of these
(I assume that Naik has plans to update rasdaemon to capture and save these
new fields).

PPIN is useful when talking to the CPU vendor about patterns of similar errors
seen across a cluster.

MICROCODE - gives a fast path to root cause problems that have already
been fixed in a microcode update.

-Tony


RE: [PATCH] printk: add cpu id information to printk() output

2023-09-15 Thread Luck, Tony
> + return in_task() ? task_pid_nr(current) | (smp_processor_id() << CPU_ID_SHIFT) :

There are contexts and CONFIG options around pre-emption where
smp_processor_id() will throw a warning. Use raw_smp_processor_id().
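
I.e. the quoted line with just that substitution applied (CPU_ID_SHIFT and the
surrounding expression come from the posted patch, not from me):

	return in_task() ? task_pid_nr(current) | (raw_smp_processor_id() << CPU_ID_SHIFT) :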

-Tony


RE: [PATCH] EDAC/mc_sysfs: refactor deprecated strncpy

2023-09-13 Thread Luck, Tony
> `strncpy` is deprecated for use on NUL-terminated destination strings [1].
>
> We should prefer more robust and less ambiguous string interfaces.
>
> A suitable replacement is `strscpy_pad` [2] due to the fact that it guarantees
> NUL-termination on the destination buffer whilst maintaining the
> NUL-padding behavior that `strncpy` provides. This may not be strictly
> necessary but as I couldn't understand what this code does I wanted to
> ensure that the functionality is the same.
>
> Link: 
> https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
>  [1]
> Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html 
> [2]
> Link: https://github.com/KSPP/linux/issues/90
> Cc: linux-harden...@vger.kernel.org
> Signed-off-by: Justin Stitt 
> ---
> Note: build-tested only.
> ---
>  drivers/edac/edac_mc_sysfs.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/edac/edac_mc_sysfs.c b/drivers/edac/edac_mc_sysfs.c
> index 15f63452a9be..b303309a63cf 100644
> --- a/drivers/edac/edac_mc_sysfs.c
> +++ b/drivers/edac/edac_mc_sysfs.c
> @@ -229,8 +229,7 @@ static ssize_t channel_dimm_label_store(struct device 
> *dev,
> if (copy_count == 0 || copy_count >= sizeof(rank->dimm->label))
> return -EINVAL;
>
> -   strncpy(rank->dimm->label, data, copy_count);
> -   rank->dimm->label[copy_count] = '\0';
> +   strscpy_pad(rank->dimm->label, data, copy_count);

That doc page says the problem with strncpy() is that it doesn't guarantee to
NUL terminate the target string. But this code is aware of that limitation and
zaps a '\0' at the end to be sure.

So this code doesn't suffer from the potential problems.

If it is going to be fixed, then some further analysis of the original code
would be wise. Just replacing with strscpy_pad() means the code probably
still suffers from the "needless performance penalty" also mentioned in
the deprecation document.
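
For comparison, a minimal sketch of the non-padding variant (illustrative
only, not the posted patch; the buffer size and error handling here are
made up):

	char label[32];
	ssize_t n;

	/*
	 * strscpy() copies at most sizeof(label) - 1 bytes, always
	 * NUL-terminates, and returns the number of bytes copied or
	 * -E2BIG on truncation. Unlike strscpy_pad() it does not zero
	 * the remainder of the buffer, so it avoids the extra pass.
	 */
	n = strscpy(label, data, sizeof(label));
	if (n < 0)
		return -EINVAL;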

-Tony



RE: [PATCH v7] x86/mce: retrieve poison range from hardware

2022-08-23 Thread Luck, Tony
> What I'm missing from this text here is, what *is* the mce->misc LSB
> field in human speak? What does that field denote?

The SDM says:

 Recoverable Address LSB (bits 5:0): The lowest valid recoverable address bit.
 Indicates the position of the least significant bit (LSB) of the recoverable
 error address. For example, if the processor logs bits [43:9] of the address,
 the LSB sub-field in IA32_MCi_MISC is 01001b (9 decimal). For this example,
 bits [8:0] of the recoverable error address in IA32_MCi_ADDR should be
 ignored.

So in human speak "how much data did you lose". "6" is a common value saying a
cache line (2^6 == 64 bytes) was lost. Sometimes you see "12" (2^12 == 4096)
for a whole page lost.
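
As a sketch of what that means for a consumer of the record (my illustration,
not kernel code; assume "m" is a struct mce pointer as in the patch):

	unsigned int lsb = m->misc & 0x3f;	/* bits 5:0 = recoverable address LSB */
	u64 bytes = 1ULL << lsb;		/* 6 -> 64 bytes, 12 -> 4096 bytes */
	u64 base = m->addr & ~(bytes - 1);	/* bits below the LSB are not valid */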

-Tony


RE: [PATCH v5] x86/mce: retrieve poison range from hardware

2022-08-01 Thread Luck, Tony
>   struct mce m;
> + int lsb = PAGE_SHIFT;

Some maintainers like to order local declaration lines from longest to shortest
 
> + /*
> +  * Even if the ->validation_bits are set for address mask,
> +  * to be extra safe, check and reject an error radius '0',
> +  * and fallback to the default page size.
> +  */
> + if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) {
> + lsb = __ffs64(mem_err->physical_addr_mask);
> + if (lsb == 1)
> + lsb = PAGE_SHIFT;
> + }


The comment above __ffs64() says:

 * The result is not defined if no bits are set, so check that @word
 * is non-zero before calling this.

So if the intent is "extra safe" should check for that:

	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK &&
	    mem_err->physical_addr_mask) {
		lsb = __ffs64(mem_err->physical_addr_mask);
		if (lsb == 1)
			lsb = PAGE_SHIFT;
	}

-Tony





Re: [PATCH v3] x86/mce: retrieve poison range from hardware

2022-07-18 Thread Luck, Tony
On Mon, Jul 18, 2022 at 09:11:33PM +, Jane Chu wrote:
> On 7/18/2022 12:22 PM, Luck, Tony wrote:
> >> It appears the kernel is trusting that ->physical_addr_mask is non-zero
> >> in other paths. So this is at least equally broken in the presence of a
> >> broken BIOS. The impact is potentially larger though with this change,
> >> so it might be a good follow-on patch to make sure that
> >> ->physical_addr_mask gets fixed up to a minimum mask value.
> > 
> > Agreed. Separate patch to sanitize early, so other kernel code can just use it.
> > 
> 
> Is it possible that with
>if (mem->validation_bits & CPER_MEM_VALID_PA_MASK)
> the ->physical_addr_mask is still untrustworthy?

The validation_bits just show which fields the BIOS *says* it filled in.
If a validation bit isn't set, then Linux should certainly ignore that
field. But if it is set, then Linux needs to decide whether to use the
value, or do a sanity check first.

-Tony



RE: [PATCH v3] x86/mce: retrieve poison range from hardware

2022-07-18 Thread Luck, Tony
> It appears the kernel is trusting that ->physical_addr_mask is non-zero
> in other paths. So this is at least equally broken in the presence of a
> broken BIOS. The impact is potentially larger though with this change,
> so it might be a good follow-on patch to make sure that
> ->physical_addr_mask gets fixed up to a minimum mask value.

Agreed. Separate patch to sanitize early, so other kernel code can just use it.

-Tony



RE: [PATCH v3] x86/mce: retrieve poison range from hardware

2022-07-18 Thread Luck, Tony
+   m.misc = (MCI_MISC_ADDR_PHYS << 6) | __ffs64(mem_err->physical_addr_mask);

Do we want to unconditionally trust the sanity of the BIOS provided 
physical_address_mask?

There's a warning comment on the kernel __ffs64() function:

 * The result is not defined if no bits are set, so check that @word
 * is non-zero before calling this.

Otherwise, this looks like a good idea.

-Tony



Re: Phantom PMEM poison issue

2022-01-21 Thread Luck, Tony
On Sat, Jan 22, 2022 at 12:40:18AM +, Jane Chu wrote:
> On 1/21/2022 4:31 PM, Jane Chu wrote:
> > On baremetal Intel platform with DCPMEM installed and configured to
> > provision daxfs, say a poison was consumed by a load from a user thread,
> > and then daxfs takes action and clears the poison, confirmed by "ndctl
> > -NM".
> > 
> > Now, depends on the luck, after sometime(from a few seconds to 5+ hours)
> > the ghost of the previous poison will surface, and it takes
> > unload/reload the libnvdimm drivers in order to drive the phantom poison
> > away, confirmed by ARS.
> > 
> > Turns out, the issue is quite reproducible with the latest stable Linux.
> > 
> > Here is the relevant console message after injected 8 poisons in one
> > page via
> > # ndctl inject-error namespace0.0 -n 2 -B 8210
> 
> There is a cut-n-paste error, the above line should be
>"# ndctl inject-error namespace0.0 -n 8 -B 8210"

You say "in one page" here. What is the page size? 
> 
> -jane
> 
> > then, cleared them all, and wait for 5+ hours, notice the time stamp.
> > BTW, the system is idle otherwise.
> > 
> > [ 2439.742296] mce: Uncorrected hardware memory error in user-access at 1850602400
> > [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to fsdax_poison_v1:8457 due to hardware memory corruption
> > [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page: Recovered
> > [ 2439.769949] mce: [Hardware Error]: Machine check events logged
> > -1850603000 uncached-minus<->write-back
> > [ 2439.769984] x86/PAT: memtype_reserve failed [mem 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
> > [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
> > [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype [mem 0x1850602000-0x1850602fff]

This error is reported in PFN=1850602 (at offset 0x400 = 1K)

> > 
> > At this point,
> > # ndctl list -NMu -r 0
> > {
> > "dev":"namespace0.0",
> > "mode":"fsdax",
> > "map":"dev",
> > "size":"15.75 GiB (16.91 GB)",
> > "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
> > "sector_size":4096,
> > "align":2097152,
> > "blockdev":"pmem0"
> > }
> > 
> > [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 1
> > [21352.001528] {2}[Hardware Error]: event severity: recoverable
> > [21352.007838] {2}[Hardware Error]:  Error 0, type: recoverable
> > [21352.014156] {2}[Hardware Error]:   section_type: memory error
> > [21352.020572] {2}[Hardware Error]:   physical_address: 0x001850603200

This error is in the following page: PFN=1850603 (at offset 0x200 = 512b)
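
(Arithmetic for reference, assuming 4K pages: PFN = physical_address >> 12 and
offset = physical_address & 0xfff, so 0x001850603200 >> 12 = 0x1850603 and
0x001850603200 & 0xfff = 0x200 = 512 bytes into the page.)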

Is that what you mean by "phantom error" ... from a different
address from those that were injected?

-Tony



RE: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address

2021-04-20 Thread Luck, Tony
> I meant in this case (racing to access the same poisoned pages), the
> page mapping should have been removed by and the hwpoison swap pte
> installed by the winner thread?

My "mutex" patch that Horiguchi-san has included his summary series should
make this happen in most cases. Only problem is if the first thread runs into 
some
error and does not complete unmapping the poison page from all other tasks.

So the backup plan is to scan the page tables.

>> Well, I did try a patch that removed *all* user mappings (switched CR3 to
>> swapper_pgdir) and returned to user. Then have the resulting page fault
>> report the address. But that didn't work very well.
> Curious what didn't work well in this case? :-)

Andy Lutomirski wasn't happy with the approach. It was specifically
to cover the "kernel accesses poison more than once from get_user()".

It doesn't generalize to the case where the user accessed the poison (because
you'll just take the #PF on the instruction fetch ... everything is unmapped ...
instead of on the original data access).

-Tony



Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address

2021-04-20 Thread Luck, Tony
On Mon, Apr 19, 2021 at 07:03:01PM -0700, Jue Wang wrote:
> On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote:
> 
> > This patch suggests to do page table walk to find the error virtual
> > address.  If we find multiple virtual addresses in walking, we now can't
> > determine which one is correct, so we fall back to sending SIGBUS in
> > kill_me_maybe() without error info as we do now.  This corner case needs
> > to be solved in the future.
> 
> Instead of walking the page tables, I wonder what about the following idea:
> 
> When failing to get vaddr, memory_failure just ensures the mapping is removed
> and an hwpoisoned swap pte is put in place; or the original page is flagged
> with PG_HWPOISONED and kept in the radix tree (e.g., for SHMEM THP).

To remove the mapping, you need to know the virtual address :-)

Well, I did try a patch that removed *all* user mappings (switched CR3 to
swapper_pgdir) and returned to user. Then have the resulting page fault
report the address. But that didn't work very well.

> NOTE: no SIGBUS is sent to user space.
> 
> Then do_machine_check just returns to user space to resume execution, the
> re-execution will result in a #PF and should land to the exact page fault
> handling code that generates a SIGBUS with the precise vaddr info:

That's how SRAO (and other races) are supposed to work.

-Tony


Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address

2021-04-20 Thread Luck, Tony
On Mon, Apr 19, 2021 at 06:49:15PM -0700, Jue Wang wrote:
> On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote:
> ...
> > + * This function is intended to handle "Action Required" MCEs on already
> > + * hardware poisoned pages. They could happen, for example, when
> > + * memory_failure() failed to unmap the error page at the first call, or
> > + * when multiple Action Optional MCE events races on different CPUs with
> > + * Local MCE enabled.
> 
> +Tony Luck
> 
> Hey Tony, I thought SRAO MCEs are broadcasted to all cores in the system
> as they come without an execution context, is it correct?
> 
> If Yes, Naoya, I think we might want to remove the comments about the
> "multiple Action Optional MCE racing" part.

Jue,

Correct. SRAO machine checks are broadcast.  But rather than remove the
second part, just replace with "multiple local machine checks on different
CPUs".

-Tony


RE: [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery

2021-04-19 Thread Luck, Tony
>> But there are places in the kernel where the code assumes that this
>> EFAULT return was simply because of a page fault. The code takes some
>> action to fix that, and then retries the access. This results in a second
>> machine check.
>
> What about return EHWPOISON instead of EFAULT and update the callers
> to handle EHWPOISON explicitly: i.e., not retry but give up on the page?

That seems like a good idea to me. But I got some pushback when I started
on this path earlier with some patches to the futex code.  But back then I
wasn't using error return of EHWPOISON ... possibly the code would look
less hacky with that explicitly called out.

The futex case was specifically for code using pagefault_disable(). Likely
all the other callers would need to be audited (but there are only a few dozen
places, so not too big of a deal).
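
A minimal sketch of what an audited caller might look like (the helper names
here are hypothetical, not existing kernel API):

	long ret = copyin_helper(dst, ubuf, len);	/* hypothetical */

	if (ret == -EHWPOISON)
		return ret;	/* poison: give up on the page, do not retry */
	if (ret == -EFAULT)
		ret = fixup_and_retry(dst, ubuf, len);	/* hypothetical */
	return ret;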

> My main concern is that the strong assumptions that the kernel can't hit more
> than a fixed number of poisoned cache lines before turning to user space
> may simply not be true.

Agreed.

> When DIMM goes bad, it can easily affect an entire bank or entire ram device
> chip. Even with memory interleaving, it's possible that a kernel control path
> touches lots of poisoned cache lines in the buffer it is working through.

These larger failures have other problems ... dozens of unrelated pages
may be affected. In a perfect world Linux would be told on the first error
that this is just one of many errors ... and be given a list. But in the real
world that isn't likely to happen :-(

-Tony


RE: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison

2021-04-13 Thread Luck, Tony
> So what I'm missing with all this fun is, yeah, sure, we have this
> facility out there but who's using it? Is anyone even using it at all?

Even if no applications ever do anything with it, it is still useful to avoid
crashing the whole system and just terminate one application/guest.

> If so, does it even make sense, does it need improvements, etc?

There's one more item on my long term TODO list. Add fixups so that
copy_to_user() from poison in the page cache doesn't crash, but just
checks to see if the page was clean ... in which case re-read from the
filesystem into a different physical page and retire the old page ... the
read can now succeed. If the page is dirty, then fail the read (and retire
the page ... need to make sure filesystem knows the data for the page
was lost so subsequent reads return -EIO or something).
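
In pseudo-kernel-C the idea is roughly this (sketch only; the helpers marked
hypothetical do not exist, and mapping/index/page come from the surrounding
page cache context):

	/* copy_to_user() hit poison in a page cache page */
	if (!PageDirty(page)) {
		/* clean: re-read from the filesystem into a fresh page,
		 * retire the poisoned one, then the copy can succeed */
		new_page = reread_page_from_fs(mapping, index);	/* hypothetical */
		replace_and_retire_page(page, new_page);	/* hypothetical */
	} else {
		/* dirty: the data is gone; fail this read and make sure
		 * later reads see an error too */
		mapping_set_error(mapping, -EIO);
		retire_page(page);				/* hypothetical */
	}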

Page cache occupies enough memory that it is a big enough
source of system crashes that could be avoided. I'm not sure
if there are any other obvious cases after this ... it all gets into
diminishing returns ... not really worth it to handle a case that
only occupies 0.2% of memory.

> Because from where I stand it all looks like we do all these fancy
> recovery things but is userspace even paying attention or using them or
> whatever...

See above. With core counts continuing to increase, the cloud service
providers really want to see fewer events that crash the whole physical
machine (taking down dozens, or hundreds, of guest VMs).

-Tony


Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison

2021-04-08 Thread Luck, Tony
On Thu, Apr 08, 2021 at 10:49:58AM +0200, Borislav Petkov wrote:
> On Wed, Apr 07, 2021 at 02:43:10PM -0700, Luck, Tony wrote:
> > On Wed, Apr 07, 2021 at 11:18:16PM +0200, Borislav Petkov wrote:
> > > On Thu, Mar 25, 2021 at 05:02:34PM -0700, Tony Luck wrote:
> > > > Andy Lutomirski pointed out that sending SIGBUS to tasks that
> > > > hit poison in the kernel copying syscall parameters from user
> > > > address space is not the right semantic.
> > > 
> > > What does that mean exactly?
> > 
> > Andy said that a task could check a memory range for poison by
> > doing:
> > 
> > ret = write(fd, buf, size);
> > if (ret == size) {
> > memory range is all good
> > }
> > 
> > That doesn't work if the kernel sends a SIGBUS.
> > 
> > It doesn't seem a likely scenario ... but Andy is correct that
> > the above ought to work.
> 
> We need to document properly what this is aiming to fix. He said
> something yesterday along the lines of kthread_use_mm() hitting a SIGBUS
> when a kthread "attaches" to an address space. I'm still unclear as to
> how exactly that happens - there are only a handful of kthread_use_mm()
> users in the tree...

Also not clear to me either ... but sending a SIGBUS to a kthread isn't
going to do anything useful. So avoiding doing that is another worthy
goal.

> > Yes. This is for kernel reading memory belongng to "current" task.
> 
> Provided "current" is really the task to which the poison page belongs.
> That kthread_use_mm() thing sounded like the wrong task gets killed. But that
> needs more details.

With these patches nothing gets killed when kernel touches user poison.
If this is in a regular system call then these patches will return EFAULT
to the user (but now that I see EHWPOISON exists that looks like a better
choice - so applications can distinguish the "I just used an invalid address in
a parameter to a syscall" from "This isn't my fault, the memory broke".

> > Same in that the page gets unmapped. Different in that there
> > is no SIGBUS if the kernel did the access for the user.
> 
> What is even the actual use case with sending tasks SIGBUS on poison
> consumption? KVM? Others?

KVM apparently passes a machine check into the guest. Though it seems
to be missing the MCG_STATUS information to tell the guest whether this
is an "Action Required" machine check, or an "Action Optional" (i.e.
whether the poison was found synchronously by execution of the current
instruction, or asynchronously).

> Are we documenting somewhere: "if your process gets a SIGBUS and this
> and that, which means your page got offlined, you should do this and
> that to recover"?

There is the ancient Documentation/vm/hwpoison.rst from 2009 ... nothing
seems wrong in that, but could use some updates.  I don't know how much
detail we might want to go into on recovery strategies for applications.
In terms of production s/w there was one ISV who prototyped recovery
for their application but last time I checked didn't enable it in the
production version.

Essentially it boils down to:
SIGBUS handler gets additional data giving virtual address that has gone away
(see the sketch after this list)

1) Can the application replace the lost page?
Use mmap(addr, MAP_FIXED, ...) to map a fresh page into the gap
and fill with replacement data. This case can return from SIGBUS
handler to re-execute failed instruction
2) Can the application continue in degraded mode w/o the lost page?
Hunt down pointers to lost page and update structures to say
"this data lost". Use siglongjmp() to go to preset recovery path
3) Can the application shut down gracefully?
Record details of the lost page. Inform next-of-kin. Exit.
4) Default - just exit
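
Here's a rough user-space sketch of option 1, falling back to option 4
(illustrative only; assumes the libc headers expose the BUS_MCEERR_* codes
and siginfo's si_addr_lsb field, and that zero-filling is valid replacement
data for the application):

	#include <signal.h>
	#include <stdint.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/mman.h>

	static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
	{
		size_t len;
		void *page;

		(void)sig; (void)ctx;

		if (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO)
			_exit(1);

		len = (size_t)1 << si->si_addr_lsb;	/* how much was lost */
		page = (void *)((uintptr_t)si->si_addr & ~((uintptr_t)len - 1));

		/* option 1: map a fresh page over the lost one ... */
		if (mmap(page, len, PROT_READ | PROT_WRITE,
			 MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) != page)
			_exit(1);		/* option 4: default - just exit */

		memset(page, 0, len);	/* ... and fill with replacement data */
		/* returning re-executes the failed instruction (AR case) */
	}

	int main(void)
	{
		struct sigaction sa = { .sa_sigaction = sigbus_handler,
					.sa_flags = SA_SIGINFO };

		sigaction(SIGBUS, &sa, NULL);
		/* ... application work ... */
		return 0;
	}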

-Tony


RE: [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery

2021-04-08 Thread Luck, Tony
> What I'm still unclear on, does this new version address that
> "mysterious" hang or panic which the validation team triggered or you
> haven't checked yet?

No :-(

They are triggering some case where multiple threads in a process hit the same
poison, and somehow memory_failure() fails to complete offlining the page. At
this point any other threads that hit that page get the early return from
memory_failure (because the page flags say it is poisoned) ... and so we loop.

But the "recover from cases where multiple machine checks happen
simultaneously" case is orthogonal to the "do the right thing to recover
when the kernel touches poison at a user address". So I think we can
tackle them separately

-Tony


RE: [RFC 0/4] Fix machine check recovery for copy_from_user

2021-04-08 Thread Luck, Tony
> I have one scenario, may you take into account:
>
> If one copyin case occurs, write() returned by your patch, the user process may
> check the return values, for errors, it may exit the process, then the error page
> will be freed, and then the page maybe alloced to other process or to kernel itself,
> then code will initialize it and this will trigger one SRAO, if it's used by kernel,
> we may do nothing for this, and kernel may still touch it, and lead to one panic.

In this case kill_me_never() calls memory_failure() with flags == 0. I think
(hope!) that means that it will unmap the page from the task, but will not
send a signal.

When the task exits the PTE for this page has the swap/poison signature, so the
page is not freed for re-use.

-Tony


Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison

2021-04-07 Thread Luck, Tony
On Wed, Apr 07, 2021 at 11:18:16PM +0200, Borislav Petkov wrote:
> On Thu, Mar 25, 2021 at 05:02:34PM -0700, Tony Luck wrote:
> > Andy Lutomirski pointed out that sending SIGBUS to tasks that
> > hit poison in the kernel copying syscall parameters from user
> > address space is not the right semantic.
> 
> What does that mean exactly?

Andy said that a task could check a memory range for poison by
doing:

ret = write(fd, buf, size);
if (ret == size) {
memory range is all good
}

That doesn't work if the kernel sends a SIGBUS.

It doesn't seem a likely scenario ... but Andy is correct that
the above ought to work.

> 
> From looking at the code, that is this conditional:
> 
> if (t == EX_HANDLER_UACCESS && regs && is_copy_from_user(regs)) {
> m->kflags |= MCE_IN_KERNEL_RECOV;
> m->kflags |= MCE_IN_KERNEL_COPYIN;
> 
> so what does the above have to do with syscall params?

Most "copy from user" instances are the result of a system call parameter
(e.g. "buf" in the write(2) example above).

> If it is about us being in ring 0 and touching user memory and eating
> poison in same *user* memory while doing so, then sure, that makes
> sense.

Yes. This is for kernel reading memory belongng to "current" task.

> > So stop doing that. Add a new kill_me_never() call back that
> > simply unmaps and offlines the poison page.
> 
> Right, that's the same as handling poisoned user memory.

Same in that the page gets unmapped. Different in that there
is no SIGBUS if the kernel did the access for the user.

-Tony


RE: [PATCH v3] mm,hwpoison: return -EHWPOISON when page already poisoned

2021-04-02 Thread Luck, Tony
>> Combined with my "mutex" patch (to get rid of races where 2nd process returns
>> early, but first process is still looking for mappings to unmap and tasks
>> to signal) this patch moves forward a bit. But I think it needs an
>> additional change here in kill_me_maybe() to just "return" if there is a
>> EHWPOISON return from memory_failure()
>> 
> Got this, Thanks for your reply!
> I will dig into this!

One problem with this approach is when the first task to find poison
fails to complete actions. Then the poison pages are not unmapped,
and just returning from kill_me_maybe() gets into a loop :-(

-Tony


Re: [PATCH v3] mm,hwpoison: return -EHWPOISON when page already poisoned

2021-04-01 Thread Luck, Tony
On Wed, Mar 31, 2021 at 07:25:40PM +0800, Aili Yao wrote:
> When the page is already poisoned, another memory_failure() call in the
> same page now return 0, meaning OK. For nested memory mce handling, this
> behavior may lead to one mce looping, Example:
> 
> 1.When LCME is enabled, and there are two processes A && B running on
> different core X && Y separately, which will access one same page, then
> the page corrupted when process A access it, a MCE will be rasied to
> core X and the error process is just underway.
> 
> 2.Then B access the page and trigger another MCE to core Y, it will also
> do error process, it will see TestSetPageHWPoison be true, and 0 is
> returned.
> 
> 3.The kill_me_maybe will check the return:
> 
> 1244 static void kill_me_maybe(struct callback_head *cb)
> 1245 {
> 
> 1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> 1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> 1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
> p->mce_whole_page);
> 1257 sync_core();
> 1258 return;
> 1259 }
> 
> 1267 }

With your change memory_failure() will return -EHWPOISON for the
second task that consumes poison ... so that "if" statement won't
be true and so we fall into the following code:

1273 if (p->mce_vaddr != (void __user *)-1l) {
1274 force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
1275 } else {
1276 pr_err("Memory error not recovered");
1277 kill_me_now(cb);
1278 }

If this was a copy_from_user() machine check, p->mce_vaddr is set and
the task gets a BUS_MCEERR_AR SIGBUS, otherwise we print that

"Memory error not recovered"

message and send a generic SIGBUS.  I don't think either of those options
is right.

Combined with my "mutex" patch (to get rid of races where 2nd process returns
early, but first process is still looking for mappings to unmap and tasks
to signal) this patch moves forward a bit. But I think it needs an
additional change here in kill_me_maybe() to just "return" if there is a
EHWPOISON return from memory_failure()

-Tony


RE: [PATCH] x86/mce/dev-mcelog: Fix potential memory access error

2021-03-29 Thread Luck, Tony
-   set_bit(MCE_OVERFLOW, (unsigned long *)&mcelog->flags);
+   mcelog->flags |= BIT(MCE_OVERFLOW);

set_bit() is an atomic operation because it might race with the code to
get and clear this bit:

	do {
		flags = mcelog->flags;
	} while (cmpxchg(&mcelog->flags, flags, 0) != flags);

Originally this was needed because mcelog could be called from #MC
context.

That's all changed now and the machine check notifier chain routines are
called in a kernel thread. So some other locking (mutex?) could be used
to protect access to mcelog->flags.
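
A minimal sketch of that direction (not a posted patch; assumes every reader
and writer of mcelog->flags really is in process context by now, and the
mutex name is made up):

	static DEFINE_MUTEX(mcelog_mutex);	/* hypothetical */

	mutex_lock(&mcelog_mutex);
	mcelog->flags |= BIT(MCE_OVERFLOW);
	mutex_unlock(&mcelog_mutex);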

-Tony



RE: [PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors

2021-03-24 Thread Luck, Tony
> Yeah, into
> 
> fd258dc4442c ("x86/mce: Add Skylake quirk for patrol scrub reported errors")

Thanks ... my memory is failing, and I forgot that the patch had been improved
and moved from core.c (where I looked for it) into severity.c

-Tony



RE: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock

2021-03-19 Thread Luck, Tony
>  What is the justification for making this rate limit per UID and not
>  per task, per process or systemwide?

The concern is that a malicious user is running a workload that loops
obtaining the buslock. This brings the whole system to its knees.

Limiting per task doesn't help. The user can just fork(2) a whole bunch
of tasks for a distributed buslock attack..

Systemwide might be an interesting alternative. Downside would be accidental
rate limit of non-malicious tasks that happen to grab a bus lock periodically
but in the same window with other buslocks from other users.

Do you think that a risk worth taking to make the code simpler?

-Tony


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-16 Thread Luck, Tony
On Fri, Mar 12, 2021 at 11:48:31PM +, Luck, Tony wrote:
> Thanks to the decode of the instruction we do have the virtual address. So we
> just need a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge
> page) with a write of a "not-present" value. Maybe a different poison type
> from the one we get from memory_failure() so that the #PF code can recognize
> this as a special case and do any other work that we avoided because we were
> in #MC context.

There is no such thing as the safe walk I describe above. In a multi-threaded
application other threads might munmap(2) bits of the address space which
could free some of the p4d->pud->pmd->pte tables.

But the pgd table is guaranteed to exist as long as any of the threads
in the process exist. Which gave rise to a new approach (partial credit
to Dave Hansen who helped me understand why a more extreme patch that
I wanted to use wasn't working ... and he pointed out the pgd persistence).

N.B. The code below DOES NOT WORK. My test application tried to do a write(2)
syscall with some poison in the buffer. Goal is that write should return a
short count (with only the bytes from the buffer leading up to the poison
written to the file). Currently the process gets a surprise SIGSEGV in
some glibc code :-(

The code would also need a bunch more work to handle the fact that
in a multi-threaded application multiple threads might consume the
poison, and some innocent threads not consuming the poison may also end
up drawn into the melee in the page fault handler.


The big picture.
---

This approach passes the buck for the recovery from the #MC handler
(which has very limited options to do anything useful) to the #PF
handler (which is a much friendlier environment where locks and mutexes
can be obtained as needed).

The mechanism for this buck passing is for the #MC handler to set a
reserved bit in the PGD entry for the user address where the machine
check happened, flush the TLB for that address, and then return from
the #MC handler to the kernel get_user()/copy_from_user() code which
will re-execute the instruction to access the poison address, but this
time take a #PF because of the reserved bit in the PGD.

There's a couple of changes bundled in here to help my debugging:
1) Turn off UCNA calls to memory_failure() ... just avoids races
   and make tests more deterministic for now
2) Disable "fast string" support ... avoid "REP MOVS" copy routines
   since I'm testing on an old system that treats poison consumed
   by REP MOVS as fatal.

Posted here for general comments on the approach.  On the plus
side it avoids taking multiple machine checks in the kernel when
it is accessing user data. I think it can also meet the goal set
by Andy Lutomirski of avoiding SIGBUS from syscalls that happen
to touch user poison. They should see either short counts for
calls like write(2) that may partially succeed or -EFAULT if
the system call totally failed.

-Tony

---

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index f24d7ef8fffa..e533eaf20834 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -23,6 +23,7 @@
 #define _PAGE_BIT_SOFTW2   10  /* " */
 #define _PAGE_BIT_SOFTW3   11  /* " */
 #define _PAGE_BIT_PAT_LARGE12  /* On 2MB or 1GB pages */
+#define _PAGE_BIT_RESERVED1    51  /* Reserved bit */
 #define _PAGE_BIT_SOFTW4   58  /* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0    59  /* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1    60  /* Protection Keys, bit 2/4 */
@@ -56,6 +57,7 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL  (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST (_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
+#define _PAGE_RESERVED1    (_AT(pteval_t, 1) << _PAGE_BIT_RESERVED1)
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 #define _PAGE_PKEY_BIT0    (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0)
 #define _PAGE_PKEY_BIT1    (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1)
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 7f7200021bd1..269e8ee04c11 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -45,4 +45,24 @@ void __noreturn handle_stack_overflow(const char *message,
  unsigned long fault_address);
 #endif
 
+static inline void pgd_set_reserved(pgd_t *pgdp, unsigned long addr)
+{
+   pgd_t   pgd;
+
+   pgdp += pgd_index(addr);
+   pgd = *pgdp;
+   pgd.pgd |= _PAGE_RESERVED1;
+   WRITE_ONCE(*pgdp, pgd);
+}
+
+static inline void pgd_clr_reserved(pgd_t *pgdp, unsigned long addr)
+{
+   pgd_t   pgd;
+
+   pgdp += pgd_index(addr);
+   pgd = *pgdp;
+   pgd.pgd &= ~_PAGE_RESERVED1;
+   WRITE_ONCE(*pgdp, pgd);
+}

RE: EDAC list as Trojan Horse distribution ??

2021-03-16 Thread Luck, Tony
>> Nothing new - just the next spammer attempt. 

> But this was a new class of Spam. So far i got only mass mailing... This was
> personalized based on my previous e-Mail (did not include this part in my mail)

Somewhat new - combining trawling of public mailing lists for addresses with
a phishing attack trying to get you to open a (presumably) malicious payload.

I haven't personally seen that ... probably my company's e-mail filters blocked it.

-Tony


RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-12 Thread Luck, Tony
>> will memory_failure() find it and unmap it? if succeed, then the current
>> will be signaled with correct vaddr and shift?
>
> That's a very good question.  I didn't see a SIGBUS when I first wrote this
> code, hence all the p->mce_vaddr.  But now I'm
> a) not sure why there wasn't a signal
> b) if we are to fix the problems noted by AndyL, need to make sure that
> there isn't a SIGBUS

Tests on upstream kernel today show that memory_failure() is both unmapping
the page and sending a SIGBUS.


My biggest issue with the KERNEL_COPYIN recovery path is that we don't have
code to mark the page not present while we are still in do_machine_check().
That's resulted in recovery working for simple cases where there is a single
get_user() call followed by an error return if that failed. But more complex
cases require more machine checks and a touching faith that the kernel will
eventually give up trying (spoiler: it sometimes doesn't).

Thanks to the decode of the instruction we do have the virtual address. So we
just need a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge
page) with a write of a "not-present" value. Maybe a different poison type
from the one we get from memory_failure() so that the #PF code can recognize
this as a special case and do any other work that we avoided because we were
in #MC context.

-Tony


RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-12 Thread Luck, Tony
> Sorry to interrupt as I am really confused here:
> If it's a copyin case, has the page been mapped for the current process?

Yes. The kernel has the current process mapped to the low address range
and code like get_user(), put_user() "simply" reaches down to those addresses
(maybe not so simply, as the access needs to be within a STAC/CLAC section
of code on modern CPUs, and the access instruction needs an entry in EXTABLE
so that the page fault handler can fix it if there isn't a page mapped at the
user address).

> will memory_failure() find it and unmap it? if succeed, then the current
> will be signaled with correct vaddr and shift?

That's a very good question.  I didn't see a SIGBUS when I first wrote this
code, hence all the p->mce_vaddr.  But now I'm
a) not sure why there wasn't a signal
b) if we are to fix the problems noted by AndyL, need to make sure that there
isn't a SIGBUS

> Maybe the mce_vaddr is set correctly, but we may lost the correct page shift?
Yes. There is a hard-coded PAGE_SHIFT for this case - which may not match the 
actual page size.

> And for copyin case, we don't need to call set_mce_nospec()?
Yes. We should.

Thanks for your good questions.

-Tony


RE: [PATCH V2 20/25] perf/x86/intel: Add Alder Lake Hybrid support

2021-03-11 Thread Luck, Tony
>> I think the "sapphire_rapids" is the code name for the server platform.
>
> If that's really the case, then:
>
> #define INTEL_FAM6_SAPPHIRERAPIDS_X 0x8F
>
> is wrong, those things should be uarch name, not platform name. Tony?

0x8F is the model number of the CPU that is named Sapphire Rapids that
goes into the Eagle Stream platform.

If you want a uarch name, that might be the name of the core
(something-cove ... I can't keep track of the names of all the
coves). But that would likely lead to different confusion later if that
*cove core is used in some other CPU model number that doesn't
neatly fit into our _X, _L etc. suffix scheme.

-Tony


RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-11 Thread Luck, Tony
> I guess that p->mce_vaddr stores the virtual address of the error here.
> If so, sending SIGBUS with the address looks enough as we do now, so why
> do you walk page table to find the error virtual address?

p->mce_vaddr only has the virtual address for the COPYIN case. In that code
path we decode the kernel instruction that hit the fault in order to find the
virtual address. That's easy because:

1) The kernel RIP is known to be good (can't page fault etc. on kernel address).
2) There are only a half dozen instructions used by the kernel for get_user() or
 copy_from_user().

When the machine check happens during user execution accessing poison data
we only have the physical address (from MCi_ADDR).

-Tony


RE: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-11 Thread Luck, Tony
> I think we need to at least fix the existing bug before we add more
> signals.  AFAICS the MCE_IN_KERNEL_COPYIN code is busted for kernel
> threads.

Can a kernel thread do get_user() or copy_from_user()?  Do we have kernel
threads that have an associated user address space to copy from?

-Tony


Re: [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-09 Thread Luck, Tony
On Tue, Mar 09, 2021 at 08:28:24AM +, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Mar 09, 2021 at 02:35:34PM +0800, Aili Yao wrote:
> > When the page is already poisoned, another memory_failure() call in the
> > same page now return 0, meaning OK. For nested memory mce handling, this
> > behavior may lead to mce looping, Example:
> > 
> > 1.When LCME is enabled, and there are two processes A && B running on
> > different core X && Y separately, which will access one same page, then
> > the page corrupted when process A access it, a MCE will be rasied to
> > core X and the error process is just underway.
> > 
> > 2.Then B access the page and trigger another MCE to core Y, it will also
> > do error process, it will see TestSetPageHWPoison be true, and 0 is
> > returned.
> > 
> > 3.The kill_me_maybe will check the return:
> > 
> > 1244 static void kill_me_maybe(struct callback_head *cb)
> > 1245 {
> > 
> > 1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > 1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > 1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
> > p->mce_whole_page);
> > 1257 sync_core();
> > 1258 return;
> > 1259 }
> > 
> > 1267 }
> > 
> > 4. The error process for B will end, and may nothing happened if
> > kill-early is not set, The process B will re-excute instruction and get
> > into mce again and then loop happens. And also the set_mce_nospec()
> > here is not proper, may refer to commit fd0e786d9d09 ("x86/mm,
> > mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages").
> > 
> > For other cases which care the return value of memory_failure() should
> > check why they want to process a memory error which have already been
> > processed. This behavior seems reasonable.
> 
> Other reviewers shared ideas about the returned value, but actually
> I'm not sure which the best one is (EBUSY is not that bad).
> What we need to fix the reported issue is to return non-zero value
> for "already poisoned" case (the value itself is not so important). 
> 
> Other callers of memory_failure() (mostly test programs) could see
> the change of return value, but they can already see EBUSY now and
> anyway they should check dmesg for more detail about why failed,
> so the impact of the change is not so big.
> 
> > 
> > Signed-off-by: Aili Yao 
> 
> Reviewed-by: Naoya Horiguchi 

I think that both this and my "add a mutex" patch are both
too simplistic for this complex problem :-(

When multiple CPUs race to call memory_failure() for the same
page we need the following results:

1) Poison page should be marked not-present in all tasks
I think the mutex patch achieves this as long as
memory_failure() doesn't hit an error[1].

2) All tasks that were executing an instruction that was accessing
   the poison location should see a SIGBUS with virtual address and
   BUS_MCEERR_AR signature in siginfo.
Neither solution achieves this. The -EBUSY return ensures
that there is a SIGBUS for the tasks that get the -EBUSY
return, but no siginfo details.
Just the mutex patch *might* have BUS_MCEERR_AO signature
to the race losing tasks, but only if they have PF_MCE_EARLY
set (so says the comment in kill_proc() ... but I don't
see the code checking for that bit).

#2 seems hard to achieve ... there are inherent races that mean the
AO SIGBUS could have been queued to the task before it even hits
the poison.

Maybe should include a non-action:

3) A task should only see one SIGBUS per poison?
Not sure if this is achievable either ... what if the task
has the same page mapped multiple times?

-Tony

[1] still looking at why my futex injection test ends with a "reserved
kernel page still referenced by 1 users"


[PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races

2021-03-08 Thread Luck, Tony
There can be races when multiple CPUs consume poison from the same
page. The first into memory_failure() atomically sets the HWPoison
page flag and begins hunting for tasks that map this page. Eventually
it invalidates those mappings and may send a SIGBUS to the affected
tasks.

But while all that work is going on, other CPUs see a "success"
return code from memory_failure() and so they believe the error
has been handled and continue executing.

Fix by wrapping most of the internal parts of memory_failure() in
a mutex.

Signed-off-by: Tony Luck 
---
 mm/memory-failure.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..c1509f4b565e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
return rc;
 }
 
+static DEFINE_MUTEX(mf_mutex);
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
return -ENXIO;
}
 
+   mutex_lock(&mf_mutex);
+
 try_again:
-   if (PageHuge(p))
-   return memory_failure_hugetlb(pfn, flags);
+   if (PageHuge(p)) {
+   res = memory_failure_hugetlb(pfn, flags);
+   goto out2;
+   }
+
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
+   mutex_unlock(&mf_mutex);
return 0;
}
 
@@ -1463,9 +1471,11 @@ int memory_failure(unsigned long pfn, int flags)
res = MF_FAILED;
}
action_result(pfn, MF_MSG_BUDDY, res);
+   mutex_unlock(&mf_mutex);
return res == MF_RECOVERED ? 0 : -EBUSY;
} else {
action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
+   mutex_unlock(&mf_mutex);
return -EBUSY;
}
}
@@ -1473,6 +1483,7 @@ int memory_failure(unsigned long pfn, int flags)
if (PageTransHuge(hpage)) {
if (try_to_split_thp_page(p, "Memory Failure") < 0) {
action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+   mutex_unlock(&mf_mutex);
return -EBUSY;
}
VM_BUG_ON_PAGE(!page_count(p), p);
@@ -1517,6 +1528,7 @@ int memory_failure(unsigned long pfn, int flags)
num_poisoned_pages_dec();
unlock_page(p);
put_page(p);
+   mutex_unlock(&mf_mutex);
return 0;
}
if (hwpoison_filter(p)) {
@@ -1524,6 +1536,7 @@ int memory_failure(unsigned long pfn, int flags)
num_poisoned_pages_dec();
unlock_page(p);
put_page(p);
+   mutex_unlock(&mf_mutex);
return 0;
}
 
@@ -1559,6 +1572,8 @@ int memory_failure(unsigned long pfn, int flags)
res = identify_page_state(pfn, p, page_flags);
 out:
unlock_page(p);
+out2:
+   mutex_unlock(&mf_mutex);
return res;
 }
 EXPORT_SYMBOL_GPL(memory_failure);
-- 
2.29.2



RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-08 Thread Luck, Tony
>> So it should be safe to grab and hold a mutex.  See patch below.
>
> The mutex approach looks simpler and safer, so I'm fine with it.

Thanks.  Is that an "Acked-by:"?

>>  /**
>>   * memory_failure - Handle memory failure of a page.
>>   * @pfn: Page Number of the corrupted page
>> @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
>>  return -ENXIO;
>>  }
>>  
>> +mutex_lock(&mf_mutex);
>
> Is it better to take mutex before memory_failure_dev_pagemap() block?
> Or we don't have to protect against race for device memory?

No races (recovery is only attempted for errors in normal memory).

-Tony


RE: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-08 Thread Luck, Tony
> Can you point me at that SIGBUS code in a current kernel?

It is in kill_me_maybe().  mce_vaddr is setup when we disassemble whatever
get_user() or copy from user variant was in use in the kernel when the poison
memory was consumed.

if (p->mce_vaddr != (void __user *)-1l) {
force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);

Would it be any better if we used the BUS_MCEERR_AO code that goes into siginfo?

That would make it match up better with what happens when poison is found
asynchronously by the patrol scrubber. I.e. the semantics are:

AR: You just touched poison at this address and need to do something about that.
AO: Just letting you know that you have some poison at the address in siginfo.
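
Illustrative only: how a user-space SIGBUS handler installed with SA_SIGINFO
can read that distinction out of siginfo (assumes the libc headers expose the
BUS_MCEERR_* codes):

	static void handler(int sig, siginfo_t *si, void *ctx)
	{
		(void)sig; (void)ctx;

		if (si->si_code == BUS_MCEERR_AR) {
			/* synchronous: the faulting access consumed poison at
			 * si->si_addr; returning without fixing the mapping
			 * just re-faults */
		} else if (si->si_code == BUS_MCEERR_AO) {
			/* advisory: poison was found (e.g. by the patrol
			 * scrubber) at si->si_addr, no access has failed yet */
		}
	}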

-Tony


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-05 Thread Luck, Tony
This whole page table walking patch is trying to work around the
races caused by multiple calls to memory_failure() for the same
page.

Maybe better to just avoid the races.  The comment right above
memory_failure says:

 * Must run in process context (e.g. a work queue) with interrupts
 * enabled and no spinlocks hold.

So it should be safe to grab and hold a mutex.  See patch below.

-Tony

commit 8dd0dbe7d595e02647e9c2c76c03341a9f6bd7b9
Author: Tony Luck 
Date:   Fri Mar 5 10:40:48 2021 -0800

Use a mutex to avoid memory_failure() races

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..c1509f4b565e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
return rc;
 }
 
+static DEFINE_MUTEX(mf_mutex);
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
return -ENXIO;
}
 
+   mutex_lock(&mf_mutex);
+
 try_again:
-   if (PageHuge(p))
-   return memory_failure_hugetlb(pfn, flags);
+   if (PageHuge(p)) {
+   res = memory_failure_hugetlb(pfn, flags);
+   goto out2;
+   }
+
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
+   mutex_unlock(&mf_mutex);
return 0;
}
 
@@ -1463,9 +1471,11 @@ int memory_failure(unsigned long pfn, int flags)
res = MF_FAILED;
}
action_result(pfn, MF_MSG_BUDDY, res);
+   mutex_unlock(&mf_mutex);
return res == MF_RECOVERED ? 0 : -EBUSY;
} else {
action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
+   mutex_unlock(&mf_mutex);
return -EBUSY;
}
}
@@ -1473,6 +1483,7 @@ int memory_failure(unsigned long pfn, int flags)
if (PageTransHuge(hpage)) {
if (try_to_split_thp_page(p, "Memory Failure") < 0) {
action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+   mutex_unlock(&mf_mutex);
return -EBUSY;
}
VM_BUG_ON_PAGE(!page_count(p), p);
@@ -1517,6 +1528,7 @@ int memory_failure(unsigned long pfn, int flags)
num_poisoned_pages_dec();
unlock_page(p);
put_page(p);
+   mutex_unlock(&mf_mutex);
return 0;
}
if (hwpoison_filter(p)) {
@@ -1524,6 +1536,7 @@ int memory_failure(unsigned long pfn, int flags)
num_poisoned_pages_dec();
unlock_page(p);
put_page(p);
+   mutex_unlock(&mf_mutex);
return 0;
}
 
@@ -1559,6 +1572,8 @@ int memory_failure(unsigned long pfn, int flags)
res = identify_page_state(pfn, p, page_flags);
 out:
unlock_page(p);
+out2:
+   mutex_unlock(&mf_mutex);
return res;
 }
 EXPORT_SYMBOL_GPL(memory_failure);


RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-05 Thread Luck, Tony
> From the walk, it seems we have got the virtual address, can we just send a 
> SIGBUS with it?

If the walk wins the race and the pte for the poisoned page is still valid, 
then yes.

But we could have:

CPU1                            CPU2
memory_failure sets poison
bit for struct page

rmap finds page in task
on CPU2 and sets PTE
to not-valid-poison

                                memory_failure returns
                                early because struct page
                                already marked as poison

                                walk page tables looking
                                for mapping - don't find it

-Tony


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-04 Thread Luck, Tony
On Thu, Mar 04, 2021 at 02:45:24PM +0800, Aili Yao wrote:
> > > if your methods works, should it be like this?
> > > 
> > > 1582 pteval = 
> > > swp_entry_to_pte(make_hwpoison_entry(subpage));
> > > 1583 if (PageHuge(page)) {
> > > 1584 hugetlb_count_sub(compound_nr(page), 
> > > mm);
> > > 1585 set_huge_swap_pte_at(mm, address,
> > > 1586  pvmw.pte, 
> > > pteval,
> > > 1587  
> > > vma_mmu_pagesize(vma));
> > > 1588 } else {
> > > 1589 dec_mm_counter(mm, mm_counter(page));
> > > 1590 set_pte_at(mm, address, pvmw.pte, 
> > > pteval);
> > > 1591 }
> > > 
> > > the page fault check if it's a poison page using is_hwpoison_entry(),
> > >   
> > 
> > And if it works, does we need some locking mechanism before we call 
> > walk_page_range();
> > if we lock, does we need to process the blocking interrupted error as other 
> > places will do?
> > 
> 
> And another thing:
> Do we need a call to flush_tlb_page(vma, address) to make the pte changes 
> into effect?

Thanks for all the pointers.  I added them to the patch (see below).
[The pmd/pud cases may need some tweaking ... but just trying to get
the 4K page case working first]

I tried testing by skipping the call to memory_failure() and just
using this new code to search the page tables for the current page and
marking it hwpoison (to simulate the case where a 2nd process gets the
early return from memory_failure()). Something is still missing because I get:

[  481.911298] mce: pte_entry: matched pfn - mark poison & zap pte
[  481.917935] MCE: Killing einj_mem_uc: due to hardware memory corruption 
fault at 7fe64b33b400
[  482.933775] BUG: Bad page cache in process einj_mem_uc  pfn:408b6d6
[  482.940777] page:13ea6e96 refcount:3 mapcount:1 
mapping:e3a069d9 index:0x0 pfn:0x408b6d6
[  482.951355] memcg:94a809834000
[  482.955153] aops:shmem_aops ino:3c04
[  482.959142] flags: 0x97c0880015(locked|uptodate|lru|swapbacked|hwpoison)
[  482.967018] raw: 0097c0880015 94c80e93ec00 94c80e93ec00 
94c80a9b25a8
[  482.975666] raw:   0003 
94a809834000
[  482.984310] page dumped because: still mapped when deleted

-Tony

commit e5de44560b33e2d407704243566253a70f858a59
Author: Tony Luck 
Date:   Tue Mar 2 15:06:33 2021 -0800

x86/mce: Handle races between machine checks

When multiple CPUs hit the same poison memory there is a race. The
first CPU into memory_failure() atomically marks the page as poison
and continues processing to hunt down all the tasks that map this page
so that the virtual addresses can be marked not-present and SIGBUS
sent to the task that did the access.

Later CPUs get an early return from memory_failure() and may return
to user mode and access the poison again.

Add a new argument to memory_failure() so that it can indicate when
the race has been lost. Fix kill_me_maybe() to scan page tables in
this case to unmap pages.

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..a52c6a772de2 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -28,8 +28,12 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -637,6 +641,7 @@ static int uc_decode_notifier(struct notifier_block *nb, 
unsigned long val,
 {
struct mce *mce = (struct mce *)data;
unsigned long pfn;
+   int already = 0;
 
if (!mce || !mce_usable_address(mce))
return NOTIFY_DONE;
@@ -646,8 +651,9 @@ static int uc_decode_notifier(struct notifier_block *nb, 
unsigned long val,
return NOTIFY_DONE;
 
pfn = mce->addr >> PAGE_SHIFT;
-   if (!memory_failure(pfn, 0)) {
-   set_mce_nospec(pfn, whole_page(mce));
+   if (!memory_failure(pfn, 0, &already)) {
+   if (!already)
+   set_mce_nospec(pfn, whole_page(mce));
mce->kflags |= MCE_HANDLED_UC;
}
 
@@ -1248,6 +1254,79 @@ static void __mc_scan_banks(struct mce *m, struct 
pt_regs *regs, struct mce *fin
*m = *final;
 }
 
+static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, 
struct mm_walk *walk)
+{
+   u64 pfn = (u64)walk->private;
+   struct page *page;
+   pte_t pteval;
+
+   if (pte_pfn(*pte) == pfn) {
+pr_info("pte_entry: matched pfn - mark poison & zap pte\n");
+   page = pfn_to_page(pfn);
+   lock_page(page);
+SetPageHWPoison(page);
+   pteval = swp_entry_to_pte(make_hwpoison_entry(page));
+   

RE: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-03 Thread Luck, Tony
> For error address with sigbus, i think this is not an issue resulted by the 
> patch i post, before my patch, the issue is already there.
> I don't find a realizable way to get the correct address for same reason --- 
> we don't know whether the page mapping is there or not when
> we got to kill_me_maybe(), in some case, we may get it, but there are a lot 
> of parallel issue need to consider, and if failed we have to fallback
> to the error brach again, remaining current code may be an easy option;

My RFC patch from yesterday removes the uncertainty about whether the page is 
there or not. After it walks the page
tables we know that the poison page isn't mapped (note that patch is RFC for a 
reason ... I'm 90% sure that it should
do a bit more than just clear the PRESENT bit).

So perhaps memory_failure() has queued a SIGBUS for this task, if so, we take 
it when we return from kill_me_maybe()

If not, we will return to user mode and re-execute the failing instruction ... 
but because the page is unmapped we will take a #PF

The x86 page fault handler will see that the page for this physical address is 
marked HWPOISON, and it will send the SIGBUS
(just like it does if the page had been removed by an earlier UCNA/SRAO error).

At least that's the theory.

-Tony


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-03 Thread Luck, Tony
On Fri, Feb 26, 2021 at 10:59:15AM +0800, Aili Yao wrote:
> Hi naoya, tony:
> > > 
> > > Idea for what we should do next ... Now that x86 is calling 
> > > memory_failure()
> > > from user context ... maybe parallel calls for the same page should
> > > be blocked until the first caller completes so we can:
> > > a) know that pages are unmapped (if that happens)
> > > b) all get the same success/fail status  
> > 
> > One memory_failure() call changes the target page's status and
> > affects all mappings to all affected processes, so I think that
> > (ideally) we don't have to block other threads (letting them
> > early return seems fine).  Sometimes memory_failure() fails,
> > but even in such case, PG_hwpoison is set on the page and other
> > threads properly get SIGBUSs with this patch, so I think that
> > we can avoid the worst scenario (like system stall by MCE loop).
> > 
> I agree with naoya's point, if we block for this issue, Does this change the 
> result
> that the process should be killed? Or is there something other still need to 
> be considered?

Ok ... no blocking ... I think someone in this thread suggested
scanning the page tables to make sure the poisoned page had been
unmapped.

There's a walk_page_range() function that does all the work for that.
Just need to supply some callback routines that check whether a
mapping includes the bad PFN and clear the PRESENT bit.

RFC patch below against v5.12-rc1

-Tony

From 8de23b7f1be00ad38e129690dfe0b1558fad5ff8 Mon Sep 17 00:00:00 2001
From: Tony Luck 
Date: Tue, 2 Mar 2021 15:06:33 -0800
Subject: [PATCH] x86/mce: Handle races between machine checks

When multiple CPUs hit the same poison memory there is a race. The
first CPU into memory_failure() atomically marks the page as poison
and continues processing to hunt down all the tasks that map this page
so that the virtual addresses can be marked not-present and SIGBUS
sent to the task that did the access.

Later CPUs get an early return from memory_failure() and may return
to user mode and access the poison again.

Add a new argument to memory_failure() so that it can indicate when
the race has been lost. Fix kill_me_maybe() to scan page tables in
this case to unmap pages.
---
 arch/x86/kernel/cpu/mce/core.c | 61 +++---
 drivers/base/memory.c  |  2 +-
 include/linux/mm.h |  2 +-
 mm/hwpoison-inject.c   |  2 +-
 mm/madvise.c   |  2 +-
 mm/memory-failure.c|  6 ++--
 6 files changed, 64 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..2c6c560f3f92 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -637,6 +638,7 @@ static int uc_decode_notifier(struct notifier_block *nb, 
unsigned long val,
 {
struct mce *mce = (struct mce *)data;
unsigned long pfn;
+   int already = 0;
 
if (!mce || !mce_usable_address(mce))
return NOTIFY_DONE;
@@ -646,8 +648,9 @@ static int uc_decode_notifier(struct notifier_block *nb, 
unsigned long val,
return NOTIFY_DONE;
 
pfn = mce->addr >> PAGE_SHIFT;
-   if (!memory_failure(pfn, 0)) {
-   set_mce_nospec(pfn, whole_page(mce));
+   if (!memory_failure(pfn, 0, &already)) {
+   if (!already)
+   set_mce_nospec(pfn, whole_page(mce));
mce->kflags |= MCE_HANDLED_UC;
}
 
@@ -1248,6 +1251,50 @@ static void __mc_scan_banks(struct mce *m, struct 
pt_regs *regs, struct mce *fin
*m = *final;
 }
 
+static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, 
struct mm_walk *walk)
+{
+   u64 pfn = (u64)walk->private;
+
+   if (pte_pfn(*pte) == pfn)
+   pte->pte = pte->pte & ~_PAGE_PRESENT;
+
+   return 0;
+}
+
+static int pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, 
struct mm_walk *walk)
+{
+   int shift = PMD_SHIFT - PAGE_SHIFT;
+   u64 pfn = (u64)walk->private;
+
+   if (!pmd_large(*pmd))
+   return 0;
+
+   if (pmd_pfn(*pmd) >> shift == pfn >> shift)
+   pmd->pmd = pmd->pmd & ~_PAGE_PRESENT;
+
+   return 0;
+}
+
+static int pud_entry(pud_t *pud, unsigned long addr, unsigned long next, 
struct mm_walk *walk)
+{
+   int shift = PUD_SHIFT - PAGE_SHIFT;
+   u64 pfn = (u64)walk->private;
+
+   if (!pud_large(*pud))
+   return 0;
+
+   if (pud_pfn(*pud) >> shift == pfn >> shift)
+   pud->pud = pud->pud & ~_PAGE_PRESENT;
+
+   return 0;
+}
+
+static struct mm_walk_ops walk = {
+   .pte_entry = pte_entry,
+   .pmd_entry = pmd_entry,
+   .pud_entry = pud_entry
+};
+
 static void kill_me_now(struct callback_head *ch)
 {
force_sig(SIGBUS);
@@ -1257,15 +1304,19 @@ static void kill_me_maybe(struct 

RE: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-01 Thread Luck, Tony
> Some programs may use read(2), write(2), etc as ways to check if
> memory is valid without getting a signal.  They might not want
> signals, which means that this feature might need to be configurable.

That sounds like an appalling hack. If users need such a mechanism
we should create some better way to do that.

An aeon ago ACPI created the RASF table as a way for the OS to
ask the platform to scan a block of physical memory using the patrol
scrubber in the memory controller.  I never did anything with it in Linux
because it was just too complex and I didn't know of any use cases.

Users would want to check virtual addresses. Perhaps some new
option MADV_CHECKFORPOISON to madvise(2) ?
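
Strawman of how that could look (to be clear, MADV_CHECKFORPOISON does not
exist; the flag value and the errno convention below are invented purely
for illustration):

#include <sys/mman.h>
#include <errno.h>

#define MADV_CHECKFORPOISON	25	/* hypothetical, not in any kernel */

/* returns non-zero if [buf, buf + len) contains a poisoned page */
static int buffer_has_poison(void *buf, size_t len)
{
	if (madvise(buf, len, MADV_CHECKFORPOISON) == -1)
		return errno == EHWPOISON;
	return 0;
}

A caller could then drop the read(2)/write(2) probing trick entirely.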

-Tony


RE: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-03-01 Thread Luck, Tony
> Programs that get a signal might expect that the RIP that the signal
> frame points to is the instruction that caused the signal and that the
> instruction faulted without side effects.  For SIGSEGV, I would be
> especially nervous about this.  Maybe SIGBUS is safer.  For SIGSEGV,
> it's entirely valid to look at CR2 / si_fault_addr, fix it up, and
> return.  This would be completely *invalid* with your patch.  I'm not
> sure what to do about this.

The original plan was that s/w like databases would be able to write
their own application specific recovery code. E.g. they hit poison while
reading some "table". The process gets a SIGBUS with siginfo telling
the handler the virtual address range that has been lost. The code
uses mmap(MAP_FIXED) to map a new page into the lost address and
fills it with suitable data (either reconstructing lost data by replaying
transactions, or filling the table with some "data unknown" indicator).

Then the SIGBUS handler returns to re-execute the instruction that
failed.

As far as I know nobody has been that creative in production s/w.
But I think there are folks with a siglongjmp() to a "this whole transaction
just failed" safe point.

-Tony


Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-02-26 Thread Luck, Tony
On Fri, Feb 26, 2021 at 10:52:50AM +0800, Aili Yao wrote:
> Hi naoya,Oscar,david:
> > 
> > > We could use some negative value (error code) to report the reported case,
> > > then as you mentioned above, some callers need change to handle the
> > > new case, and the same is true if you use some positive value.
> > > My preference is -EHWPOISON, but other options are fine if justified 
> > > well.  
> > 
> > -EHWPOISON seems like a good fit.
> > 
> I am OK with the -EHWPOISON error code, But I have one doubt here:
> When we return this -EHWPOISON error code, Does this means we have to add a 
> new error code
> to error-base.h or errno.h? Is this easy realized?

The page already poisoned isn't really an error though. Just the result
of a race condition.  What if we added an extra argument to memory_failure()
so it can tell the caller that the specific reason for the early successful
return is that the page was already poisoned?

Something like this (untested - patch against v5.11):

-Tony

---

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e133ce1e562b..0e32c4d879fb 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -637,6 +637,7 @@ static int uc_decode_notifier(struct notifier_block *nb, 
unsigned long val,
 {
struct mce *mce = (struct mce *)data;
unsigned long pfn;
+   int already = 0;
 
if (!mce || !mce_usable_address(mce))
return NOTIFY_DONE;
@@ -646,8 +647,9 @@ static int uc_decode_notifier(struct notifier_block *nb, 
unsigned long val,
return NOTIFY_DONE;
 
pfn = mce->addr >> PAGE_SHIFT;
-   if (!memory_failure(pfn, 0)) {
-   set_mce_nospec(pfn, whole_page(mce));
+   if (!memory_failure(pfn, 0, &already)) {
+   if (!already)
+   set_mce_nospec(pfn, whole_page(mce));
mce->kflags |= MCE_HANDLED_UC;
}
 
@@ -1245,15 +1247,19 @@ static void kill_me_maybe(struct callback_head *cb)
 {
struct task_struct *p = container_of(cb, struct task_struct, 
mce_kill_me);
int flags = MF_ACTION_REQUIRED;
+   int already = 0;
 
pr_err("Uncorrected hardware memory error in user-access at %llx", 
p->mce_addr);
 
if (!p->mce_ripv)
flags |= MF_MUST_KILL;
 
-   if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
+   if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
!(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
-   set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+   if (already)
+   force_sig(SIGBUS); // BETTER CODE NEEDED HERE!!!
+   else
+   set_mce_nospec(p->mce_addr >> PAGE_SHIFT, 
p->mce_whole_page);
sync_core();
return;
}
@@ -1440,7 +1446,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 EXPORT_SYMBOL_GPL(do_machine_check);
 
 #ifndef CONFIG_MEMORY_FAILURE
-int memory_failure(unsigned long pfn, int flags)
+int memory_failure(unsigned long pfn, int flags, int *already)
 {
/* mce_severity() should not hand us an ACTION_REQUIRED error */
BUG_ON(flags & MF_ACTION_REQUIRED);
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index eef4ffb6122c..24c36623e492 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev,
if (kstrtoull(buf, 0, &pfn) < 0)
return -EINVAL;
pfn >>= PAGE_SHIFT;
-   ret = memory_failure(pfn, 0);
+   ret = memory_failure(pfn, 0, NULL);
return ret ? ret : count;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..88b92820465c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3045,7 +3045,7 @@ enum mf_flags {
MF_MUST_KILL = 1 << 2,
MF_SOFT_OFFLINE = 1 << 3,
 };
-extern int memory_failure(unsigned long pfn, int flags);
+extern int memory_failure(unsigned long pfn, int flags, int *already);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 1ae1ebc2b9b1..bfd5151dcd3f 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val)
 
 inject:
pr_info("Injecting memory failure at pfn %#lx\n", pfn);
-   return memory_failure(pfn, 0);
+   return memory_failure(pfn, 0, NULL);
 }
 
 static int hwpoison_unpoison(void *data, u64 val)
diff --git a/mm/madvise.c b/mm/madvise.c
index 6a660858784b..ade1956632aa 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -907,7 +907,7 @@ static int madvise_inject_error(int behavior,
} else {
pr_info("Injecting memory failure for pfn %#lx at 
process virtual address %#lx\n",
  

Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-02-25 Thread Luck, Tony
On Thu, Feb 25, 2021 at 12:38:06PM +, HORIGUCHI NAOYA(堀口 直也) wrote:
> Thank you for shedding light on this, this race looks worrisome to me.
> We call try_to_unmap() inside memory_failure(), where we find affected
> ptes by page_vma_mapped_walk() and convert into hwpoison entires in
> try_to_unmap_one().  So there seems two racy cases:
> 
>   1)
>      CPU 0                          CPU 1
>      page_vma_mapped_walk
>                                     clear _PAGE_PRESENT bit
>      // skipped the entry
> 
>   2)
>      CPU 0                          CPU 1
>      page_vma_mapped_walk
>        try_to_unmap_one
>                                     clear _PAGE_PRESENT bit
>        convert the entry
>        set_pte_at
> 
> In case 1, the affected processes get signals on later access,
> so although the info in SIGBUS could be different, that's OK.
> And we have no impact in case 2.

I've been debugging a similar issue for a few days and finally
got enough traces to partially understand what happened.

The test case is a multi-threaded pointer chasing micro-benchmark
running on all logical CPUs. We then inject poison into the address
space of the process.

All works fine if one thread consumes poison and completes all
Linux machine check processing before any other threads read the
poison. The page is unmapped, a SIGBUS is sent (which kills all
threads).

But in the problem case I see:

CPU1 reads poison, takes a machine check. Gets to the
kill_me_maybe() task work, which calls memory_failure(). This
CPU sets the page poison flag, but is still executing the
rest of the flow to hunt down tasks/mappings to invalidate pages
and send SIGBUS if required.

CPU2 reads the poison. When it gets to memory_failure()
there's an early return because the poison flag is already
set. So in current code it returns and takes the machine
check again.

CPU3 reads the poison and starts along same path that CPU2
did.

Meanwhile CPU1 gets far enough along in memory failure and hits
a problem. It prints:

[ 1867.409837] Memory failure: 0x42a9ff6: reserved kernel page still referenced 
by 1 users
[ 1867.409850] Memory failure: 0x42a9ff6: recovery action for reserved kernel 
page: Failed

and doesn't complete unmapping the page that CPU2 and CPU3 are touching.

Other CPUs gradually reach the poison and join in the fun of repeatedly
taking machine checks.

I have not yet tracked down why this user access is reported as a "reserved
kernel page".
Some traces showed that futex(2) syscall was in use from this benchmark,
so maybe the kernel locked a user page that was a contended futex???


Idea for what we should do next ... Now that x86 is calling memory_failure()
from user context ... maybe parallel calls for the same page should
be blocked until the first caller completes so we can:
a) know that pages are unmapped (if that happens)
b) all get the same success/fail status

-Tony


Re: [PATCH v3] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-02-23 Thread Luck, Tony
On Tue, Feb 23, 2021 at 07:33:46AM -0800, Andy Lutomirski wrote:
> 
> > On Feb 23, 2021, at 4:44 AM, Aili Yao  wrote:
> > 
> > On Fri, 5 Feb 2021 17:01:35 +0800
> > Aili Yao  wrote:
> > 
> >> When one page is already hwpoisoned by MCE AO action, processes may not
> >> be killed, processes mapping this page may make a syscall include this
> >> page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
> >> mode it may be fixed by fixup_exception, current code will just return
> >> error code to user code.
> >> 
> >> This is not sufficient, we should send a SIGBUS to the process and log
> >> the info to console, as we can't trust the process will handle the error
> >> correctly.
> >> 
> >> Suggested-by: Feng Yang 
> >> Signed-off-by: Aili Yao 
> >> ---
> >> arch/x86/mm/fault.c | 62 +
> >> 1 file changed, 40 insertions(+), 22 deletions(-)
> >> 
> > Hi luto;
> >  Is there any feedback?
> 
> At the very least, this needs a clear explanation of why your proposed 
> behavior is better than the existing behavior.

The explanation is buried in that "can't trust the process" line.

E.g. user space isn't good about checking for failed write(2) syscalls.
So if the poison was in a user buffer passed to write(fd, buffer, count)
sending a SIGBUS would be the action if they read the poison directly,
so it seems reasonable to send the same signal if the kernel read their
poison for them.

It would avoid users that didn't check the return value merrily proceeding
as if everything was ok.

-Tony


RE: [PATCH v2] x86/mce: fix wrong no-return-ip logic in do_machine_check()

2021-02-23 Thread Luck, Tony
> What I think is qemu has no easy way to get the MCE signature from the host,
> or currently has no method for this. So qemu treats all AR as no-RIPV.
> Doing more is better than doing less.

RIPV would be important in the guest in the case where the guest can fix the 
problem that caused
the machine check and return to the failed instruction to continue.

I think the only case where this happens is a fault in a read-only page mapped
from a file (typically a code page, but could be a data page). In this case
memory_failure() unmaps the page with the poison but Linux can recover by
reading data from the file into a new page.

Other cases we send SIGBUS (so go to the signal handler instead of to the 
faulting instruction).

So it would be good if the state of RIPV could be added to the signal state 
sent to qemu. If that
isn't possible, then this full recovery case turns into another SIGBUS case.

-Tony


checkpatch warnings for references to earlier commits

2021-02-22 Thread Luck, Tony
Would it be possible to teach checkpatch not to warn about
canonical references to earlier commits?  E.g.

WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per 
line)
#7:
commit e80634a75aba ("EDAC, skx: Retrieve and print retry_rd_err_log registers")

Thanks

-Tony


RE: [PATCH] x86, sched: Allow NUMA nodes to share an LLC on Intel platforms

2021-02-09 Thread Luck, Tony
> +#define X86_BUG_NUMA_SHARES_LLC  X86_BUG(25) /* CPU may 
> enumerate an LLC shared by multiple NUMA nodes */

During internal review I wondered why this is a "BUG" rather than a "FEATURE" 
bit.

Apparently, the suggestion for "BUG" came from earlier community discussions.

Historically it may have seemed reasonable to say that a cache cannot span
NUMA domains. But with more and more things moving off the motherboard
and into the socket, this doesn't seem too weird now.

-Tony



RE: Pstore : Query on using ramoops driver for DDR

2021-02-09 Thread Luck, Tony
> Can we use existing backend pstore ram driver (fs/pstore/ram.c) for DDR 
> instead of SRAM ?

The expectation for pstore is that the system will go through a reset when it
crashes. Most systems do not preserve DDR contents across reset.

> Was the current driver written only to support persistant RAM like SRAM 
> or it can accept further change

Linux is in a constant state of change :-)

> to support DDR, If we have a mechanism to copy stored data from DDR to 
> external device after the crash.

See above about DDR contents.   But if you have a platform that does preserve
DDR contents until your "mechanism" can copy the pstore buffer, then post
a patch.

-Tony


Re: [PATCH 02/49] x86/cpu: Describe hybrid CPUs in cpuinfo_x86

2021-02-08 Thread Luck, Tony
On Mon, Feb 08, 2021 at 02:04:24PM -0500, Liang, Kan wrote:
> On 2/8/2021 12:56 PM, Borislav Petkov wrote:
> 
> I think it's good enough for perf, but I'm not sure whether other codes need
> the CPU type information.
> 
> Ricardo, do you know?
> 
> Maybe we should implement a generic function as below for this?
> (Not test. Just use as an example.)
> 
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index a66c1fd..679f5fe 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -2056,3 +2056,11 @@ void arch_smt_update(void)
>   /* Check whether IPI broadcasting can be enabled */
>   apic_smt_update();
>  }
> +
> +u32 x86_read_hybrid_type(void)
> +{
> + if (cpu_feature_enabled(X86_FEATURE_HYBRID_CPU))
> + return cpuid_eax(0x001a);
> +
> + return 0;
> +}

Machine check logging will want to include this in "struct mce".

But ok to pick it up with a function like you describe above.

-Tony


Re: [PATCH v3 1/7] seqnum_ops: Introduce Sequence Number Ops

2021-02-05 Thread Luck, Tony
On Fri, Feb 05, 2021 at 01:03:18PM -0700, Shuah Khan wrote:
> On 2/5/21 2:58 AM, Greg KH wrote:
> > On Wed, Feb 03, 2021 at 11:11:57AM -0700, Shuah Khan wrote:
> > > +static inline u32 seqnum32_inc(struct seqnum32 *seq)
> > > +{
> > > + atomic_t val = ATOMIC_INIT(seq->seqnum);
> > > +
> > > + seq->seqnum = (u32) atomic_inc_return(&val);
> > > + if (seq->seqnum >= UINT_MAX)
> > > + pr_info("Sequence Number overflow %u detected\n",
> > > + seq->seqnum);
> > > + return seq->seqnum;
> > 
> > As Peter points out, this is not doing what you think it is doing :(
> > 
> > Why do you not just have seq->seqnum be a real atomic variable?  Trying
> > to switch to/from one like this does not work as there is no
> > "atomic-ness" happening here at all.
> > 
> 
> Yes. This is sloppy on my part. As Peter and Rafael also pointed. I have
> to start paying more attention to my inner voice.

What's the end goal with this sequence number type?

Apart from the functional issues with this code already pointed
out, it doesn't seem overly useful to just print the *value* of
the sequence number when it hits the max value.  It might be
more helpful if each instance had a seq->name field that is
used here to tell the user *which* user of sequence numbers
had seen the wraparound.

But that just begs the question of what should users of sequence
numbers do when wrap around occurs? Any user that can run through
sequence numbers at greater than 10 Hz is going to cause a problem
for systems that stay running for years.

Maybe you should force users to code for wraparound by initializing
to something like 0xffffff00 so that they all get a wraparound event
quickly, rather than slowly (same as was done with jiffies)?
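
Concretely, the jiffies trick (INITIAL_JIFFIES) might translate to a sketch
like this, assuming the seqnum32 type is reworked around a real atomic_t as
suggested earlier in the thread:

#include <linux/atomic.h>
#include <linux/limits.h>
#include <linux/types.h>

struct seqnum32 {
	atomic_t seqnum;
};

/*
 * Start just below the wrap point so every user sees a wraparound
 * within the first few hundred increments, not after years of uptime.
 */
#define SEQNUM32_INIT	(UINT_MAX - 300)

static inline void seqnum32_init(struct seqnum32 *seq)
{
	atomic_set(&seq->seqnum, (int)SEQNUM32_INIT);
}

static inline u32 seqnum32_inc(struct seqnum32 *seq)
{
	return (u32)atomic_inc_return(&seq->seqnum);
}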


RE: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery

2021-02-02 Thread Luck, Tony
> So that "system hang or panic" which the validation folks triggered,
> that cannot be reproduced anymore? Did they run the latest version of
> the patch?

I will get the validation folks to run the latest version (and play around with
hyperthreading if they see problems).

-Tony


RE: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery

2021-02-02 Thread Luck, Tony
> And the much more important question is, what is the code supposed to
> do when that overflow *actually* happens in real life? Because IINM,
> an overflow condition on the same page would mean killing the task to
> contain the error and not killing the machine...

Correct. In the cases I've actually hit, the second machine check is on the
same address as the first. But from a recovery perspective Linux is going
to take away the whole page anyway ... so not complaining if the second
(or subsequent) access is within the same page makes sense (and that's
what the patch does).

The code can't handle it if a subsequent #MC is to a different page (because
we only have a single spot in the task structure to store the physical page
address).  But that looks adequate. If the code is wildly accessing different
pages *and* getting machine checks from those different pages ... then
something is very seriously wrong with the system.

-Tony



RE: [PATCH v2] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-02-01 Thread Luck, Tony
> In any case, this patch needs rebasing on top of my big fault series

Is that series in some GIT tree? Or can you give a lore.kernel.org link?

Thanks

-Tony


Re: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery

2021-02-01 Thread Luck, Tony
On Thu, Jan 28, 2021 at 06:57:35PM +0100, Borislav Petkov wrote:
> Crazy idea: if you still can reproduce on -rc3, you could bisect: i.e.,
> if you apply the patch on -rc3 and it explodes and if you apply the same
> patch on -rc5 and it works, then that could be a start... Yeah, don't
> have a better idea here. :-\

I tried reproducing (applied the original patch I posted back to -rc3) and
the same issue stubbornly refused to show up again.

But I did hit something with the same signature (overflow bit set in
bank 1) while running my futex test (which has two processes mapping
the poison page).  This time I *do* understand what happened.  The test
failed when the two processes were running on the two hyperthreads of
the same core. Seeing overflow in this case is understandable because
bank 1 MSRs on my test machine are shared between the HT threads. When
I run the test again using taskset(1) to allow running only on
thread 0 of each core, it keeps going for hundreds of iterations.

I'm not sure I can stitch together how this overflow also happened for
my single process test. Maybe a migration from one HT thread to the
other at an awkward moment?

-Tony


Re: [PATCH] x86/fault: Send SIGBUS to user process always for hwpoison page access.

2021-01-29 Thread Luck, Tony
Thanks for the explanation and test code. I think I see better what
is going on here.

[I took your idea for using madvise(...MADV_HWPOISON) and added a new "-S"
option to my einj_mem_uc test program to use madvise instead of ACPI/EINJ
for injections. Update pushed here:
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git ]

There have been some small changes to arch/x86/mm/fault.c since you wrote
the patch.  Can you rebase to v5.11-rc5?

Also maybe this might be a case to use IS_ENABLED() instead of #ifdef to
make the code a little less ugly. At least for the 2nd hunk in your patch
this would work well:

if (IS_ENABLED(CONFIG_MEMORY_FAILURE) &&
(fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)))
no_context(regs, error_code, address, SIGBUS, BUS_MCEERR_AR);
else
no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);

The first hunk might need a bit more thought.

-Tony


RE: linux-next: removal of the ia64 tree

2021-01-28 Thread Luck, Tony
Stephen,

Yes. Most stuff I do these days goes through the RAS tree I share with Boris.

-Tony

-----Original Message-----
From: Stephen Rothwell  
Sent: Thursday, January 28, 2021 11:53 AM
To: Luck, Tony 
Cc: Arnd Bergmann ; Linux Kernel Mailing List 
; Linux Next Mailing List 

Subject: linux-next: removal of the ia64 tree

Hi Tony,

I noticed commit

  96ec72a3425d ("ia64: Mark architecture as orphaned")

just went into Linus' tree, so I assume I should drop the ia64 tree
from linux-next?

-- 
Cheers,
Stephen Rothwell


Re: [PATCH] x86/fault: Send SIGBUS to user process always for hwpoison page access.

2021-01-28 Thread Luck, Tony
On Thu, Jan 28, 2021 at 07:43:26PM +0800, Aili Yao wrote:
> when one page is already hwpoisoned by AO action, process may not be
> killed, the process mapping this page may make a syscall include this
> page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
> mode it may be fixed by fixup_exception, current code will just return
> error code to user process.

Shouldn't the AO action that poisoned the page have also unmapped it?
> 
> This is not suffient, we should send a SIGBUS to the process and log the
> info to console, as we can't trust the process will handle the error
> correctly.

I agree with this part ... few apps check for -EFAULT and do the right
thing.  But I'm not sure how this happens. Can you provide a bit more
detail on the steps?

-Tony

P.S. Typo: s/suffient/sufficient/

> 
> Suggested-by: Feng Yang 
> Signed-off-by: Aili Yao 
> ---
>  arch/x86/mm/fault.c | 16 
>  1 file changed, 16 insertions(+)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f1f1b5a0956a..36d1e385512b 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -662,7 +662,16 @@ no_context(struct pt_regs *regs, unsigned long 
> error_code,
>* In this case we need to make sure we're not recursively
>* faulting through the emulate_vsyscall() logic.
>*/
> +#ifdef CONFIG_MEMORY_FAILURE
> + if (si_code == BUS_MCEERR_AR && signal == SIGBUS)
> + pr_err("MCE: Killing %s:%d due to hardware memory 
> corruption fault at %lx\n",
> + current->comm, current->pid, address);
> +
> + if ((current->thread.sig_on_uaccess_err && signal) ||
> + (si_code == BUS_MCEERR_AR && signal == SIGBUS)) {
> +#else
>   if (current->thread.sig_on_uaccess_err && signal) {
> +#endif
>   sanitize_error_code(address, _code);
>  
>   set_signal_archinfo(address, error_code);
> @@ -927,7 +936,14 @@ do_sigbus(struct pt_regs *regs, unsigned long 
> error_code, unsigned long address,
>  {
>   /* Kernel mode? Handle exceptions or die: */
>   if (!(error_code & X86_PF_USER)) {
> +#ifdef CONFIG_MEMORY_FAILURE
> + if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE))
> + no_context(regs, error_code, address, SIGBUS, 
> BUS_MCEERR_AR);
> + else
> + no_context(regs, error_code, address, SIGBUS, 
> BUS_ADRERR);
> +#else
>   no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
> +#endif
>   return;
>   }
>  
> -- 
> 2.25.1
> 


Re: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-27 Thread Luck, Tony
On Tue, Jan 26, 2021 at 12:03:14PM +0100, Borislav Petkov wrote:
> On Mon, Jan 25, 2021 at 02:55:09PM -0800, Luck, Tony wrote:
> > And now I've changed it back to non-atomic (but keeping the
> > slightly cleaner looking code style that I used for the atomic
> > version).  This one also works for thousands of injections and
> > recoveries.  Maybe take it now before it stops working again :-)
> 
> Hmm, so the only differences I see between your v4 and this are:
> 
> -@@ -1238,6 +1238,7 @@ static void __mc_scan_banks(struct mce *m, struct 
> pt_regs *regs, struct mce *fin
> +@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct 
> pt_regs *regs, struct mce *fin
>   
>   static void kill_me_now(struct callback_head *ch)
>   {
> ++struct task_struct *p = container_of(ch, struct task_struct, 
> mce_kill_me);
> ++
>  +p->mce_count = 0;
>   force_sig(SIGBUS);
>   }
> 
> Could the container_of() macro have changed something?

That change was to fix my brown paper bag moment (does not
compile without a variable named "p" in scope to be used on
the next line).

> Because we don't know yet (right?) why would it fail? Would it read
> stale ->mce_count data? If so, then a barrier is missing somewhere.

I don't see how a barrier would make a difference. In the common case
all this code is executed on the same logical CPU. Return from the
do_machine_check() tries to return to user mode and finds that there
is some "task_work" to execute first.

In some cases Linux might context switch to something else. Perhaps
this task even gets picked up by another CPU to run the task work
queued functions.  But I imagine that the context switch should act
as a barrier ... shouldn't it?

> Or what is the failure exactly?

After a few cycles of the test injection to user mode, I saw an
overflow in the machine check bank. As if it hadn't been cleared
from the previous iteration ... but all the banks are cleared as
soon as we find that the machine check is recoverable. A while before
getting to the code I changed.

When the tests were failing, code was on top of v5.11-rc3. Latest
experiments moved to -rc5.  There's just a tracing fix from
PeterZ between rc3 and rc5 to mce/core.c:

737495361d44 ("x86/mce: Remove explicit/superfluous tracing")

which doesn't appear to be a candidate for the problems I saw.

> Because if I take it now without us knowing what the issue is, it will
> start failing somewhere - Murphy's our friend - and then we'll have to
> deal with breaking people's boxes. Not fun.

Fair point.

> The other difference is:
> 
> @@ -76,8 +71,10 @@ index 13d3f1cbda17..5460c146edb5 100644
>  -current->mce_kflags = m->kflags;
>  -current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
>  -current->mce_whole_page = whole_page(m);
> ++int count = ++current->mce_count;
> ++
>  +/* First call, save all the details */
> -+if (current->mce_count++ == 0) {
> ++if (count == 1) {
>  +current->mce_addr = m->addr;
>  +current->mce_kflags = m->kflags;
>  +current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> 
> Hmm, a local variable and a pre-increment. Can that have an effect somehow?

This is the bit that changed during my detour using atomic_t mce_count.
I added the local variable to capture the value from atomic_inc_return(), then
used it later, instead of a bunch of atomic_read() calls.

I kept it this way because "if (count == 1)" is marginally easier to read
than "if (current->mce_count++ == 0)"

> > +   /* Ten is likley overkill. Don't expect more than two faults before 
> > task_work() */
> 
> Typo: likely.

Oops. Fixed.

-Tony


[PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-25 Thread Luck, Tony
On Thu, Jan 21, 2021 at 01:09:59PM -0800, Luck, Tony wrote:
> On Wed, Jan 20, 2021 at 01:18:12PM +0100, Borislav Petkov wrote:
> But, on a whim, I changed the type of mce_count from "int" to "atomic_t" and
> fixed up the increment & clear to use atomic_inc_return() and atomic_set().
> See updated patch below.
> 
> I can't see why this should have made any difference. But the "int" version
> crashes in a few dozen machine checks. The "atomic_t" version has run many
> thousands of machine checks (10213 in the current boot according to 
> /proc/interrupts)
> with no issues.

And now I've changed it back to non-atomic (but keeping the
slightly cleaner looking code style that I used for the atomic
version).  This one also works for thousands of injections and
recoveries.  Maybe take it now before it stops working again :-)

-Tony

From: Tony Luck 

Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT.  Also queue_task_work() sets up
so that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called it loops forever in this code:

do {
next = work->next;
work->func(work);
work = next;
cond_resched();
} while (work);

Add a counter (current->mce_count) to keep track of repeated machine checks
before task_work() is called. First machine check saves the address information
and calls task_work_add(). Subsequent machine checks before that task_work
call back is executed check that the address is in the same page as the first
machine check (since the callback will offline exactly one page).

Expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same address with
page faults enabled). Just in case there is some code that loops forever
enforce a limit of 10.

Signed-off-by: Tony Luck 
---
 arch/x86/kernel/cpu/mce/core.c | 35 +++---
 include/linux/sched.h  |  1 +
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e133ce1e562b..30dedffd6f88 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs 
*regs, struct mce *fin
 
 static void kill_me_now(struct callback_head *ch)
 {
+   struct task_struct *p = container_of(ch, struct task_struct, 
mce_kill_me);
+
+   p->mce_count = 0;
force_sig(SIGBUS);
 }
 
@@ -1246,6 +1249,7 @@ static void kill_me_maybe(struct callback_head *cb)
struct task_struct *p = container_of(cb, struct task_struct, 
mce_kill_me);
int flags = MF_ACTION_REQUIRED;
 
+   p->mce_count = 0;
pr_err("Uncorrected hardware memory error in user-access at %llx", 
p->mce_addr);
 
if (!p->mce_ripv)
@@ -1266,12 +1270,29 @@ static void kill_me_maybe(struct callback_head *cb)
}
 }
 
-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
 {
-   current->mce_addr = m->addr;
-   current->mce_kflags = m->kflags;
-   current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-   current->mce_whole_page = whole_page(m);
+   int count = ++current->mce_count;
+
+   /* First call, save all the details */
+   if (count == 1) {
+   current->mce_addr = m->addr;
+   current->mce_kflags = m->kflags;
+   current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+   current->mce_whole_page = whole_page(m);
+   }
+
+   /* Ten is likley overkill. Don't expect more than two faults before 
task_work() */
+   if (count > 10)
+   mce_panic("Too many machine checks while accessing user data", 
m, msg);
+
+   /* Second or later call, make sure page address matches the one from 
first call */
+   if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> 
PAGE_SHIFT))
+   mce_panic("Machine checks to different user pages", m, msg);
+
+   /* Do n

Re: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-21 Thread Luck, Tony
On Wed, Jan 20, 2021 at 01:18:12PM +0100, Borislav Petkov wrote:
> On Tue, Jan 19, 2021 at 03:57:59PM -0800, Luck, Tony wrote:
> > But the real validation folks took my patch and found that it has
> > destabilized cases 1 & 2 (and case 3 also chokes if you repeat a few
> > more times). System either hangs or panics. Generally before 100
> > injection/consumption cycles.
> 
> Hmm, that mce_count increment and check stuff might turn out to be
> fragile in the face of a race condition. But I don't see it...

I can't see what the race is either.  As far as I can see the flow
looks like this (for a machine check in user mode)

user-mode

do_machine_check() [in atomic context on IST etc.]

queue_task_work()
increments current->mce_count
first (only) call saves address etc. and calls 
task_work_add()
return from machine check

Plausibly we might context switch to another process here. We could even
resume on a different CPU.

do_task_work() [in process context]

kill_me_maybe()
sets mce_count back to zero, ready for another 
#MC
memory_failure()
send SIGBUS

The kernel get_user() case is similar, but may take extra machine checks
before the calling code decides get_user() really is going to keep saying -EFAULT.


But, on a whim, I changed the type of mce_count from "int" to "atomic_t" and
fixed up the increment & clear to use atomic_inc_return() and atomic_set().
See updated patch below.

I can't see why this should have made any difference. But the "int" version
crashes in a few dozen machine checks. The "atomic_t" version has run many
thousands of machine checks (10213 in the current boot according to 
/proc/interrupts)
with no issues.

-Tony

From d1b003f1de92f87c8dc53dd710fd79ad6e277364 Mon Sep 17 00:00:00 2001
From: Tony Luck 
Date: Wed, 20 Jan 2021 12:40:50 -0800
Subject: [PATCH] x86/mce: Avoid infinite loop for copy from user recovery

Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT.  Also queue_task_work() sets up
so that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called it loops forever in this code:

do {
next = work->next;
work->func(work);
work = next;
cond_resched();
} while (work);

Add a counter (current->mce_count) to keep track of repeated machine checks
before task_work() is called. First machine check saves the address information
and calls task_work_add(). Subsequent machine checks before that task_work
call back is executed check that the address is in the same page as the first
machine check (since the callback will offline exactly one page).

Expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same address with
page faults enabled). Just in case there is some code that loops forever
enforce a limit of 10.

Signed-off-by: Tony Luck 
---
 arch/x86/kernel/cpu/mce/core.c | 35 +++---
 include/linux/sched.h  |  1 +
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1cbda17..4a8660058b67 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs 
*regs, struct mce *fin
 
 static void kill_me_now(struct callback_head *ch)
 {
+   struct task_struct *p = container_of(ch, struct task_struct, 
mce_kill_me);
+
+   atomic_set(&p->mce_count, 0);
force_sig(SIGBUS);
 }
 
@@ -1246,6 +1249,7 @@ static void kill_me_maybe(struct callback_head *cb)
struct task_struct *p = container_of(cb, struct task_struct, 
mce_kill_me);
int flags = MF_ACTION_REQUIRED;
 
+   atomic_set(&p->mce_count, 0);
pr_err("Uncorrected hardware memory error in user-access at %llx", 
p->mce_addr);
 
if (!p->mce_ripv)
@@ -1266,12 +1270,29 @@ static void kill_me_maybe(struct callb

RE: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-20 Thread Luck, Tony
> Yah, some printk sprinkling might be a good start. But debugging in that
> atomic context is always nasty. ;-\

Some very light printk sprinkling (one message in queue_task_work() in atomic
context, one each in kill_me_now() and kill_me_maybe() to check when task_work
actually called them).

Cases 1 & 2 (user & normal copyin) now work just fine (hundreds of iterations).

Case 3 (my futex test) just hangs with only two characters making it to the 
serial port "[ "

Deeply strange.

-Tony


Re: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-19 Thread Luck, Tony
On Tue, Jan 19, 2021 at 11:56:32AM +0100, Borislav Petkov wrote:
> On Fri, Jan 15, 2021 at 03:23:46PM -0800, Luck, Tony wrote:
> > On Fri, Jan 15, 2021 at 12:51:03PM -0800, Luck, Tony wrote:
> > >  static void kill_me_now(struct callback_head *ch)
> > >  {
> > > + p->mce_count = 0;
> > >   force_sig(SIGBUS);
> > >  }
> > 
> > Brown paper bag time ... I just pasted that line from kill_me_maybe()
> > and I thought I did a re-compile ... but obviously not since it gives
> > 
> > error: ‘p’ undeclared (first use in this function)
> > 
> > Option a) (just like kill_me_maybe)
> > 
> > struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
> > 
Option b) (simpler ... not sure why PeterZ did the container_of thing)
> > 
> > current->mce_count = 0;
> 
> Right, he says it is the canonical way to get it out of callback_head.
> I don't think current will change while the #MC handler runs but we can
> adhere to the design pattern here and do container_of() ...

Ok ... I'll use the canonical way.

But now I've run into a weird issue. I'd run some basic tests with a
dozen machine checks in each of:
1) user access
2) kernel copyin
3) futex (multiple accesses from kernel before task_work())

and it passed my tests before I posted.

But the real validation folks took my patch and found that it has
destabilized cases 1 & 2 (and case 3 also chokes if you repeat a few
more times). System either hangs or panics. Generally before 100
injection/consumption cycles.

Their tests are still just doing one at a time (i.e. complete recovery
of one machine check before injecting the next error). So there aren't
any complicated race conditions.

So if you see anything obviously broken, let me know. Otherwise I'll
be poking around at the patch to figure out what is wrong.

-Tony


Re: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-15 Thread Luck, Tony
On Fri, Jan 15, 2021 at 12:51:03PM -0800, Luck, Tony wrote:
>  static void kill_me_now(struct callback_head *ch)
>  {
> + p->mce_count = 0;
>   force_sig(SIGBUS);
>  }

Brown paper bag time ... I just pasted that line from kill_me_maybe()
and I thought I did a re-compile ... but obviously not since it gives

error: ‘p’ undeclared (first use in this function)

Option a) (just like kill_me_maybe)

struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);

Option b) (simpler ... not sure why PeterZ did the container_of thing)

current->mce_count = 0;

-Tony


[PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-15 Thread Luck, Tony
Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT.  Also queue_task_work() sets up
so that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called it loops forever in this code:

do {
next = work->next;
work->func(work);
work = next;
cond_resched();
} while (work);

Add a counter (current->mce_count) to keep track of repeated machine checks
before task_work() is called. First machine check saves the address information
and calls task_work_add(). Subsequent machine checks before that task_work
call back is executed check that the address is in the same page as the first
machine check (since the callback will offline exactly one page).

Expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same address with
page faults enabled). Just in case there is some code that loops forever
enforce a limit of 10.

Signed-off-by: Tony Luck 
---

V4:
Fix bug with "||" where I meant "&&"
Update stale commit comment referring to mce_busy field
in task structure (now called mce_count).
Add some comments to queue_task_work()

 arch/x86/kernel/cpu/mce/core.c | 31 ---
 include/linux/sched.h  |  1 +
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1cbda17..5460c146edb5 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1238,6 +1238,7 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs 
*regs, struct mce *fin
 
 static void kill_me_now(struct callback_head *ch)
 {
+   p->mce_count = 0;
force_sig(SIGBUS);
 }
 
@@ -1246,6 +1247,7 @@ static void kill_me_maybe(struct callback_head *cb)
struct task_struct *p = container_of(cb, struct task_struct, 
mce_kill_me);
int flags = MF_ACTION_REQUIRED;
 
+   p->mce_count = 0;
pr_err("Uncorrected hardware memory error in user-access at %llx", 
p->mce_addr);
 
if (!p->mce_ripv)
@@ -1266,12 +1268,27 @@ static void kill_me_maybe(struct callback_head *cb)
}
 }
 
-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
 {
-   current->mce_addr = m->addr;
-   current->mce_kflags = m->kflags;
-   current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-   current->mce_whole_page = whole_page(m);
+   /* First call, save all the details */
+   if (current->mce_count++ == 0) {
+   current->mce_addr = m->addr;
+   current->mce_kflags = m->kflags;
+   current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+   current->mce_whole_page = whole_page(m);
+   }
+
+   /* Ten is likley overkill. Don't expect more than two faults before 
task_work() */
+   if (current->mce_count > 10)
+   mce_panic("Too many machine checks while accessing user data", 
m, msg);
+
+   /* Second or later call, make sure page address matches the one from 
first call */
+   if (current->mce_count > 1 && (current->mce_addr >> PAGE_SHIFT) != 
(m->addr >> PAGE_SHIFT))
+   mce_panic("Machine checks to different user pages", m, msg);
+
+   /* Do not call task_work_add() more than once */
+   if (current->mce_count > 1)
+   return;
 
if (kill_current_task)
current->mce_kill_me.func = kill_me_now;
@@ -1414,7 +1431,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
/* If this triggers there is no way to recover. Die hard. */
BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-   queue_task_work(, kill_current_task);
+   queue_task_work(, msg, kill_current_task);
 
} else {
/*
@@ -1432,7 +1449,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
}
 
if (m.kflags & MCE_IN_KERNEL_COPYIN)
-   queue_task_work(, kill_current_task);
+   queue_task_work(, msg, kill_current_task);
}
 out:
mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 

Re: [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-15 Thread Luck, Tony
On Fri, Jan 15, 2021 at 04:27:54PM +0100, Borislav Petkov wrote:
> On Thu, Jan 14, 2021 at 04:38:17PM -0800, Tony Luck wrote:
> > Add a "mce_busy" counter so that task_work_add() is only called once
> > per faulty page in this task.
> 
> Yeah, that sentence can be removed now too.

I will update with new name "mce_count" and some details.

> > -static void queue_task_work(struct mce *m, int kill_current_task)
> > +static void queue_task_work(struct mce *m, char *msg, int 
> > kill_current_task)
> 
> So this function gets called in the user mode MCE case too:
> 
>   if ((m.cs & 3) == 3) {
> 
>   queue_task_work(&m, msg, kill_current_task);
>   }
> 
> Do we want to panic for multiple MCEs to different addresses in user
> mode?

In the user mode case we should only bump mce_count to "1"
before task_work() gets called. It shouldn't hurt to do the
same checks. Maybe it will catch something weird - like an NMI
handler on return from the machine check doing a get_user() that
hits another machine check during the return from this machine check.

AndyL has made me extra paranoid. :-)

> > -   current->mce_addr = m->addr;
> > -   current->mce_kflags = m->kflags;
> > -   current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> > -   current->mce_whole_page = whole_page(m);
> > +   if (current->mce_count++ == 0) {
> > +   current->mce_addr = m->addr;
> > +   current->mce_kflags = m->kflags;
> > +   current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> > +   current->mce_whole_page = whole_page(m);
> > +   }
> > +
> 
>   /* Magic number should be large enough */
>
> > +   if (current->mce_count > 10)

Will add similar comment here ... and to other tests in this function
since it may not be obvious to me next year what I was thinking now :-)

> > +   if (current->mce_count > 10)
> > +   mce_panic("Too many machine checks while accessing user data", m, msg);
> > +
> > +   if (current->mce_count > 1 || (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
> > +   mce_panic("Machine checks to different user pages", m, msg);
> 
> Will this second part of the test expression, after the "||" ever hit?

No :-( This code is wrong. Should be "&&" not "||". Then it makes more sense.
Will fix for v4.

> In any case, what are you trying to catch with this? Two get_user() to
> different pages both catching MCEs?

Yes. Trying to catch two accesses to different pages. Need to do this
because kill_me_maybe() is only going to offline one page.

I'm not expecting that this would ever hit.  It means that calling code
took a machine check on one page and get_user() said -EFAULT. Then the
code decided to access a different page *and* that other page also triggered
a machine check.

> > +   /* Do not call task_work_add() more than once */
> > +   if (current->mce_count > 1)
> > +   return;
> 
> That won't happen either, AFAICT. It'll panic above.

With the s/||/&&/ above, we can get here.
> 
> Regardless, I like how this is all confined to the MCE code and there's
> no need to touch stuff outside...

Thanks for the review.

-Tony


Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-14 Thread Luck, Tony
On Thu, Jan 14, 2021 at 09:22:13PM +0100, Borislav Petkov wrote:
> On Mon, Jan 11, 2021 at 01:44:50PM -0800, Tony Luck wrote:
> > @@ -1431,8 +1433,11 @@ noinstr void do_machine_check(struct pt_regs *regs)
> > mce_panic("Failed kernel mode recovery", &m, msg);
> > }
> >  
> > -   if (m.kflags & MCE_IN_KERNEL_COPYIN)
> > +   if (m.kflags & MCE_IN_KERNEL_COPYIN) {
> > +   if (current->mce_busy)
> > +   mce_panic("Multiple copyin", &m, msg);
> 
> So this: we're currently busy handling the first MCE, why do we must
> panic?
> 
> Can we simply ignore all follow-up MCEs to that page?

If we s/all/some/ you are saying the same as Andy:
> So I tend to think that the machine check code should arrange to
> survive some reasonable number of duplicate machine checks.

> I.e., the page will get poisoned eventually and that poisoning is
> currently executing so all following MCEs are simply nothing new and we
> can ignore them.
> 
> It's not like we're going to corrupt more data - we already are
> "corrupting" whole 4K.
> 
> Am I making sense?
> 
> Because if we do this, we won't have to pay attention to any get_user()
> callers and whatnot - we simply ignore and the solution is simple and
> you won't have to touch any get_user() callers...

Changing get_user() is a can of worms. I don't think it's a very big can.
Perhaps two or three dozen places where code needs to change to account
for the -ENXIO return ... but touching a bunch of different subsystems
it is likely to take a while to get everyone in agreement.

I'll try out this new approach, and if it works, I'll post a v3 patch.

Thanks

-Tony


RE: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-13 Thread Luck, Tony
>> Maybe the other difference in approach between Andy and me is whether to
>> go for a solution that covers all the corner cases, or just make an
>> incremental improvement that allows for recovery in some useful subset of
>> remaining fatal cases, but still dies in other cases.
>
> Does that mean more core code surgery?

Yes. I need to look at other user accesses inside pagefault_disable/enable()
as likely spots where the code may continue after a machine check and
retry the access.  So expect some more "if (ret == -ENXIO) { do something to
give up gracefully }"
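
For illustration, the shape of such a call site might look like this (a
hedged sketch only: the function name is hypothetical, and -ENXIO as the
"machine check" return is the proposal under discussion, not merged
behavior):

static int fetch_user_word(u32 *dest, u32 __user *from)
{
	int ret;

	pagefault_disable();
	ret = __get_user(*dest, from);
	pagefault_enable();

	if (ret == -ENXIO)	/* machine check: poison consumed, SIGBUS queued */
		return ret;	/* give up gracefully, do not retry */

	return ret ? -EFAULT : 0;	/* plain page fault: caller may fix up and retry */
}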

>> I'm happy to replace error messages with ones that are more descriptive and
>> helpful to humans.
>
> Yap, that: "Multiple copyin" with something more understandable to users...

I'll work on it. We tend not to have essay-length messages as panic()
strings. But I can add a comment in the code there so that people who grep
whatever panic message we choose can get more details on what happened and
what to do.

-Tony


RE: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-13 Thread Luck, Tony
> I think you guys have veered off into the weeds with this. The current
> solution - modulo error messages not destined for humans :) - is not soo
> bad, considering the whole MCA trainwreck.
>
> Or am I missing something completely unacceptable?

Maybe the other difference in approach between Andy and me is whether to
go for a solution that covers all the corner cases, or just make an incremental
improvement that allows for recovery in some useful subset of remaining fatal
cases, but still dies in other cases.

I'm happy to replace error messages with ones that are more descriptive and
helpful to humans.

-Tony


Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-12 Thread Luck, Tony
On Tue, Jan 12, 2021 at 02:04:55PM -0800, Andy Lutomirski wrote:
> > But we know that the fault happend in a get_user() or copy_from_user() call
> > (i.e. an RIP with an extable recovery address).  Does context switch
> > access user memory?
> 
> No, but NMI can.
> 
> The case that would be very very hard to deal with is if we get an NMI just 
> before IRET/SYSRET and get #MC inside that NMI.
> 
> What we should probably do is have a percpu list of pending memory failure 
> cleanups and just accept that we’re going to sometimes get a second MCE (or 
> third or fourth) before we can get to it.
> 
> Can we do the cleanup from an interrupt?  IPI-to-self might be a credible 
> approach, if so.

You seem to be looking for a solution that is entirely contained within
the machine check handling code. Willing to allow for repeated machine
checks from the same poison address in order to achieve that.

I'm opposed to multiple machine checks.  Willing to make some changes
in core code to avoid repeated access to the same poison location.

We need a tie-breaker.

-Tony


Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-12 Thread Luck, Tony
On Tue, Jan 12, 2021 at 10:57:07AM -0800, Andy Lutomirski wrote:
> On Tue, Jan 12, 2021 at 10:24 AM Luck, Tony  wrote:
> >
> > On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote:
> > > Well, we need to do *something* when the first __get_user() trips the
> > > #MC.  It would be nice if we could actually fix up the page tables
> > > inside the #MC handler, but, if we're in a pagefault_disable() context
> > > we might have locks held.  Heck, we could have the pagetable lock
> > > held, be inside NMI, etc.  Skipping the task_work_add() might actually
> > > make sense if we get a second one.
> > >
> > > We won't actually infinite loop in pagefault_disable() context -- if
> > > we would, then we would also infinite loop just from a regular page
> > > fault, too.
> >
> > Fixing the page tables inside the #MC handler to unmap the poison
> > page would indeed be a good solution. But, as you point out, not possible
> > because of locks.
> >
> > Could we take a more drastic approach? We know that in this case the kernel
> > is accessing a user address for the current process. Could the machine
> > check handler just re-write %cr3 to point to a kernel-only page table[1].
> > I.e. unmap the entire current user process.
> 
> That seems scary, especially if we're in the middle of a context
> switch when this happens.  We *could* make it work, but I'm not at all
> convinced it's wise.

Scary? It's terrifying!

But we know that the fault happened in a get_user() or copy_from_user() call
(i.e. an RIP with an extable recovery address).  Does context switch
access user memory?

-Tony


Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-12 Thread Luck, Tony
On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote:
> Well, we need to do *something* when the first __get_user() trips the
> #MC.  It would be nice if we could actually fix up the page tables
> inside the #MC handler, but, if we're in a pagefault_disable() context
> we might have locks held.  Heck, we could have the pagetable lock
> held, be inside NMI, etc.  Skipping the task_work_add() might actually
> make sense if we get a second one.
> 
> We won't actually infinite loop in pagefault_disable() context -- if
> we would, then we would also infinite loop just from a regular page
> fault, too.

Fixing the page tables inside the #MC handler to unmap the poison
page would indeed be a good solution. But, as you point out, not possible
because of locks.

Could we take a more drastic approach? We know that in this case the kernel
is accessing a user address for the current process. Could the machine
check handler just re-write %cr3 to point to a kernel-only page table[1].
I.e. unmap the entire current user process.

Then any subsequent access to user space will page fault. Maybe have a
flag in the task structure to help the #PF handler understand what just
happened.

The code we execute in the task_work handler can restore %cr3

-Tony

[1] Does such a thing already exist? If not, I'd need some help/pointers
to create it.
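
A very rough sketch of that idea, just to make the proposal concrete (it
ignores PTI/PCID/SME details and the restore path; the task flag is
hypothetical and nothing like this exists today):

	/* In the #MC handler: cut off all user mappings */
	current->mce_user_unmapped = 1;		/* hypothetical flag for the #PF handler */
	write_cr3(__pa(init_mm.pgd));		/* kernel-only page tables */

	/* Later, in the task_work handler, after offlining the page: */
	current->mce_user_unmapped = 0;
	write_cr3(__pa(current->mm->pgd));	/* restore the user page tables */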


Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-12 Thread Luck, Tony
On Tue, Jan 12, 2021 at 09:00:14AM -0800, Andy Lutomirski wrote:
> > On Jan 11, 2021, at 2:21 PM, Luck, Tony  wrote:
> >
> > On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote:
> >>
> >>>> On Jan 11, 2021, at 1:45 PM, Tony Luck  wrote:
> >>>
> >>> Recovery action when get_user() triggers a machine check uses the fixup
> >>> path to make get_user() return -EFAULT.  Also queue_task_work() sets up
> >>> so that kill_me_maybe() will be called on return to user mode to send a
> >>> SIGBUS to the current process.
> >>>
> >>> But there are places in the kernel where the code assumes that this
> >>> EFAULT return was simply because of a page fault. The code takes some
> >>> action to fix that, and then retries the access. This results in a second
> >>> machine check.
> >>>
> >>> While processing this second machine check queue_task_work() is called
> >>> again. But since this uses the same callback_head structure that
> >>> was used in the first call, the net result is an entry on the
> >>> current->task_works list that points to itself.
> >>
> >> Is this happening in pagefault_disable context or normal sleepable fault 
> >> context?  If the latter, maybe we should reconsider finding a way for the 
> >> machine check code to do its work inline instead of deferring it.
> >
> > The first machine check is in pagefault_disable() context.
> >
> > static int get_futex_value_locked(u32 *dest, u32 __user *from)
> > {
> >int ret;
> >
> >pagefault_disable();
> >ret = __get_user(*dest, from);
> 
> I have very mixed feelings as to whether we should even try to recover
> from the first failure like this.  If we actually want to try to
> recover, perhaps we should instead arrange for the second MCE to
> recover successfully instead of panicking.

Well we obviously have to "recover" from the first machine check
in order to get to the second. Are you saying you don't like the
different return value from get_user()?

In my initial playing around with this I just had the second machine
check simply skip the task_work_add(). This worked for this case, but
only because there wasn't a third, fourth, etc. access to the poisoned
data. If the caller keeps peeking, then we'll keep taking more machine
checks - possibly forever.

Even if we do recover with just one extra machine check ... that's one
more than was necessary.

-Tony


Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

2021-01-11 Thread Luck, Tony
On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote:
> 
> > On Jan 11, 2021, at 1:45 PM, Tony Luck  wrote:
> > 
> > Recovery action when get_user() triggers a machine check uses the fixup
> > path to make get_user() return -EFAULT.  Also queue_task_work() sets up
> > so that kill_me_maybe() will be called on return to user mode to send a
> > SIGBUS to the current process.
> > 
> > But there are places in the kernel where the code assumes that this
> > EFAULT return was simply because of a page fault. The code takes some
> > action to fix that, and then retries the access. This results in a second
> > machine check.
> > 
> > While processing this second machine check queue_task_work() is called
> > again. But since this uses the same callback_head structure that
> > was used in the first call, the net result is an entry on the
> > current->task_works list that points to itself.
> 
> Is this happening in pagefault_disable context or normal sleepable fault 
> context?  If the latter, maybe we should reconsider finding a way for the 
> machine check code to do its work inline instead of deferring it.

The first machine check is in pagefault_disable() context.

static int get_futex_value_locked(u32 *dest, u32 __user *from)
{
int ret;

pagefault_disable();
ret = __get_user(*dest, from);
pagefault_enable();

return (ret == -ENXIO) ? ret : ret ? -EFAULT : 0;
}

-Tony


RE: [PATCH 2/2] futex, x86/mce: Avoid double machine checks

2021-01-08 Thread Luck, Tony
> Yeah, saw that, but that should be trivially fixable I'm thinking.

Trivial, maybe ... but then follows the audit of other get_user() calls:

git grep get_user\( | wc -l
2003

:-(

-Tony


RE: [PATCH 2/2] futex, x86/mce: Avoid double machine checks

2021-01-08 Thread Luck, Tony
> I think this is horrid; why can't we have it return something different
> then -EFAULT instead?

I did consider this ... but it appears that architectures aren't unified in the
return value from get_user()

Here's another function involved in the futex call chain leading to this:

static int get_futex_value_locked(u32 *dest, u32 __user *from)
{
int ret;

pagefault_disable();
ret = __get_user(*dest, from);
pagefault_enable();

return ret ? -EFAULT : 0;
}

It seems like the expectation here is just "zero or not" and we
don't care what the "not" value is ... just turn it into -EFAULT.

-Tony


RE: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs

2021-01-06 Thread Luck, Tony
> Please see below for an updated patch.

Yes. That worked:

[   78.946069] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143
[   78.946151] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143
[   78.946153] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler

I guess that more than one CPU hit the timeout and so your new message was
printed twice before the panic code took over?

Once again, the whole of socket 1 is MIA rather than just the pair of threads
on one of the cores there. But that's a useful improvement (eliminating the
other three sockets on this system).

Tested-by: Tony Luck 

-Tony


Re: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs

2021-01-06 Thread Luck, Tony
On Wed, Jan 06, 2021 at 11:17:08AM -0800, Paul E. McKenney wrote:
> On Wed, Jan 06, 2021 at 06:39:30PM +0000, Luck, Tony wrote:
> > > The "Timeout: Not all CPUs entered broadcast exception handler" message
> > > will appear from time to time given enough systems, but this message does
> > > not identify which CPUs failed to enter the broadcast exception handler.
> > > This information would be valuable if available, for example, in order to
> > correlate with other hardware-oriented error messages.  This commit
> > > therefore maintains a cpumask_t of CPUs that have entered this handler,
> > > and prints out which ones failed to enter in the event of a timeout.
> > 
> > I tried doing this a while back, but found that in my test case where I 
> > forced
> > an error that would cause both threads from one core to be "missing", the
> > output was highly unpredictable. Some random number of extra CPUs were
> > reported as missing. After I added some extra breadcrumbs it became clear
> > that pretty much all the CPUs (except the missing pair) entered 
> > do_machine_check(),
> > but some got hung up at various points beyond the entry point. My only 
> > theory
> > was that they were trying to snoop caches from the dead core (or access some
> > other resource held by the dead core) and so they hung too.
> > 
> > Your code is much neater than mine ... and perhaps works in other cases, but
> > maybe the message needs to allow for the fact that some of the cores that
> > are reported missing may just be collateral damage from the initial problem.
> 
> Understood.  The system is probably not in the best shape if this code
> is ever executed, after all.  ;-)
> 
> So how about like this?
> 
>   pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",

That looks fine.
> 
> Easy enough if so!
> 
> > If I get time in the next day or two, I'll run my old test against your 
> > code to
> > see what happens.

I got time today (plenty of meetings in which to run experiments in background).

This code:

-   if (mca_cfg.tolerant <= 1)
+   if (mca_cfg.tolerant <= 1) {
+       if (!cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
+               pr_info("%s: MCE holdout CPUs: %*pbl\n",
+                       __func__, cpumask_pr_args(&mce_missing_cpus));
mce_panic(msg, NULL, NULL);

didn't trigger ... so maybe that cpumask_andnot() didn't return the value you 
expected?

I added a:

+   pr_info("%s: MCE present CPUs: %*pbl\n", __func__, cpumask_pr_args(&mce_present_cpus));

to check that the mask was being set correctly, and saw:

[  219.329767] mce: mce_timed_out: MCE present CPUs: 0-23,48-119,144-191

So every core of socket 1 failed to show up for this test.
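
For reference, a minimal sketch of the mechanism being tested here, using
the names visible in the fragments above and with the polarity that would
actually fire (report when the andnot result is non-empty); the RFC's exact
hook points may differ:

static cpumask_t mce_present_cpus;
static cpumask_t mce_missing_cpus;

/* Each CPU marks itself on entry to the broadcast #MC handler ... */
static void mce_mark_present(void)
{
	cpumask_set_cpu(smp_processor_id(), &mce_present_cpus);
}

/* ... and on timeout, "missing" is every online CPU that never arrived. */
static void mce_report_holdouts(void)
{
	if (cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
		pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",
			__func__, cpumask_pr_args(&mce_missing_cpus));
}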

> For my own testing, is this still the right thing to use?
> 
>   https://github.com/andikleen/mce-inject

That fakes up errors (by hooking into the mce_rdmsr() code to return arbitrary
user supplied values).  The plus side of this is that you can fake any error
signature without needing special h/w or f/w. The downside is that it is all
fake and you can't create situations where some CPUs don't show up in the
machine check handler.
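
For reference, an mce-inject input file looks roughly like this (based on
the examples bundled with the tool; exact keywords may vary by version):

	CPU 0 BANK 4
	STATUS UNCORRECTED SRAO 0x17a
	MCGSTATUS RIPV MCIP
	ADDR 0x1234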

-Tony


RE: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs

2021-01-06 Thread Luck, Tony
> The "Timeout: Not all CPUs entered broadcast exception handler" message
> will appear from time to time given enough systems, but this message does
> not identify which CPUs failed to enter the broadcast exception handler.
> This information would be valuable if available, for example, in order to
> correlate with other hardware-oriented error messages.  This commit
> therefore maintains a cpumask_t of CPUs that have entered this handler,
> and prints out which ones failed to enter in the event of a timeout.

I tried doing this a while back, but found that in my test case where I forced
an error that would cause both threads from one core to be "missing", the
output was highly unpredictable. Some random number of extra CPUs were
reported as missing. After I added some extra breadcrumbs it became clear
that pretty much all the CPUs (except the missing pair) entered
do_machine_check(), but some got hung up at various points beyond the entry
point. My only theory
was that they were trying to snoop caches from the dead core (or access some
other resource held by the dead core) and so they hung too.

Your code is much neater than mine ... and perhaps works in other cases, but
maybe the message needs to allow for the fact that some of the cores that
are reported missing may just be collateral damage from the initial problem.

If I get time in the next day or two, I'll run my old test against your code to
see what happens.

-Tony


Re: [PATCH] ia64: fix xchg() warning

2021-01-05 Thread Luck, Tony
On Tue, Jan 05, 2021 at 02:17:41PM +0100, Arnd Bergmann wrote:
> On Mon, Jan 4, 2021 at 5:00 PM Luck, Tony  wrote:
> >
> > > I have not received any reply from the ia64 maintainers, I assume they 
> > > were
> > > both out of office for Christmas.
> >
> > I'm back in the office ... but have no working ia64 machines, nor time to 
> > look at patches :-(
> >
> > Should drop me from the MAINTAINERS file.
> 
> If you like, I can apply the patch below and take that through my
> asm-generic tree along with the two bug fixes I sent:
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 0f2e55faaf7f..b74093803154 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8432,11 +8432,8 @@ F: drivers/i3c/
>  F: include/linux/i3c/
> 
>  IA64 (Itanium) PLATFORM
> -M: Tony Luck 
> -M: Fenghua Yu 
>  L: linux-i...@vger.kernel.org
> -S: Odd Fixes
> -T: git git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git
> +S: Orphan
>  F: Documentation/ia64/
>  F: arch/ia64/
> 
> Is that what you had in mind? I see that Fenghua Yu has not been
> actively involved for a long time. If you are both out, that would
> make the port unmaintained, but that may actually help find someone
> else to either volunteer as a maintainer or pay someone if they
> have a commercial interest.

Yes. Fenghua has moved to working on other things, so that looks good.

Acked-by: Tony Luck 

-Tony


RE: [PATCH] ia64: fix xchg() warning

2021-01-04 Thread Luck, Tony
> I have not received any reply from the ia64 maintainers, I assume they were
> both out of office for Christmas.

I'm back in the office ... but have no working ia64 machines, nor time to
look at patches :-(

Should drop me from the MAINTAINERS file.

-Tony


RE: linux-next: Tree for Nov 19 (drivers/edac/igen6_edac.c)

2020-11-19 Thread Luck, Tony
> ../drivers/edac/igen6_edac.c: In function 'ecclog_nmi_handler':
> ../drivers/edac/igen6_edac.c:525:10: error: 'NMI_DONE' undeclared (first use in this function); did you mean 'DMI_NONE'?
>    return NMI_DONE;

This driver has a #include <linux/nmi.h>

But inside that file it says:

#if defined(CONFIG_HAVE_NMI_WATCHDOG)
#include <asm/nmi.h>
#endif

and the randconfig used doesn't set CONFIG_HAVE_NMI_WATCHDOG


Some options:

1) Drop that #ifdef from <linux/nmi.h>
It was introduced as part of this commit:
f2e0cff85ed1 ("kernel/watchdog: introduce arch_touch_nmi_watchdog()")
presumably for some good reason.

2) Make this edac driver select CONFIG_HAVE_NMI_WATCHDOG
Yuck!

3) Make this driver #include <asm/nmi.h> instead of <linux/nmi.h>
This fixes this build error, but I thought that general policy was to
use the <linux/nmi.h> version if it exists rather than the <asm/nmi.h> one.
Maybe that's ok here because this is an x86 specific driver?

I'm leaning toward option #3.
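
Option #3 would be a one-line change in the driver, roughly (sketch, not
the actual patch):

-#include <linux/nmi.h>	/* only pulls in <asm/nmi.h> under CONFIG_HAVE_NMI_WATCHDOG */
+#include <asm/nmi.h>	/* x86-specific; defines NMI_DONE unconditionally */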

-Tony


RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-13 Thread Luck, Tony
> Of course is this not only an x86 problem. Every architecture which
> supports virtualization has the same issue. ARM(64) has no way to tell
> for sure whether the machine runs bare metal either. No idea about the
> other architectures.

Sounds like a hypervisor problem. If the VMM provides perfect emulation
of every weird quirk of h/w, then it is OK to let the guest believe that it is
running on bare metal.

If it isn't perfect, then it should make sure the guest knows *for sure*, so
that the guest can take appropriate actions to avoid the sharp edges.

-Tony


[PATCH v2] x86/mce: Use "safe" MSR functions when enabling additional error logging

2020-11-10 Thread Luck, Tony
Booting as a guest under KVM results in error messages about
unchecked MSR access:

[6.814328][T0] unchecked MSR access error: RDMSR from 0x17f at rIP: 0x84483f16 (mce_intel_feature_init+0x156/0x270)

because KVM doesn't provide emulation for random model specific registers.

Switch to using rdmsrl_safe()/wrmsrl_safe() to avoid the message.

Reported-by: Qian Cai 
Fixes: 68299a42f842 ("x86/mce: Enable additional error logging on certain Intel CPUs")
Signed-off-by: Tony Luck 
---
 arch/x86/kernel/cpu/mce/intel.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index b47883e364b4..42e60ef16c3a 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -521,9 +521,10 @@ static void intel_imc_init(struct cpuinfo_x86 *c)
case INTEL_FAM6_SANDYBRIDGE_X:
case INTEL_FAM6_IVYBRIDGE_X:
case INTEL_FAM6_HASWELL_X:
-   rdmsrl(MSR_ERROR_CONTROL, error_control);
+   if (rdmsrl_safe(MSR_ERROR_CONTROL, &error_control))
+   return;
error_control |= 2;
-   wrmsrl(MSR_ERROR_CONTROL, error_control);
+   wrmsrl_safe(MSR_ERROR_CONTROL, error_control);
break;
}
 }
-- 
2.21.1



RE: [PATCH] x86/mce: Check for hypervisor before enabling additional error logging

2020-11-10 Thread Luck, Tony
I still think there is a reasonable case to claim that this usage is right
to check whether it is running as a guest.

Look at what it is trying to do ... change the behavior of the platform
w.r.t. logging of memory errors.  How does that make any sense for a guest
... that doesn't even know what memory is present on the platform. Or have
guarantees that what it sees as memory address 0x12345678 maps to the same
set of cells in a DRAM from one second to the next?

-Tony


RE: [tip: ras/core] x86/mce: Enable additional error logging on certain Intel CPUs

2020-11-09 Thread Luck, Tony
> I wouldn't expect all hypervisors to necessarily set CPUID.01H:ECX[bit
> 31]. Architecturally, on Intel CPUs, that bit is simply defined as
> "not used." There is no documented contract between Intel and
> hypervisor vendors regarding the use of that bit. (AMD, on the other
> hand, *does* document that bit as "reserved for use by hypervisor to
> indicate guest status.")

Maybe no contract ... but there are a bunch of places (many of them in Intel
specific code) that check for it.  Perhaps I should poke the owners
of the Intel SDM to accept the inevitable.

$ git grep "boot_cpu_has(X86_FEATURE_HYPERVISOR"
arch/x86/events/core.c: if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
arch/x86/events/intel/core.c:   if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/events/intel/core.c:   int assume = 3 * !boot_cpu_has(X86_FEATURE_HYPERVISOR);
arch/x86/events/intel/cstate.c: if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/events/intel/uncore.c: if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/apic/apic.c:if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/cpu/bugs.c: if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/cpu/bugs.c: else if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/cpu/bugs.c: if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
arch/x86/kernel/cpu/bugs.c: if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
arch/x86/kernel/cpu/intel.c:if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/cpu/mshyperv.c: if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/cpu/vmware.c: * If !boot_cpu_has(X86_FEATURE_HYPERVISOR), vmware_hypercall_mode
arch/x86/kernel/cpu/vmware.c:   if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
arch/x86/kernel/jailhouse.c:!boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/kvm.c:  if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/paravirt.c: if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
arch/x86/kernel/tsc.c:  if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
arch/x86/mm/init_64.c:  if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
drivers/acpi/processor_idle.c:  if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h:   return boot_cpu_has(X86_FEATURE_HYPERVISOR);
drivers/gpu/drm/i915/i915_memcpy.c: !boot_cpu_has(X86_FEATURE_HYPERVISOR))
drivers/gpu/drm/radeon/radeon_device.c: return boot_cpu_has(X86_FEATURE_HYPERVISOR);
drivers/visorbus/visorchipset.c:if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {

-Tony


[PATCH] x86/mce: Check for hypervisor before enabling additional error logging

2020-11-09 Thread Luck, Tony
Booting as a guest under KVM results in error messages about
unchecked MSR access:

[6.814328][T0] unchecked MSR access error: RDMSR from 0x17f at rIP: 0x84483f16 (mce_intel_feature_init+0x156/0x270)

because KVM doesn't provide emulation for random model specific registers.

Check for X86_FEATURE_HYPERVISOR and skip trying to enable the mode (a
guest shouldn't be concerned with corrected errors anyway).

Reported-by: Qian Cai 
Fixes: 68299a42f842 ("x86/mce: Enable additional error logging on certain Intel CPUs")
Signed-off-by: Tony Luck 
---
 arch/x86/kernel/cpu/mce/intel.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index b47883e364b4..7f7d863400b7 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -517,6 +517,9 @@ static void intel_imc_init(struct cpuinfo_x86 *c)
 {
u64 error_control;
 
+   if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+   return;
+
switch (c->x86_model) {
case INTEL_FAM6_SANDYBRIDGE_X:
case INTEL_FAM6_IVYBRIDGE_X:
-- 
2.21.1



RE: [tip: ras/core] x86/mce: Enable additional error logging on certain Intel CPUs

2020-11-09 Thread Luck, Tony
> I thought Linux had long ago gone the route of turning rdmsr/wrmsr
> into rdmsr_safe/wrmsr_safe, so that the guest would ignore the #GPs on
> writes and return zero to the caller for #GPs on reads.

Linux just switched that around for the machine check banks ... if they #GP
fault, then something is seriously wrong.

Maybe that isn't a general change of direction though. Perhaps I
should either use rdmsrl_safe() in this code. Or (better?) add

if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
return;

to the start of intel_imc_init().

-Tony


RE: [tip: ras/core] x86/mce: Enable additional error logging on certain Intel CPUs

2020-11-09 Thread Luck, Tony
What does KVM do with model specific MSRs?

Looks like you let the guest believe it was running on one of Sandy Bridge, Ivy 
Bridge, Haswell (Xeon).

So, the core MCE code tried to enable extended error reporting.

If there is a mode to have KVM let the guest think that it read/wrote MSR 0x17F,
but actually doesn't do it ... that would seem to be a reasonable thing to do
here.
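
(For what it's worth, KVM does have a host-side knob approximating this --
assuming the standard module parameter is available on the host kernel:

	# unhandled guest RDMSR returns 0, WRMSR is ignored, instead of injecting #GP
	echo 1 > /sys/module/kvm/parameters/ignore_msrs

though that silently changes behavior for all unhandled MSRs, not just this
one.)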

-Tony


Re: [PATCH] x86/mce: Enable additional error logging on certain Intel CPUs

2020-10-30 Thread Luck, Tony
On Fri, Oct 30, 2020 at 12:04:03PM -0700, Luck, Tony wrote:

Bah, didn't notice this conversation didn't include LKML.

> The Xeon versions of Sandy Bridge, Ivy Bridge and Haswell support an
> optional additional error logging mode which is enabled by an MSR.
> 
> Previously this mode was enabled from the mcelog(8) tool via /dev/cpu,
> but the kernel is now very picky about which MSRs may be written. So
> move the enabling into the kernel.
> 
> Suggested-by: Boris Petkov 
> Signed-off-by: Tony Luck 
> ---
> 
> N.B. I don't have any of these old systems in my lab any more. So
> this is untested :-(
> 
>  arch/x86/include/asm/msr-index.h |  1 +
>  arch/x86/kernel/cpu/mce/intel.c  | 20 
>  2 files changed, 21 insertions(+)
> 
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 972a34d93505..b2dd2648c0e2 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -139,6 +139,7 @@
>  #define MSR_IA32_MCG_CAP 0x0179
>  #define MSR_IA32_MCG_STATUS  0x017a
>  #define MSR_IA32_MCG_CTL 0x017b
> +#define MSR_ERROR_CONTROL 0x017f
>  #define MSR_IA32_MCG_EXT_CTL 0x04d0
>  
>  #define MSR_OFFCORE_RSP_00x01a6
> diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
> index abe9fe0fb851..b47883e364b4 100644
> --- a/arch/x86/kernel/cpu/mce/intel.c
> +++ b/arch/x86/kernel/cpu/mce/intel.c
> @@ -509,12 +509,32 @@ static void intel_ppin_init(struct cpuinfo_x86 *c)
>   }
>  }
>  
> +/*
> + * Enable additional error logs from the integrated
> + * memory controller on processors that support this.
> + */
> +static void intel_imc_init(struct cpuinfo_x86 *c)
> +{
> + u64 error_control;
> +
> + switch (c->x86_model) {
> + case INTEL_FAM6_SANDYBRIDGE_X:
> + case INTEL_FAM6_IVYBRIDGE_X:
> + case INTEL_FAM6_HASWELL_X:
> + rdmsrl(MSR_ERROR_CONTROL, error_control);
> + error_control |= 2;
> + wrmsrl(MSR_ERROR_CONTROL, error_control);
> + break;
> + }
> +}
> +
>  void mce_intel_feature_init(struct cpuinfo_x86 *c)
>  {
>   intel_init_thermal(c);
>   intel_init_cmci();
>   intel_init_lmce();
>   intel_ppin_init(c);
> + intel_imc_init(c);
>  }
>  
>  void mce_intel_feature_clear(struct cpuinfo_x86 *c)
> -- 
> 2.21.1
> 


[PLEASE PULL] ia64 for v5.10

2020-10-12 Thread Luck, Tony
The following changes since commit f4d51dffc6c01a9e94650d95ce0104964f8ae822:

  Linux 5.9-rc4 (2020-09-06 17:11:40 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git tags/ia64_for_5.10

for you to fetch changes up to c331649e637152788b0ca1c857d0c2eaf34fcbc3:

  ia64: Use libata instead of the legacy ide driver in defconfigs (2020-09-11 09:38:00 -0700)


Cleanups by Christoph
1) Switch to libata instead of legacy ide driver
2) Drop ia64 perfmon (it's been broken for a while)


Christoph Hellwig (2):
  ia64: Remove perfmon
  ia64: Use libata instead of the legacy ide driver in defconfigs

 arch/ia64/Kconfig |9 -
 arch/ia64/configs/bigsur_defconfig|8 +-
 arch/ia64/configs/generic_defconfig   |   10 +-
 arch/ia64/configs/gensparse_defconfig |   10 +-
 arch/ia64/configs/tiger_defconfig |   10 +-
 arch/ia64/configs/zx1_defconfig   |8 +-
 arch/ia64/include/asm/processor.h |   10 -
 arch/ia64/include/asm/switch_to.h |   10 +-
 arch/ia64/kernel/Makefile |3 +-
 arch/ia64/kernel/irq_ia64.c   |7 -
 arch/ia64/kernel/perfmon.c| 6703 -
 arch/ia64/kernel/process.c|   53 -
 arch/ia64/kernel/ptrace.c |   24 -
 arch/ia64/kernel/smpboot.c|8 -
 arch/ia64/kernel/syscalls/syscall.tbl |2 +-
 arch/ia64/lib/Makefile|1 -
 arch/ia64/lib/carta_random.S  |   55 -
 arch/ia64/oprofile/Makefile   |1 -
 arch/ia64/oprofile/init.c |   12 +-
 arch/ia64/oprofile/perfmon.c  |   99 -
 20 files changed, 22 insertions(+), 7021 deletions(-)
 delete mode 100644 arch/ia64/kernel/perfmon.c
 delete mode 100644 arch/ia64/lib/carta_random.S
 delete mode 100644 arch/ia64/oprofile/perfmon.c


RE: [PATCH v3 4/6] x86/mce: Avoid tail copy when machine check terminated a copy from user

2020-10-07 Thread Luck, Tony
>> Machine checks are more serious. Just give up at the point where the
>> main copy loop triggered the #MC and return from the copy code as if
>> the copy succeeded. The machine check handler will use task_work_add() to
>> make sure that the task is sent a SIGBUS.
>
> Isn't that just plain wrong?

It isn't pretty. I'm not sure how wrong it is.

> If copy is reported as succeeding the kernel code will use the 'old'
> data that is in the buffer as if it had been read from userspace.
> This could end up with kernel stack data being written to a file.

I ran a test with:

write(fd, buf, 512)

With poison injected into buf[256] to force a machine check mid-copy.

The size of the file did get incremented by 512 rather than 256. Which isn't
good.

The data in the file up to the 256 byte mark was the user data from
buf[0 ... 255].

The data in the file past offset 256 was all zeroes. I suspect that isn't by
chance. The kernel has to defend against a user writing a partial page and
using mmap(2) on the same file to peek at data past EOF and up to the next
PAGE_SIZE boundary. So I think it must zero new pages allocated in page cache
as they are allocated to a file.
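
The experiment in outline (hedged sketch: the poison-arming step is platform
specific and stubbed out here as a hypothetical helper, e.g. something built
on the ACPI EINJ interface):

	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	extern void inject_poison(void *addr);	/* hypothetical injection helper */

	int main(void)
	{
		char *buf = aligned_alloc(4096, 4096);
		int fd = open("testfile", O_CREAT | O_TRUNC | O_WRONLY, 0644);

		inject_poison(buf + 256);	/* arm poison at buf[256] */
		write(fd, buf, 512);		/* #MC fires mid-copy at offset 256 */
		return 0;
	}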

> Even zeroing the rest of the kernel buffer is wrong.

It wouldn't help/change anything.

> IIRC the code to try to maximise the copy has been removed.
> So the 'slow' retry won't happen any more.

Which code has been removed (and when ... TIP, and my testing, is based on 
5.9-rc1)

-Tony



Re: [PATCH v2 6/7] x86/mce: Recover from poison found while copying from user space

2020-10-05 Thread Luck, Tony
On Mon, Oct 05, 2020 at 06:32:47PM +0200, Borislav Petkov wrote:
> On Wed, Sep 30, 2020 at 04:26:10PM -0700, Tony Luck wrote:
> >  arch/x86/kernel/cpu/mce/core.c | 33 +
> >  include/linux/sched.h  |  2 ++
> >  2 files changed, 23 insertions(+), 12 deletions(-)
> 
> Isn't that just simpler?

Yes. A helper function avoids making the code a mess of if/else with subtle
fall through cases.

> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 4d2cf08820af..dc6c83aa2ec1 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1261,6 +1261,21 @@ static void kill_me_maybe(struct callback_head *cb)
>   kill_me_now(cb);
>  }
>  
> +static inline void queue_task_work(struct mce *m, int kill_it)

Does it need to be "inline" though? I hope machine check processing
is never in the critical code path for anyone!

> +{
> + current->mce_addr = m->addr;
> + current->mce_kflags = m->kflags;
> + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> + current->mce_whole_page = whole_page(m);
> +
> + if (kill_it)
> + current->mce_kill_me.func = kill_me_now;
> + else
> + current->mce_kill_me.func = kill_me_maybe;
> +
> + task_work_add(current, &current->mce_kill_me, true);
> +}
> +
>  /*
>   * The actual machine check handler. This only handles real
>   * exceptions when something got corrupted coming in through int 18.
> @@ -1402,13 +1417,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
>   /* If this triggers there is no way to recover. Die hard. */
>   BUG_ON(!on_thread_stack() || !user_mode(regs));
>  
> - current->mce_addr = m.addr;
> - current->mce_ripv = !!(m.mcgstatus & MCG_STATUS_RIPV);
> - current->mce_whole_page = whole_page(&m);
> - current->mce_kill_me.func = kill_me_maybe;
> - if (kill_it)
> - current->mce_kill_me.func = kill_me_now;
> - task_work_add(current, &current->mce_kill_me, true);
> + queue_task_work(&m, kill_it);
> +
>   } else {
>   /*
>* Handle an MCE which has happened in kernel space but from
> @@ -1423,6 +1433,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
>   if (!fixup_exception(regs, X86_TRAP_MC, 0, 0))
>   mce_panic("Failed kernel mode recovery", &m, msg);
>   }
> +
> + if (m.kflags & MCE_IN_KERNEL_COPYIN)
> + queue_task_work(&m, kill_it);
>   }
>  out:
>   mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);

-Tony


Re: [PATCH 0/3] x86: Add initial support to discover Intel hybrid CPUs

2020-10-02 Thread Luck, Tony
On Sat, Oct 03, 2020 at 03:39:29AM +0200, Thomas Gleixner wrote:
> On Fri, Oct 02 2020 at 13:19, Ricardo Neri wrote:
> > Add support to discover and enumerate CPUs in Intel hybrid parts. A hybrid
> > part has CPUs with more than one type of micro-architecture. Thus, certain
> > features may only be present in a specific CPU type.
> >
> > It is useful to know the type of CPUs present in a system. For instance,
> > perf may need to handle CPUs differently depending on the type of micro-
> > architecture. Decoding machine check error logs may need the additional
> > micro-architecture type information, so include that in the log.
> 
> 'It is useful' as justification just makes me barf.

This isn't "hetero" ... all of the cores are architecturally the same.
If CPUID says that some feature is supported, then it will be supported
on all of the cores.

There might be some model specific performance counter events that only
apply to some cores. Or a machine check error code that is logged in the
model specific MSCOD field of IA32_MCi_STATUS. But any and all code can run
on any core.

Sure there will be some different power/performance tradeoffs on some
cores. But we already have that with some cores able to achieve higher
turbo frequencies than others.
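
For the curious, the core type enumeration can also be read from user space;
a minimal sketch using CPUID leaf 0x1A (the SDM's Hybrid Information leaf;
the 0x20/0x40 encodings are taken from public documentation, so treat them
as assumptions):

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* __get_cpuid_count() returns 0 if leaf 0x1a is not supported */
		if (__get_cpuid_count(0x1a, 0, &eax, &ebx, &ecx, &edx) && (eax >> 24))
			printf("core type: %#x (0x20 = Atom, 0x40 = Core)\n", eax >> 24);
		else
			printf("not a hybrid part\n");
		return 0;
	}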

-Tony


  1   2   3   4   5   6   7   8   9   10   >