Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock

Michael S. Tsirkin Tue, 16 Jun 2026 14:36:08 -0700

On Tue, Jun 16, 2026 at 02:18:57PM +0200, David Hildenbrand (Arm) wrote:
> On 6/16/26 13:40, Miaohe Lin wrote:
> > On 2026/6/16 14:56, David Hildenbrand (Arm) wrote:
> >>>
> >>> These non-atomics are defined and used because they want to avoid atomic 
> >>> ops overhead?
> >>> So I'm afraid using rcu read lock in these places would lead to 
> >>> unexpected overhead.
> >>
> >> It should be cheaper than atomics IIUC. Further, I assume that some pages 
> >> could
> >> batch over multiple such operations (esp. page freeing path when we 
> >> process tail
> >> pages).
> >>
> >> With !CONFIG_PREEMPT_RCU it's simply preempt_disable()/preempt_enable(), 
> >> which
> >> is either a NOP or just adjusting the preempt counter of the current 
> >> thread. Cheap.
> >>
> >> With CONFIG_PREEMPT_RCU we mostly increment 
> >> current->rcu_read_lock_nesting. But
> >> there might be a function call involved (did not look into the details). 
> >> So that
> >> variant should be slightly more expensive.
> > 
> > I scanned the code and found rcu_read_unlock_special might be called in 
> > some cases.
> > Some expensive ops, e.g. irq_work_queue_on, might be called in some corner 
> > cases.
> > So the overhead of rcu read lock might be fluctuating.
> 
> Right. Usually rcu_read_lock+unlock is supposed to be very lightweight, but 
> that
> might not be completely the case with that PREEMPT_RCU thingy ...
> 
> > 
> >>
> >> We'd have to measure what an addition rcu read lock would cost in there. 
> >> that
> >> should be fairly easy to benchmark.
> > 
> > Sure. We can do that if needed.
> > 
> >>
> >>>
> >>> I think this is a good idea, although there are some remaining issues.
> >>> But such race should be really rare, is it worth all this effort? Could we
> >>> simply aim to resolve, not to be flawless? I.e. could we simply check
> >>> and re-set the hwpoison flag at the end of memory_failure handling to
> >>> simply avoid losing hwpoison flag as a best-effort attempt? Would it be
> >>> acceptable?
> >>
> >> Hacky. Sufficient for the hypervisor to suspend the nonatomic-setting CPU 
> >> at the
> >> wrong time to still trigger the same behavior.
> > 
> > Right. hypervisor could make the issue easier to trigger...
> > 
> >>
> >> I think, either we fix it properly, or we redesign hwpoison handling to 
> >> deal
> >> with setting/clearing becoming stale at some random point in the future.
> > 
> > I think your proposal, although there are still some issues to be resolved, 
> > is
> > nevertheless a good solution. We could also wait and see if anyone comes up 
> > with
> > a better one.
> 
> I wouldn't call it "good" ... it's the only thing I was easily able to come up
> with :)
> 
> The only alternative would be moving the hwpoison bit out of page->flags,
> storing it in a sparse bitmap or sth. like that. It would be a bigger rework 
> and
> I am sure there are issues with that as well.
> 
> -- 
> Cheers,
> 
> David



I had a vague feeling using static keys should be possible somehow,
but could not come up with anything robust.
So - like this? Untested.

--->

mm: memory-failure: use RCU and static key to fix HWPoison flag race

Non-atomic page flag operations (page->flags.f &= ~mask, __set_bit,
__clear_bit) can race with atomic TestSetPageHWPoison() in
memory_failure().  The non-atomic RMW reads flags, memory_failure()
atomically sets HWPoison, then the RMW writes back the old value
without HWPoison -- clobbering the bit.

Fix this by wrapping all non-atomic page flag operations in
rcu_read_lock/rcu_read_unlock via the hwpoison_safe() macro
(CONFIG_MEMORY_FAILURE only, skipped early boot via rcu_is_watching()).
memory_failure() then calls synchronize_rcu() to drain in-flight
non-atomic operations, and retries TestSetPageHWPoison() until the
bit sticks.

Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM 
v7")
Signed-off-by: Michael S. Tsirkin <[email protected]>
Assisted-by: Claude:claude-opus-4-6

---

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 06bbe9eba636..e607a77c1627 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2343,7 +2343,7 @@ int folio_xchg_last_cpupid(struct folio *folio, int 
cpupid);
 
 static inline void page_cpupid_reset_last(struct page *page)
 {
-       page->flags.f |= LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT;
+       set_page_flags_safe(page, LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
 }
 #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
 
@@ -2503,8 +2503,8 @@ static inline struct zone *folio_zone(const struct folio 
*folio)
 #ifdef SECTION_IN_PAGE_FLAGS
 static inline void set_page_section(struct page *page, unsigned long section)
 {
-       page->flags.f &= ~(SECTIONS_MASK << SECTIONS_PGSHIFT);
-       page->flags.f |= (section & SECTIONS_MASK) << SECTIONS_PGSHIFT;
+       clear_page_flags_safe(page, SECTIONS_MASK << SECTIONS_PGSHIFT);
+       set_page_flags_safe(page, (section & SECTIONS_MASK) << 
SECTIONS_PGSHIFT);
 }
 
 static inline unsigned long memdesc_section(memdesc_flags_t mdf)
@@ -2719,14 +2719,14 @@ static inline bool folio_is_longterm_pinnable(struct 
folio *folio)
 
 static inline void set_page_zone(struct page *page, enum zone_type zone)
 {
-       page->flags.f &= ~(ZONES_MASK << ZONES_PGSHIFT);
-       page->flags.f |= (zone & ZONES_MASK) << ZONES_PGSHIFT;
+       clear_page_flags_safe(page, ZONES_MASK << ZONES_PGSHIFT);
+       set_page_flags_safe(page, (zone & ZONES_MASK) << ZONES_PGSHIFT);
 }
 
 static inline void set_page_node(struct page *page, unsigned long node)
 {
-       page->flags.f &= ~(NODES_MASK << NODES_PGSHIFT);
-       page->flags.f |= (node & NODES_MASK) << NODES_PGSHIFT;
+       clear_page_flags_safe(page, NODES_MASK << NODES_PGSHIFT);
+       set_page_flags_safe(page, (node & NODES_MASK) << NODES_PGSHIFT);
 }
 
 static inline void set_page_links(struct page *page, enum zone_type zone,
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7223f6f4e2b4..e896d47d0031 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -9,6 +9,7 @@
 #include <linux/types.h>
 #include <linux/bug.h>
 #include <linux/mmdebug.h>
+#include <linux/rcupdate.h>
 #ifndef __GENERATING_BOUNDS_H
 #include <linux/mm_types.h>
 #include <generated/bounds.h>
@@ -404,6 +405,38 @@ static unsigned long *folio_flags(struct folio *folio, 
unsigned n)
 #define FOLIO_HEAD_PAGE                0
 #define FOLIO_SECOND_PAGE      1
 
+/*
+ * Non-atomic page flag operations (__set_bit, __clear_bit, flags &= ~mask)
+ * can race with atomic TestSetPageHWPoison() in memory_failure().
+ * Wrap non-atomic ops in rcu_read_lock so that synchronize_rcu() in
+ * memory_failure() drains in-flight callers.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+#define hwpoison_safe(op) do {                                         \
+       if (rcu_is_watching()) {                                        \
+               rcu_read_lock();                                        \
+               op;                                                     \
+               rcu_read_unlock();                                      \
+       } else {                                                        \
+               op;                                                     \
+       }                                                               \
+} while (0)
+#else
+#define hwpoison_safe(op) do { op; } while (0)
+#endif
+
+static __always_inline void clear_page_flags_safe(struct page *page,
+                                                 unsigned long mask)
+{
+       hwpoison_safe(page->flags.f &= ~mask);
+}
+
+static __always_inline void set_page_flags_safe(struct page *page,
+                                               unsigned long mask)
+{
+       hwpoison_safe(page->flags.f |= mask);
+}
+
 /*
  * Macros to create function definitions for page flags
  */
@@ -421,11 +454,11 @@ static __always_inline void folio_clear_##name(struct 
folio *folio)       \
 
 #define __FOLIO_SET_FLAG(name, page)                                   \
 static __always_inline void __folio_set_##name(struct folio *folio)    \
-{ __set_bit(PG_##name, folio_flags(folio, page)); }
+{ hwpoison_safe(__set_bit(PG_##name, folio_flags(folio, page))); }
 
 #define __FOLIO_CLEAR_FLAG(name, page)                                 \
 static __always_inline void __folio_clear_##name(struct folio *folio)  \
-{ __clear_bit(PG_##name, folio_flags(folio, page)); }
+{ hwpoison_safe(__clear_bit(PG_##name, folio_flags(folio, page))); }
 
 #define FOLIO_TEST_SET_FLAG(name, page)                                        
\
 static __always_inline bool folio_test_set_##name(struct folio *folio) \
@@ -458,12 +491,12 @@ static __always_inline void ClearPage##uname(struct page 
*page)           \
 #define __SETPAGEFLAG(uname, lname, policy)                            \
 __FOLIO_SET_FLAG(lname, FOLIO_##policy)                                        
\
 static __always_inline void __SetPage##uname(struct page *page)                
\
-{ __set_bit(PG_##lname, &policy(page, 1)->flags.f); }
+{ hwpoison_safe(__set_bit(PG_##lname, &policy(page, 1)->flags.f)); }
 
 #define __CLEARPAGEFLAG(uname, lname, policy)                          \
 __FOLIO_CLEAR_FLAG(lname, FOLIO_##policy)                              \
 static __always_inline void __ClearPage##uname(struct page *page)      \
-{ __clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
+{ hwpoison_safe(__clear_bit(PG_##lname, &policy(page, 1)->flags.f)); }
 
 #define TESTSETFLAG(uname, lname, policy)                              \
 FOLIO_TEST_SET_FLAG(lname, FOLIO_##policy)                             \
@@ -806,7 +839,7 @@ static inline bool PageUptodate(const struct page *page)
 static __always_inline void __folio_mark_uptodate(struct folio *folio)
 {
        smp_wmb();
-       __set_bit(PG_uptodate, folio_flags(folio, 0));
+       hwpoison_safe(__set_bit(PG_uptodate, folio_flags(folio, 0)));
 }
 
 static __always_inline void folio_mark_uptodate(struct folio *folio)
@@ -1169,7 +1202,7 @@ static __always_inline void 
__ClearPageAnonExclusive(struct page *page)
 {
        VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
        VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
-       __clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
+       hwpoison_safe(__clear_bit(PG_anon_exclusive, &PF_ANY(page, 
1)->flags.f));
 }
 
 #ifdef CONFIG_MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..da6a0747e4d3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3624,8 +3624,9 @@ static void __split_folio_to_order(struct folio *folio, 
int old_order,
                 * unreferenced sub-pages of an anonymous THP: we can simply 
drop
                 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
                 */
-               new_folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
-               new_folio->flags.f |= (folio->flags.f &
+               clear_page_flags_safe(&new_folio->page, 
PAGE_FLAGS_CHECK_AT_PREP);
+               set_page_flags_safe(&new_folio->page,
+                       folio->flags.f &
                                ((1L << PG_referenced) |
                                 (1L << PG_swapbacked) |
                                 (1L << PG_swapcache) |
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d4361309..9bc1ad5bffca 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -76,6 +76,44 @@ static int sysctl_enable_soft_offline __read_mostly = 1;
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
+/*
+ * Drain any in-flight non-atomic page flag operations that could
+ * clobber a concurrently set HWPoison bit.  Retries until the bit sticks.
+ */
+static void set_hwpoison_drain_rcu(struct page *p)
+{
+       do {
+               synchronize_rcu();
+       } while (!TestSetPageHWPoison(p));
+}
+
+/*
+ * Drain any in-flight non-atomic page flag operations that could
+ * restore the HWPoison bit from stale data.  Retries until it stays clear.
+ */
+static void clear_hwpoison_drain_rcu(struct page *p)
+{
+       do {
+               synchronize_rcu();
+       } while (TestClearPageHWPoison(p));
+}
+
+static bool test_and_set_hwpoison_drain_rcu(struct page *p)
+{
+       bool was_set = TestSetPageHWPoison(p);
+
+       set_hwpoison_drain_rcu(p);
+       return was_set;
+}
+
+static bool test_and_clear_hwpoison_drain_rcu(struct page *p)
+{
+       bool was_set = TestClearPageHWPoison(p);
+
+       clear_hwpoison_drain_rcu(p);
+       return was_set;
+}
+
 static bool hw_memory_failure __read_mostly = false;
 
 static DEFINE_MUTEX(mf_mutex);
@@ -2390,7 +2428,7 @@ int memory_failure(unsigned long pfn, int flags)
        if (hugetlb)
                goto unlock_mutex;
 
-       if (TestSetPageHWPoison(p)) {
+       if (test_and_set_hwpoison_drain_rcu(p)) {
                res = -EHWPOISON;
                if (flags & MF_ACTION_REQUIRED)
                        res = kill_accessing_process(current, pfn, flags);
@@ -2420,7 +2458,7 @@ int memory_failure(unsigned long pfn, int flags)
                        } else {
                                /* We lost the race, try again */
                                if (retry) {
-                                       ClearPageHWPoison(p);
+                                       clear_hwpoison_drain_rcu(p);
                                        retry = false;
                                        goto try_again;
                                }
@@ -2441,7 +2479,7 @@ int memory_failure(unsigned long pfn, int flags)
        /* filter pages that are protected from hwpoison test by users */
        folio_lock(folio);
        if (hwpoison_filter(p)) {
-               ClearPageHWPoison(p);
+               clear_hwpoison_drain_rcu(p);
                folio_unlock(folio);
                folio_put(folio);
                res = -EOPNOTSUPP;
@@ -2761,7 +2799,7 @@ int unpoison_memory(unsigned long pfn)
                }
 
                folio_put(folio);
-               if (TestClearPageHWPoison(p)) {
+               if (test_and_clear_hwpoison_drain_rcu(p)) {
                        folio_put(folio);
                        ret = 0;
                }
diff --git a/mm/memremap.c b/mm/memremap.c
index 053842d45cb1..c3949fdca5aa 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -494,7 +494,7 @@ void zone_device_page_init(struct page *page, struct 
dev_pagemap *pgmap,
                 * blindly clear bits which could have set my order field here,
                 * including page head.
                 */
-               new_page->flags.f &= ~0xffUL;   /* Clear possible order, page 
head */
+               clear_page_flags_safe(new_page, 0xffUL);        /* Clear 
possible order, page head */
 
 #ifdef NR_PAGES_IN_LARGE_FOLIO
                /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d49c254174da..1587acf431f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1359,7 +1359,7 @@ __always_inline bool __free_pages_prepare(struct page 
*page,
                int i;
 
                if (compound) {
-                       page[1].flags.f &= ~PAGE_FLAGS_SECOND;
+                       clear_page_flags_safe(&page[1], PAGE_FLAGS_SECOND);
 #ifdef NR_PAGES_IN_LARGE_FOLIO
                        folio->_nr_pages = 0;
 #endif
@@ -1373,7 +1373,7 @@ __always_inline bool __free_pages_prepare(struct page 
*page,
                                        continue;
                                }
                        }
-                       (page + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
+                       clear_page_flags_safe(page + i, 
PAGE_FLAGS_CHECK_AT_PREP);
                }
        }
        if (folio_test_anon(folio)) {
@@ -1392,7 +1392,7 @@ __always_inline bool __free_pages_prepare(struct page 
*page,
        }
 
        page_cpupid_reset_last(page);
-       page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
+       clear_page_flags_safe(page, PAGE_FLAGS_CHECK_AT_PREP);
        page->private = 0;
        reset_page_owner(page, order);
        page_table_check_free(page, order);
diff --git a/mm/slub.c b/mm/slub.c
index a2bf3756ca7d..2bfa7e3f8a84 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -617,7 +617,7 @@ static inline void slab_set_pfmemalloc(struct slab *slab)
 
 static inline void __slab_clear_pfmemalloc(struct slab *slab)
 {
-       __clear_bit(SL_pfmemalloc, &slab->flags.f);
+       hwpoison_safe(__clear_bit(SL_pfmemalloc, &slab->flags.f));
 }
 
 /*

Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock

Reply via email to