On 09.04.25 22:08, Dan Williams wrote:
David Hildenbrand wrote:
[..]
Maybe there is something missing in ZONE_DEVICE freeing/splitting code
of large folios, where we should do the same, to make sure that all
page->memcg_data is actually 0?
I assume so. Let me dig.
I suspect this should do the trick:
diff --git a/fs/dax.c b/fs/dax.c
index af5045b0f476e..8dffffef70d21 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -397,6 +397,10 @@ static inline unsigned long dax_folio_put(struct folio
*folio)
if (!order)
return 0;
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+ folio->_nr_pages = 0;
+#endif
I assume this new fs/dax.c instance of this pattern motivates a
folio_set_nr_pages() helper to hide the ifdef?
Hm, not sure. We do have folio_set_order() but we WARN on order=0" for
good reasons. ... and having folio_set_nr_pages() that doesn't set the
order is also weird ...
In the THP case we handle it now by propagating the folio->memcg_data to
all new_folio->memcg_data.
Maybe we should simply allow setting order=0 for folio_set_order(),
adding a comment that it is for reset-before split.
Let me think about that.
While it is concerning that fs/dax.c misses common expectations like
this, but I think that is the nature of bypassing the page allocator to
get folios().
It was a bit unfortunate that Alistair's work and my work went into
mm-unstable and upstream shortly after each other.
However, raises the question if fixing it here is sufficient for other
ZONE_DEVICE folio cases. I did not immediately find a place where other
ZONE_DEVICE users might be calling prep_compound_page() and leaving
stale tail page metadata lying around. Alistair?
We only have to consider this when splitting folios (putting buddy
freeing aside). clear_compound_head() is what to search for.
We don't need it in mm/hugetlb.c because we'll only demote large folios
to smaller-large folios and effectively reset the order/nr_pages for all
involved folios.
Let me send an official patch tomorrow; maybe Alison can comment until
then if that fixes the issue.
diff --git a/fs/dax.c b/fs/dax.c
index af5045b0f476e..a1e354b748522 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -412,6 +412,9 @@ static inline unsigned long dax_folio_put(struct folio
*folio)
*/
new_folio->pgmap = pgmap;
new_folio->share = 0;
+#ifdef CONFIG_MEMCG
+ new_folio->memcg_data = 0;
+#endif
This looks correct, but I like the first option because I would never
expect a dax-page to need to worry about being part of a memcg.
Right.
--
Cheers,
David / dhildenb