This check ensures the correctness of collapse read-only THPs for FSes
after READ_ONLY_THP_FOR_FS is enabled by default for all FSes supporting
PMD THP pagecache.

READ_ONLY_THP_FOR_FS only supports read-only fd and uses mapping->nr_thps
and inode->i_writecount to prevent any write to read-only to-be-collapsed
folios. In upcoming commits, READ_ONLY_THP_FOR_FS will be removed and the
aforementioned mechanism will go away too. To ensure khugepaged functions
as expected after the changes, skip if any folio is dirty after
try_to_unmap(), since a dirty folio means this read-only folio
got some writes via mmap can happen between try_to_unmap() and
try_to_unmap_flush() via cached TLB entries and khugepaged does not support
writable pagecache folio collapse yet.

Signed-off-by: Zi Yan <[email protected]>
---
 mm/khugepaged.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3eb5d982d3d3..1c0fdc81d276 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1979,8 +1979,7 @@ static enum scan_result collapse_file(struct mm_struct 
*mm, unsigned long addr,
                                }
                        } else if (folio_test_dirty(folio)) {
                                /*
-                                * khugepaged only works on read-only fd,
-                                * so this page is dirty because it hasn't
+                                * This page is dirty because it hasn't
                                 * been flushed since first write. There
                                 * won't be new dirty pages.
                                 *
@@ -2038,8 +2037,8 @@ static enum scan_result collapse_file(struct mm_struct 
*mm, unsigned long addr,
                if (!is_shmem && (folio_test_dirty(folio) ||
                                  folio_test_writeback(folio))) {
                        /*
-                        * khugepaged only works on read-only fd, so this
-                        * folio is dirty because it hasn't been flushed
+                        * khugepaged only works on clean file-backed folios,
+                        * so this folio is dirty because it hasn't been flushed
                         * since first write.
                         */
                        result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
@@ -2083,6 +2082,24 @@ static enum scan_result collapse_file(struct mm_struct 
*mm, unsigned long addr,
                        goto out_unlock;
                }
 
+               /*
+                * At this point, the folio is locked, unmapped. Make sure the
+                * folio is clean, so that no one else is able to write to it,
+                * since that would require taking the folio lock first.
+                * Otherwise that means the folio was pointed by a dirty PTE and
+                * some CPU might have a valid TLB entry with dirty bit set
+                * still pointing to this folio and writes can happen without
+                * causing a page table walk and folio lock acquisition before
+                * the try_to_unmap_flush() below is done. After the collapse,
+                * file-backed folio is not set as dirty and can be discarded
+                * before any new write marks the folio dirty, causing data
+                * corruption.
+                */
+               if (!is_shmem && folio_test_dirty(folio)) {
+                       result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
+                       goto out_unlock;
+               }
+
                /*
                 * Accumulate the folios that are being collapsed.
                 */
-- 
2.43.0


Reply via email to