Re: [PATCH v3] mm: fix race by making init_zero_pfn() early_initcall

2021-03-29 Thread Zhou Yanjie

Hi Ilya,

On 2021/3/30 12:42 PM, Ilya Lipnitskiy wrote:

There are code paths that rely on zero_pfn being fully initialized
before core_initcall. For example, wq_sysfs_init() is a core_initcall
function that eventually results in a call to kernel_execve(), which
causes a page fault with a subsequent mmput(). If zero_pfn is not
initialized by then, the mm may not get cleaned up properly, resulting
in an error:
   BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1

Here is an analysis of the race as seen on a MIPS device. On this
particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
initialized, at which point it becomes PFN 5120:
   1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:
[<80340dc8>] kobject_uevent_env+0x7e4/0x7ec
[<8033f8b8>] kset_register+0x68/0x88
[<803cf824>] bus_register+0xdc/0x34c
[<803cfac8>] subsys_virtual_register+0x34/0x78
[<8086afb0>] wq_sysfs_init+0x1c/0x4c
[<80001648>] do_one_initcall+0x50/0x1a8
[<8086503c>] kernel_init_freeable+0x230/0x2c8
[<8066bca0>] kernel_init+0x10/0x100
[<80003038>] ret_from_kernel_thread+0x14/0x1c

   2. kobject_uevent_env() calls call_usermodehelper_exec() which executes
  kernel_execve asynchronously.

   3. Memory allocations in kernel_execve cause a page fault, bumping the
  MM_ANONPAGES RSS counter:
[<8015adb4>] add_mm_counter_fast+0xb4/0xc0
[<80160d58>] handle_mm_fault+0x6e4/0xea0
[<80158aa4>] __get_user_pages.part.78+0x190/0x37c
[<8015992c>] __get_user_pages_remote+0x128/0x360
[<801a6d9c>] get_arg_page+0x34/0xa0
[<801a7394>] copy_string_kernel+0x194/0x2a4
[<801a880c>] kernel_execve+0x11c/0x298
[<800420f4>] call_usermodehelper_exec_async+0x114/0x194

   4. If zero_pfn has not been initialized yet, zap_pte_range does not
  decrement the MM_ANONPAGES RSS counter, and the BUG message is
  triggered shortly afterwards when __mmdrop checks the RSS counters:
[<800285e8>] __mmdrop+0x98/0x1d0
[<801a6de8>] free_bprm+0x44/0x118
[<801a86a8>] kernel_execve+0x160/0x1d8
[<800420f4>] call_usermodehelper_exec_async+0x114/0x194
[<80003198>] ret_from_kernel_thread+0x14/0x1c

To avoid races such as the one described above, run init_zero_pfn() at
early_initcall level. Depending on the architecture, ZERO_PAGE is either
constant or gets initialized even earlier, at paging_init, so there is
no issue with initializing zero_pfn earlier.

Discussion: 
https://lkml.kernel.org/r/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niaxopqeb...@mail.gmail.com

Signed-off-by: Ilya Lipnitskiy 
Cc: Hugh Dickins 
Cc: "Eric W. Biederman" 
Cc: sta...@vger.kernel.org
---
  mm/memory.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)



Tested-by: 周琰杰 (Zhou Yanjie) # on CU1000-Neo/X1000E and CU1830-Neo/X1830




diff --git a/mm/memory.c b/mm/memory.c
index 5c3b29d3af66..e66b11ac1659 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -166,7 +166,7 @@ static int __init init_zero_pfn(void)
  	zero_pfn = page_to_pfn(ZERO_PAGE(0));
  	return 0;
  }
-core_initcall(init_zero_pfn);
+early_initcall(init_zero_pfn);
  
  void mm_trace_rss_stat(struct mm_struct *mm, int member, long count)
  {


[PATCH v3] mm: fix race by making init_zero_pfn() early_initcall

2021-03-29 Thread Ilya Lipnitskiy
There are code paths that rely on zero_pfn being fully initialized
before core_initcall. For example, wq_sysfs_init() is a core_initcall
function that eventually results in a call to kernel_execve(), which
causes a page fault with a subsequent mmput(). If zero_pfn is not
initialized by then, the mm may not get cleaned up properly, resulting
in an error:
  BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1

Here is an analysis of the race as seen on a MIPS device. On this
particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
initialized, at which point it becomes PFN 5120:
  1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:
   [<80340dc8>] kobject_uevent_env+0x7e4/0x7ec
   [<8033f8b8>] kset_register+0x68/0x88
   [<803cf824>] bus_register+0xdc/0x34c
   [<803cfac8>] subsys_virtual_register+0x34/0x78
   [<8086afb0>] wq_sysfs_init+0x1c/0x4c
   [<80001648>] do_one_initcall+0x50/0x1a8
   [<8086503c>] kernel_init_freeable+0x230/0x2c8
   [<8066bca0>] kernel_init+0x10/0x100
   [<80003038>] ret_from_kernel_thread+0x14/0x1c

  2. kobject_uevent_env() calls call_usermodehelper_exec() which executes
 kernel_execve asynchronously.

  3. Memory allocations in kernel_execve cause a page fault, bumping the
 MM_ANONPAGES RSS counter:
   [<8015adb4>] add_mm_counter_fast+0xb4/0xc0
   [<80160d58>] handle_mm_fault+0x6e4/0xea0
   [<80158aa4>] __get_user_pages.part.78+0x190/0x37c
   [<8015992c>] __get_user_pages_remote+0x128/0x360
   [<801a6d9c>] get_arg_page+0x34/0xa0
   [<801a7394>] copy_string_kernel+0x194/0x2a4
   [<801a880c>] kernel_execve+0x11c/0x298
   [<800420f4>] call_usermodehelper_exec_async+0x114/0x194

  4. If zero_pfn has not been initialized yet, zap_pte_range does not
 decrement the MM_ANONPAGES RSS counter, and the BUG message is
 triggered shortly afterwards when __mmdrop checks the RSS counters:
   [<800285e8>] __mmdrop+0x98/0x1d0
   [<801a6de8>] free_bprm+0x44/0x118
   [<801a86a8>] kernel_execve+0x160/0x1d8
   [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
   [<80003198>] ret_from_kernel_thread+0x14/0x1c

To avoid races such as the one described above, run init_zero_pfn() at
early_initcall level. Depending on the architecture, ZERO_PAGE is either
constant or gets initialized even earlier, at paging_init, so there is
no issue with initializing zero_pfn earlier.

Discussion: 
https://lkml.kernel.org/r/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niaxopqeb...@mail.gmail.com

Signed-off-by: Ilya Lipnitskiy 
Cc: Hugh Dickins 
Cc: "Eric W. Biederman" 
Cc: sta...@vger.kernel.org
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5c3b29d3af66..e66b11ac1659 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -166,7 +166,7 @@ static int __init init_zero_pfn(void)
 	zero_pfn = page_to_pfn(ZERO_PAGE(0));
 	return 0;
 }
-core_initcall(init_zero_pfn);
+early_initcall(init_zero_pfn);
 
 void mm_trace_rss_stat(struct mm_struct *mm, int member, long count)
 {
-- 
2.31.0