Re: [RFC PATCH v3 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Thu May 16, 2024 at 1:29 AM EEST, Haitao Huang wrote: > On Wed, 15 May 2024 16:55:59 -0500, Haitao Huang > wrote: > > > On Wed, 15 May 2024 01:55:21 -0500, Bojun Zhu > > wrote: > > > >> EDMM's ioctl()s support batch operations, which may be > >> time-consuming. Try to explicitly give up the CPU as the prefix > >> operation at the every begin of "for loop" in > >> sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > >> to give other tasks a chance to run, and avoid softlockup warning. > >> > >> Additionally perform pending signals check as the prefix operation, > >> and introduce sgx_check_signal_and_resched(), > >> which wraps all the checks. > >> > >> The following has been observed on Linux v6.9-rc5 with kernel > >> preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > >> is requested to restrict page permissions of a large number of EPC > >> pages. > >> > >> [ cut here ] > >> watchdog: BUG: soft lockup - CPU#45 stuck for 22s! > >> ... > >> RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > >> ... > >> Call Trace: > >> sgx_ioctl > >> __x64_sys_ioctl > >> x64_sys_call > >> do_syscall_64 > >> entry_SYSCALL_64_after_hwframe > >> [ end trace ] > >> > >> Signed-off-by: Bojun Zhu > >> --- > >> arch/x86/kernel/cpu/sgx/ioctl.c | 40 +++-- > >> 1 file changed, 28 insertions(+), 12 deletions(-) > >> > >> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c > >> b/arch/x86/kernel/cpu/sgx/ioctl.c > >> index b65ab214bdf5..6199f483143e 100644 > >> --- a/arch/x86/kernel/cpu/sgx/ioctl.c > >> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > >> @@ -365,6 +365,20 @@ static int sgx_validate_offset_length(struct > >> sgx_encl *encl, > >>return 0; > >> } > >> +/* > >> + * Check signals and invoke scheduler. Return true for a pending > >> signal. > >> + */ > >> +static bool sgx_check_signal_and_resched(void) > >> +{ > >> + if (signal_pending(current)) > >> + return true; > >> + > >> + if (need_resched()) > >> + cond_resched(); > >> + > >> + return false; > >> +} > >> + > >> /** > >> * sgx_ioc_enclave_add_pages() - The handler for > >> %SGX_IOC_ENCLAVE_ADD_PAGES > >> * @encl: an enclave pointer > >> @@ -409,7 +423,7 @@ static long sgx_ioc_enclave_add_pages(struct > >> sgx_encl *encl, void __user *arg) > >>struct sgx_enclave_add_pages add_arg; > >>struct sgx_secinfo secinfo; > >>unsigned long c; > >> - int ret; > >> + int ret = -ERESTARTSYS; > >>if (!test_bit(SGX_ENCL_CREATED, >flags) || > >>test_bit(SGX_ENCL_INITIALIZED, >flags)) > >> @@ -432,15 +446,8 @@ static long sgx_ioc_enclave_add_pages(struct > >> sgx_encl *encl, void __user *arg) > >>return -EINVAL; > >>for (c = 0 ; c < add_arg.length; c += PAGE_SIZE) { > >> - if (signal_pending(current)) { > >> - if (!c) > >> - ret = -ERESTARTSYS; > >> - > >> + if (sgx_check_signal_and_resched()) > >>break; > >> - } > > > > ERESTARTSYS is only appropriate if we have not EADDed any pages yet. > > If we got interrupted in the middle, we should return 0. User space > > would check the 'count' returned and decide to recall this ioctl() with > > 'offset' reset to the next page, and adjust length. > > NVM, I misread it. ret will be changed to zero in subsequent iteration. > > Reviewed-by: Haitao Huang Duh, and I responded too quickly. OK, I revisited the original patch and yes ret gets reseted. Ignore my previous response ;-) My tags still hold, sorry. BR, Jarkko
Re: [RFC PATCH v3 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Thu May 16, 2024 at 12:55 AM EEST, Haitao Huang wrote: > On Wed, 15 May 2024 01:55:21 -0500, Bojun Zhu > wrote: > > > EDMM's ioctl()s support batch operations, which may be > > time-consuming. Try to explicitly give up the CPU as the prefix > > operation at the every begin of "for loop" in > > sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > > to give other tasks a chance to run, and avoid softlockup warning. > > > > Additionally perform pending signals check as the prefix operation, > > and introduce sgx_check_signal_and_resched(), > > which wraps all the checks. > > > > The following has been observed on Linux v6.9-rc5 with kernel > > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > > is requested to restrict page permissions of a large number of EPC pages. > > > > [ cut here ] > > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! > > ... > > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > > ... > > Call Trace: > > sgx_ioctl > > __x64_sys_ioctl > > x64_sys_call > > do_syscall_64 > > entry_SYSCALL_64_after_hwframe > > [ end trace ] > > > > Signed-off-by: Bojun Zhu > > --- > > arch/x86/kernel/cpu/sgx/ioctl.c | 40 +++-- > > 1 file changed, 28 insertions(+), 12 deletions(-) > > > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c > > b/arch/x86/kernel/cpu/sgx/ioctl.c > > index b65ab214bdf5..6199f483143e 100644 > > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > > @@ -365,6 +365,20 @@ static int sgx_validate_offset_length(struct > > sgx_encl *encl, > > return 0; > > } > > +/* > > + * Check signals and invoke scheduler. Return true for a pending signal. > > + */ > > +static bool sgx_check_signal_and_resched(void) > > +{ > > + if (signal_pending(current)) > > + return true; > > + > > + if (need_resched()) > > + cond_resched(); > > + > > + return false; > > +} > > + > > /** > > * sgx_ioc_enclave_add_pages() - The handler for > > %SGX_IOC_ENCLAVE_ADD_PAGES > > * @encl: an enclave pointer > > @@ -409,7 +423,7 @@ static long sgx_ioc_enclave_add_pages(struct > > sgx_encl *encl, void __user *arg) > > struct sgx_enclave_add_pages add_arg; > > struct sgx_secinfo secinfo; > > unsigned long c; > > - int ret; > > + int ret = -ERESTARTSYS; > > if (!test_bit(SGX_ENCL_CREATED, >flags) || > > test_bit(SGX_ENCL_INITIALIZED, >flags)) > > @@ -432,15 +446,8 @@ static long sgx_ioc_enclave_add_pages(struct > > sgx_encl *encl, void __user *arg) > > return -EINVAL; > > for (c = 0 ; c < add_arg.length; c += PAGE_SIZE) { > > - if (signal_pending(current)) { > > - if (!c) > > - ret = -ERESTARTSYS; > > - > > + if (sgx_check_signal_and_resched()) > > break; > > - } > > ERESTARTSYS is only appropriate if we have not EADDed any pages yet. > If we got interrupted in the middle, we should return 0. User space would > check the 'count' returned and decide to recall this ioctl() with > 'offset' reset to the next page, and adjust length. Good catch! Thanks Haitao for the remark. BR, Jarkko
Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
> Thank you very much. I understand the changelog is still being discussed > and those changes look good to me, to which you can add: > > Reviewed-by: Reinette Chatre also for this (with changelog tweak Dave suggested) so that we don't need a new round: Reviewed-by: Jarkko Sakkinen BR, Jarkko
Re: [PATCH v2 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race
644 > --- a/arch/x86/kernel/cpu/sgx/encl.c > +++ b/arch/x86/kernel/cpu/sgx/encl.c > @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct > sgx_encl *encl, > > /* Entry successfully located. */ > if (entry->epc_page) { > - if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED) > + if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED | > +SGX_ENCL_PAGE_BEING_REMOVED)) > return ERR_PTR(-EBUSY); > > return entry; > diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h > index f94ff14c9486..fff5f2293ae7 100644 > --- a/arch/x86/kernel/cpu/sgx/encl.h > +++ b/arch/x86/kernel/cpu/sgx/encl.h > @@ -25,6 +25,9 @@ > /* 'desc' bit marking that the page is being reclaimed. */ > #define SGX_ENCL_PAGE_BEING_RECLAIMEDBIT(3) > > +/* 'desc' bit marking that the page is being removed. */ > +#define SGX_ENCL_PAGE_BEING_REMOVED BIT(2) > + > struct sgx_encl_page { > unsigned long desc; > unsigned long vm_max_prot_bits:8; > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c > index b65ab214bdf5..c542d4dd3e64 100644 > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl, >* Do not keep encl->lock because of dependency on >* mmap_lock acquired in sgx_zap_enclave_ptes(). >*/ > + entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED; > mutex_unlock(>lock); > > sgx_zap_enclave_ptes(encl, addr); Makes perfect sense: Reviewed-by: Jarkko Sakkinen BR, Jarkko
Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
On Wed May 15, 2024 at 5:15 PM EEST, Dave Hansen wrote: > On 5/15/24 06:54, Jarkko Sakkinen wrote: > > I'd cut out 90% of the description out and just make the argument of > > the wrong error code, and done. The sequence is great for showing > > how this could happen. The prose makes my head hurt tbh. > > The changelog is too long, but not fatally so. I'd much rather have a > super verbose description than something super sparse. > > Would something like this make more sense to folks? > > Imagine an mmap()'d file. Two threads touch the same address at > the same time and fault. Both allocate a physical page and race > to install a PTE for that page. Only one will win the race. The > loser frees its page, but still continues handling the fault as > a success and returns VM_FAULT_NOPAGE from the fault handler. > > The same race can happen with SGX. But there's a bug: the loser > in the SGX steers into a failure path. The loser EREMOVE's the > winner's EPC page, then returns SIGBUS, likely killing the app. > > Fix the SGX loser's behavior. Change the return code to > VM_FAULT_NOPAGE to avoid SIGBUS and call sgx_free_epc_page() > which avoids EREMOVE'ing the winner's page and only frees the > page that the loser allocated. Yes! I did read the whole thing. My comment was only related to the chain of maintainers who also have to deal with this patch eventually. BR, Jarkko
Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
On Wed May 15, 2024 at 4:54 PM EEST, Jarkko Sakkinen wrote: > On Wed May 15, 2024 at 4:12 PM EEST, Dmitrii Kuvaiskii wrote: > > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c > > index 279148e72459..41f14b1a3025 100644 > > --- a/arch/x86/kernel/cpu/sgx/encl.c > > +++ b/arch/x86/kernel/cpu/sgx/encl.c > > @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct > > vm_area_struct *vma, > > * If ret == -EBUSY then page was created in another flow while > > * running without encl->lock > > */ > > - if (ret) > > + if (ret) { > > + if (ret == -EBUSY) > > + vmret = VM_FAULT_NOPAGE; > > goto err_out_shrink; > > + } > > I agree that there is a bug but it does not categorize as race > condition. > > The bug is simply that for a valid page SIGBUS might be returned. > The fix is correct but the claim is not. > > > > > pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page); > > pginfo.addr = encl_page->desc & PAGE_MASK; > > @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct > > vm_area_struct *vma, > > err_out_shrink: > > sgx_encl_shrink(encl, va_page); > > err_out_epc: > > - sgx_encl_free_epc_page(epc_page); > > + sgx_free_epc_page(epc_page); > > err_out_unlock: > > mutex_unlock(>lock); > > kfree(encl_page); > > Agree with code change 100% but not with the description. > > I'd cut out 90% of the description out and just make the argument of > the wrong error code, and done. The sequence is great for showing > how this could happen. The prose makes my head hurt tbh. Also please remember that stable maintainers need to read all of that if this is a bug fix (it is a bug fix!) :-) So shorted possible legit argument, no prose and the sequence was awesome :-) BR, Jarkko
Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
On Wed May 15, 2024 at 4:12 PM EEST, Dmitrii Kuvaiskii wrote: > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c > index 279148e72459..41f14b1a3025 100644 > --- a/arch/x86/kernel/cpu/sgx/encl.c > +++ b/arch/x86/kernel/cpu/sgx/encl.c > @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct > vm_area_struct *vma, >* If ret == -EBUSY then page was created in another flow while >* running without encl->lock >*/ > - if (ret) > + if (ret) { > + if (ret == -EBUSY) > + vmret = VM_FAULT_NOPAGE; > goto err_out_shrink; > + } I agree that there is a bug but it does not categorize as race condition. The bug is simply that for a valid page SIGBUS might be returned. The fix is correct but the claim is not. > > pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page); > pginfo.addr = encl_page->desc & PAGE_MASK; > @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct > vm_area_struct *vma, > err_out_shrink: > sgx_encl_shrink(encl, va_page); > err_out_epc: > - sgx_encl_free_epc_page(epc_page); > + sgx_free_epc_page(epc_page); > err_out_unlock: > mutex_unlock(>lock); > kfree(encl_page); Agree with code change 100% but not with the description. I'd cut out 90% of the description out and just make the argument of the wrong error code, and done. The sequence is great for showing how this could happen. The prose makes my head hurt tbh. BR, Jarkko
Re: [RFC PATCH v3 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Wed May 15, 2024 at 9:55 AM EEST, Bojun Zhu wrote: > EDMM's ioctl()s support batch operations, which may be > time-consuming. Try to explicitly give up the CPU as the prefix > operation at the every begin of "for loop" in > sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > to give other tasks a chance to run, and avoid softlockup warning. > > Additionally perform pending signals check as the prefix operation, > and introduce sgx_check_signal_and_resched(), > which wraps all the checks. > > The following has been observed on Linux v6.9-rc5 with kernel > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > is requested to restrict page permissions of a large number of EPC pages. > > [ cut here ] > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! > ... > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > ... > Call Trace: > sgx_ioctl > __x64_sys_ioctl > x64_sys_call > do_syscall_64 > entry_SYSCALL_64_after_hwframe > [ end trace ] > Suggested-by: Jarkko Sakkinen > Signed-off-by: Bojun Zhu > --- > arch/x86/kernel/cpu/sgx/ioctl.c | 40 +++-- > 1 file changed, 28 insertions(+), 12 deletions(-) > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c > index b65ab214bdf5..6199f483143e 100644 > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > @@ -365,6 +365,20 @@ static int sgx_validate_offset_length(struct sgx_encl > *encl, > return 0; > } > > +/* > + * Check signals and invoke scheduler. Return true for a pending signal. > + */ > +static bool sgx_check_signal_and_resched(void) > +{ > + if (signal_pending(current)) > + return true; > + > + if (need_resched()) > + cond_resched(); > + > + return false; > +} > + > /** > * sgx_ioc_enclave_add_pages() - The handler for %SGX_IOC_ENCLAVE_ADD_PAGES > * @encl: an enclave pointer > @@ -409,7 +423,7 @@ static long sgx_ioc_enclave_add_pages(struct sgx_encl > *encl, void __user *arg) > struct sgx_enclave_add_pages add_arg; > struct sgx_secinfo secinfo; > unsigned long c; > - int ret; > + int ret = -ERESTARTSYS; > > if (!test_bit(SGX_ENCL_CREATED, >flags) || > test_bit(SGX_ENCL_INITIALIZED, >flags)) > @@ -432,15 +446,8 @@ static long sgx_ioc_enclave_add_pages(struct sgx_encl > *encl, void __user *arg) > return -EINVAL; > > for (c = 0 ; c < add_arg.length; c += PAGE_SIZE) { > - if (signal_pending(current)) { > - if (!c) > - ret = -ERESTARTSYS; > - > + if (sgx_check_signal_and_resched()) > break; > - } > - > - if (need_resched()) > - cond_resched(); > > ret = sgx_encl_add_page(encl, add_arg.src + c, add_arg.offset + > c, > , add_arg.flags); > @@ -740,12 +747,15 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > unsigned long addr; > unsigned long c; > void *epc_virt; > - int ret; > + int ret = -ERESTARTSYS; > > memset(, 0, sizeof(secinfo)); > secinfo.flags = modp->permissions & SGX_SECINFO_PERMISSION_MASK; > > for (c = 0 ; c < modp->length; c += PAGE_SIZE) { > + if (sgx_check_signal_and_resched()) > + goto out; > + > addr = encl->base + modp->offset + c; > > sgx_reclaim_direct(); > @@ -898,7 +908,7 @@ static long sgx_enclave_modify_types(struct sgx_encl > *encl, > unsigned long addr; > unsigned long c; > void *epc_virt; > - int ret; > + int ret = -ERESTARTSYS; > > page_type = modt->page_type & SGX_PAGE_TYPE_MASK; > > @@ -913,6 +923,9 @@ static long sgx_enclave_modify_types(struct sgx_encl > *encl, > secinfo.flags = page_type << 8; > > for (c = 0 ; c < modt->length; c += PAGE_SIZE) { > + if (sgx_check_signal_and_resched()) > + goto out; > + > addr = encl->base + modt->offset + c; > > sgx_reclaim_direct(); > @@ -1095,12 +1108,15 @@ static long sgx_encl_remove_pages(struct sgx_encl > *encl, > unsigned long addr; > unsigned long c; > void *epc_virt; > - int ret; > + int ret = -ERESTARTSYS; > > memset(, 0, sizeof(secinfo)); > secinfo.flags = SGX_SECINFO_R | SGX_SECINFO_W | SGX_SECINFO_X; > > for (c = 0 ; c < params->length; c += PAGE_SIZE) { > + if (sgx_check_signal_and_resched()) > + goto out; > + > addr = encl->base + params->offset + c; > > sgx_reclaim_direct(); Reviewed-by: Jarkko Sakkinen BR, Jarkko
Re: [PATCH] mm: Remove mm argument from mm_get_unmapped_area()
+++ b/mm/mmap.c > @@ -1901,16 +1901,15 @@ __get_unmapped_area(struct file *file, unsigned long > addr, unsigned long len, > return error ? error : addr; > } > > -unsigned long > -mm_get_unmapped_area(struct mm_struct *mm, struct file *file, > - unsigned long addr, unsigned long len, > - unsigned long pgoff, unsigned long flags) > +unsigned long current_get_unmapped_area(struct file *file, unsigned long > addr, > + unsigned long len, unsigned long pgoff, > + unsigned long flags) > { > - if (test_bit(MMF_TOPDOWN, >flags)) > + if (test_bit(MMF_TOPDOWN, >mm->flags)) > return arch_get_unmapped_area_topdown(file, addr, len, pgoff, > flags); > return arch_get_unmapped_area(file, addr, len, pgoff, flags); > } > -EXPORT_SYMBOL(mm_get_unmapped_area); > +EXPORT_SYMBOL(current_get_unmapped_area); > > /** > * find_vma_intersection() - Look up the first VMA which intersects the > interval > diff --git a/mm/shmem.c b/mm/shmem.c > index f5d60436b604..c0acd7db93c8 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -2276,8 +2276,7 @@ unsigned long shmem_get_unmapped_area(struct file *file, > if (len > TASK_SIZE) > return -ENOMEM; > > - addr = mm_get_unmapped_area(current->mm, file, uaddr, len, pgoff, > - flags); > + addr = current_get_unmapped_area(file, uaddr, len, pgoff, flags); > > if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) > return addr; > @@ -2334,8 +2333,8 @@ unsigned long shmem_get_unmapped_area(struct file *file, > if (inflated_len < len) > return addr; > > - inflated_addr = mm_get_unmapped_area(current->mm, NULL, uaddr, > - inflated_len, 0, flags); > + inflated_addr = current_get_unmapped_area(NULL, uaddr, > + inflated_len, 0, flags); > if (IS_ERR_VALUE(inflated_addr)) > return addr; > if (inflated_addr & ~PAGE_MASK) > @@ -4799,7 +4798,7 @@ unsigned long shmem_get_unmapped_area(struct file *file, > unsigned long addr, unsigned long len, > unsigned long pgoff, unsigned long flags) > { > - return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags); > + return current_get_unmapped_area(file, addr, len, pgoff, flags); > } > #endif > > > base-commit: 9221b2819b8a4196eecf5476d66201be60fbcf29 Reviewed-by: Jarkko Sakkinen BR, Jarkko
Re: [PATCH v13 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Tue Apr 30, 2024 at 10:51 PM EEST, Haitao Huang wrote: > With different cgroups, the script starts one or multiple concurrent SGX > selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed > test case, which loads an enclave of EPC size equal to the EPC capacity > available on the platform. The script checks results against the > expectation set for each cgroup and reports success or failure. > > The script creates 3 different cgroups at the beginning with following > expectations: > > 1) small - intentionally small enough to fail the test loading an > enclave of size equal to the capacity. > 2) large - large enough to run up to 4 concurrent tests but fail some if > more than 4 concurrent tests are run. The script starts 4 expecting at > least one test to pass, and then starts 5 expecting at least one test > to fail. > 3) larger - limit is the same as the capacity, large enough to run lots of > concurrent tests. The script starts 8 of them and expects all pass. > Then it reruns the same test with one process randomly killed and > usage checked to be zero after all processes exit. > > The script also includes a test with low mem_cg limit and large sgx_epc > limit to verify that the RAM used for per-cgroup reclamation is charged > to a proper mem_cg. For this test, it turns off swapping before start, > and turns swapping back on afterwards. > > Add README to document how to run the tests. > > Signed-off-by: Haitao Huang Here's the transcript: make: Entering directory '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx' gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC -c main.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC -c load.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC -c sigstruct.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC -c call.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC -c sign_key.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_sgx /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o -z noexecstack -lcrypto gcc -Wall -Werror -static-pie -nostdlib -ffreestanding -fPIE -fno-stack-protector -mrdrnd -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include test_encl.c test_encl_bootstrap.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_encl.elf -Wl,-T,test_encl.lds,--build-id=none /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: warning: /tmp/ccToNCLw.o: missing .note.GNU-stack section implies executable stack /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker TAP version 13 1..2 # timeout set to 120 # selftests: sgx: test_sgx # TAP version 13 # 1..16 # # Starting 16 tests from 1 test cases. # # RUN enclave.unclobbered_vdso ... # #OK enclave.unclobbered_vdso # ok 1 enclave.unclobbered_vdso # # RUN enclave.unclobbered_vdso_oversubscribed ... # #OK enclave.unclobbered_vdso_oversubscribed # ok 2 enclave.unclobbered_vdso_oversubscribed # # RUN enclave.unclobbered_vdso_oversubscribed_remove ... # # main.c:402:unclobbered_vdso_oversubscribed_remove:Creating an enclave with 98566144 bytes heap may take a while ... # # main.c:457:unclobbered_vdso_oversubscribed_remove:Changing type of 98566144 bytes to trimmed may take a while ... # # main.c:473:unclobbered_vdso_oversubscribed_remove:Entering enclave to run EACCEPT for each page of 98566144 bytes may take a while ... # # main.c:494:unclobbered_vdso_oversubscribed_remove:Removing 98566144 bytes from enclave may take a while ... # #OK enclave.unclobbered_vdso_oversubscribed_remove # ok 3 enclave.unclobbered_vdso_oversubscribed_remove # # RUN enclave.clobbered_vdso ... # #OK enclave.clobbered_vdso # ok 4 enclave.clobbered_vdso # # RUN enclave.clobbered_vdso_and_user_function ... # #OK enclave.clobbered_vdso_and_user_function # ok 5
Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Mon Apr 29, 2024 at 7:18 PM EEST, Haitao Huang wrote: > Hi Jarkko > > On Sun, 28 Apr 2024 17:03:17 -0500, Jarkko Sakkinen > wrote: > > > On Fri Apr 26, 2024 at 5:28 PM EEST, Dave Hansen wrote: > >> On 4/16/24 07:15, Jarkko Sakkinen wrote: > >> > On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote: > >> > Yes, exactly. I'd take one week break and cycle the kselftest part > >> > internally a bit as I said my previous response. I'm sure that there > >> > is experise inside Intel how to implement it properly. I.e. take some > >> > time to find the right person, and wait as long as that person has a > >> > bit of bandwidth to go through the test and suggest modifications. > >> > >> Folks, I worry that this series is getting bogged down in the selftests. > >> Yes, selftests are important. But getting _some_ tests in the kernel > >> is substantially more important than getting perfect tests. > >> > >> I don't think Haitao needs to "cycle" this back inside Intel. > > > > The problem with the tests was that they are hard to run anything else > > than Ubuntu (and perhaps Debian). It is hopefully now taken care of. > > Selftests do not have to be perfect but at minimum they need to be > > runnable. > > > > I need ret-test the latest series because it is possible that I did not > > have right flags (I was travelling few days thus have not done it yet). > > > > BR, Jarkko > > > > Let me know if you want me to send v13 before testing or you can just use > the sgx_cg_upstream_v12_plus branch in my repo. > > Also thanks for the "Reviewed-by" tags for other patches. But I've not got > "Reviewed-by" from you for patches #8-12 (not sure I missed). Could you go > through those alsoe when you get chance? So, I compiled v12 branch. Was the only difference in selftests? I can just copy them to the device. BR, Jarkko
Re: [PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race
On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote: > Two enclave threads may try to add and remove the same enclave page > simultaneously (e.g., if the SGX runtime supports both lazy allocation > and `MADV_DONTNEED` semantics). Consider this race: > > 1. T1 performs page removal in sgx_encl_remove_pages() and stops right >after removing the page table entry and right before re-acquiring the >enclave lock to EREMOVE and xa_erase(>page_array) the page. > 2. T2 tries to access the page, and #PF[not_present] is raised. The >condition to EAUG in sgx_vma_fault() is not satisfied because the >page is still present in encl->page_array, thus the SGX driver >assumes that the fault happened because the page was swapped out. The >driver continues on a code path that installs a page table entry >*without* performing EAUG. > 3. The enclave page metadata is in inconsistent state: the PTE is >installed but there was no EAUG. Thus, T2 in userspace infinitely >receives SIGSEGV on this page (and EACCEPT always fails). > > Fix this by making sure that T1 (the page-removing thread) always wins > this data race. In particular, the page-being-removed is marked as such, > and T2 retries until the page is fully removed. > > Fixes: 9849bb27152c ("x86/sgx: Support complete page removal") > Cc: sta...@vger.kernel.org > Signed-off-by: Dmitrii Kuvaiskii > --- > arch/x86/kernel/cpu/sgx/encl.c | 3 ++- > arch/x86/kernel/cpu/sgx/encl.h | 3 +++ > arch/x86/kernel/cpu/sgx/ioctl.c | 1 + > 3 files changed, 6 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c > index 41f14b1a3025..7ccd8b2fce5f 100644 > --- a/arch/x86/kernel/cpu/sgx/encl.c > +++ b/arch/x86/kernel/cpu/sgx/encl.c > @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct > sgx_encl *encl, > > /* Entry successfully located. */ > if (entry->epc_page) { > - if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED) > + if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED | > +SGX_ENCL_PAGE_BEING_REMOVED)) > return ERR_PTR(-EBUSY); > > return entry; > diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h > index f94ff14c9486..fff5f2293ae7 100644 > --- a/arch/x86/kernel/cpu/sgx/encl.h > +++ b/arch/x86/kernel/cpu/sgx/encl.h > @@ -25,6 +25,9 @@ > /* 'desc' bit marking that the page is being reclaimed. */ > #define SGX_ENCL_PAGE_BEING_RECLAIMEDBIT(3) > > +/* 'desc' bit marking that the page is being removed. */ > +#define SGX_ENCL_PAGE_BEING_REMOVED BIT(2) > + > struct sgx_encl_page { > unsigned long desc; > unsigned long vm_max_prot_bits:8; > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c > index b65ab214bdf5..c542d4dd3e64 100644 > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl, >* Do not keep encl->lock because of dependency on >* mmap_lock acquired in sgx_zap_enclave_ptes(). >*/ > + entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED; > mutex_unlock(>lock); > > sgx_zap_enclave_ptes(encl, addr); It is somewhat trivial to NAK this as the commit message does not do any effort describing the new flag. By default at least I have strong opposition against any new flags related to reclaiming even if it needs a bit of extra synchronization work in the user space. One way to describe concurrency scenarios would be to take example from https://www.kernel.org/doc/Documentation/memory-barriers.txt I.e. see the examples with CPU 1 and CPU 2. BR, Jarkko
Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
On Mon Apr 29, 2024 at 4:22 PM EEST, Jarkko Sakkinen wrote: > On Mon Apr 29, 2024 at 4:04 PM EEST, Jarkko Sakkinen wrote: > > > Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux > > > fault handler so that no signal is sent to userspace, and (2) by > > > replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no > > > EREMOVE is performed. > > > > What is the collateral damage caused by ENCLS[EREMOVE]? > > Have you measured cost of eremove on an empty page? > > I tried to lookup for a thread from lore because I have a faint memory > that it was concluded that its cost irrelevant. Please correct if I'm > wrong. Also pseudocode for EREMOVE supports this as it just returns without actually doing anything. BR, Jarkko
Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
On Mon Apr 29, 2024 at 4:04 PM EEST, Jarkko Sakkinen wrote: > > Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux > > fault handler so that no signal is sent to userspace, and (2) by > > replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no > > EREMOVE is performed. > > What is the collateral damage caused by ENCLS[EREMOVE]? Have you measured cost of eremove on an empty page? I tried to lookup for a thread from lore because I have a faint memory that it was concluded that its cost irrelevant. Please correct if I'm wrong. BR, Jarkko
Re: [PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows
On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote: > SGX runtimes such as Gramine may implement EDMM-based lazy allocation of > enclave pages and may support MADV_DONTNEED semantics [1]. The former > implies #PF-based page allocation, and the latter implies the usage of > SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl. > > A trivial program like below (run under Gramine and with EDMM enabled) > stresses these two flows in the SGX driver and hangs: > > /* repeatedly touch different enclave pages at random and mix with > * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */ > static void* thread_func(void* arg) { > size_t num_pages = 0xA000 / page_size; > for (int i = 0; i < 5000; i++) { > size_t page = get_random_ulong() % num_pages; > char data = READ_ONCE(((char*)arg)[page * page_size]); > > page = get_random_ulong() % num_pages; > madvise(arg + page * page_size, page_size, MADV_DONTNEED); > } > } > > addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0); > pthread_t threads[16]; > for (int i = 0; i < 16; i++) > pthread_create([i], NULL, thread_func, addr); I'm not convinced that kernel is the problem here but it could be also how Gramine is implemented. So maybe you could make a better case of that. The example looks a bit artificial to me. > > This program uncovers two data races in the SGX driver. The remaining > patches describe and fix these races. > > I performed several stress tests to verify that there are no other data > races (at least with the test program above): > > - On Icelake server with 128GB of PRMRR (EPC), without madvise(). This > stresses the first data race. A Gramine SGX test suite running in the > background for additional stressing. Result: 1,000 runs without hangs > (result without the first bug fix: hangs every time). > - On Icelake server with 128GB of PRMRR (EPC), with madvise(). This > stresses the second data race. A Gramine SGX test suite running in the > background for additional stressing. Result: 1,000 runs without hangs > (result with the first bug fix but without the second bug fix: hangs > approx. once in 50 runs). > - On Icelake server with 4GB of PRMRR (EPC), with madvise(). This > additionally stresses the enclave page swapping flows. Two Gramine SGX > test suites running in the background for additional stressing of > swapping (I observe 100% CPU utilization from ksgxd which confirms that > swapping happens). Result: 1,000 runs without hangs. > > (Sorry for the previous copy of this email, accidentally sent to > sta...@vger.kernel.org. Failed to use `--suppress-cc` during a test send.) > > Dmitrii Kuvaiskii (2): > x86/sgx: Resolve EAUG race where losing thread returns SIGBUS > x86/sgx: Resolve EREMOVE page vs EAUG page data race > > arch/x86/kernel/cpu/sgx/encl.c | 10 +++--- > arch/x86/kernel/cpu/sgx/encl.h | 3 +++ > arch/x86/kernel/cpu/sgx/ioctl.c | 1 + > 3 files changed, 11 insertions(+), 3 deletions(-) BR, Jarkko
Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote: > Two enclave threads may try to access the same non-present enclave page > simultaneously (e.g., if the SGX runtime supports lazy allocation). The > threads will end up in sgx_encl_eaug_page(), racing to acquire the > enclave lock. The winning thread will perform EAUG, set up the page > table entry, and insert the page into encl->page_array. The losing > thread will then get -EBUSY on xa_insert(>page_array) and proceed > to error handling path. And that path removes page. Not sure I got gist of this tbh. > This error handling path contains two bugs: (1) SIGBUS is sent to > userspace even though the enclave page is correctly installed by another > thread, and (2) sgx_encl_free_epc_page() is called that performs EREMOVE > even though the enclave page was never intended to be removed. The first > bug is less severe because it impacts only the user space; the second > bug is more severe because it also impacts the OS state by ripping the > page (added by the winning thread) from the enclave. > > Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux > fault handler so that no signal is sent to userspace, and (2) by > replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no > EREMOVE is performed. What is the collateral damage caused by ENCLS[EREMOVE]? > > Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized > enclave") > Cc: sta...@vger.kernel.org > Reported-by: Marcelina KoĆcielnicka > Suggested-by: Reinette Chatre > Signed-off-by: Dmitrii Kuvaiskii > --- > arch/x86/kernel/cpu/sgx/encl.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c > index 279148e72459..41f14b1a3025 100644 > --- a/arch/x86/kernel/cpu/sgx/encl.c > +++ b/arch/x86/kernel/cpu/sgx/encl.c > @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct > vm_area_struct *vma, >* If ret == -EBUSY then page was created in another flow while >* running without encl->lock >*/ > - if (ret) > + if (ret) { > + if (ret == -EBUSY) > + vmret = VM_FAULT_NOPAGE; > goto err_out_shrink; > + } > > pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page); > pginfo.addr = encl_page->desc & PAGE_MASK; > @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct > vm_area_struct *vma, > err_out_shrink: > sgx_encl_shrink(encl, va_page); > err_out_epc: > - sgx_encl_free_epc_page(epc_page); > + sgx_free_epc_page(epc_page); This ignores check for the page being reclaimer tracked, i.e. it does changes that have been ignored in the commit message. > err_out_unlock: > mutex_unlock(>lock); > kfree(encl_page); BR, Jarkko
Re: [RFC PATCH v2 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Fri Apr 26, 2024 at 5:18 PM EEST, Bojun Zhu wrote: > EDMM's ioctl()s support batch operations, which may be > time-consuming. Try to explicitly give up the CPU as the prefix > operation at the every begin of "for loop" in > sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > to give other tasks a chance to run, and avoid softlockup warning. > > Additionally perform pending signals check as the prefix operation, > and introduce sgx_check_signal_and_resched(), > which wraps all the checks. > > The following has been observed on Linux v6.9-rc5 with kernel > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > is requested to restrict page permissions of a large number of EPC pages. > > [ cut here ] > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! > ... > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > ... > Call Trace: > sgx_ioctl > __x64_sys_ioctl > x64_sys_call > do_syscall_64 > entry_SYSCALL_64_after_hwframe > [ end trace ] > > Signed-off-by: Bojun Zhu > --- > arch/x86/kernel/cpu/sgx/ioctl.c | 40 + > 1 file changed, 36 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c > index b65ab214bdf5..e0645920bcb5 100644 > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > @@ -365,6 +365,20 @@ static int sgx_validate_offset_length(struct sgx_encl > *encl, > return 0; > } > > +/* > + * Check signals and invoke scheduler. Return true for a pending signal. > + */ > +static bool sgx_check_signal_and_resched(void) > +{ > + if (signal_pending(current)) > + return true; > + > + if (need_resched()) > + cond_resched(); > + > + return false; > +} > + > /** > * sgx_ioc_enclave_add_pages() - The handler for %SGX_IOC_ENCLAVE_ADD_PAGES > * @encl: an enclave pointer > @@ -432,16 +446,13 @@ static long sgx_ioc_enclave_add_pages(struct sgx_encl > *encl, void __user *arg) > return -EINVAL; > > for (c = 0 ; c < add_arg.length; c += PAGE_SIZE) { > - if (signal_pending(current)) { > + if (sgx_check_signal_and_resched()) { > if (!c) > ret = -ERESTARTSYS; > > break; > } > > - if (need_resched()) > - cond_resched(); > - > ret = sgx_encl_add_page(encl, add_arg.src + c, add_arg.offset + > c, > , add_arg.flags); > if (ret) > @@ -746,6 +757,13 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > secinfo.flags = modp->permissions & SGX_SECINFO_PERMISSION_MASK; > > for (c = 0 ; c < modp->length; c += PAGE_SIZE) { > + if (sgx_check_signal_and_resched()) { > + if (!c) > + ret = -ERESTARTSYS; > + > + goto out; > + } > + > addr = encl->base + modp->offset + c; > > sgx_reclaim_direct(); > @@ -913,6 +931,13 @@ static long sgx_enclave_modify_types(struct sgx_encl > *encl, > secinfo.flags = page_type << 8; > > for (c = 0 ; c < modt->length; c += PAGE_SIZE) { > + if (sgx_check_signal_and_resched()) { > + if (!c) > + ret = -ERESTARTSYS; > + > + goto out; > + } > + > addr = encl->base + modt->offset + c; > > sgx_reclaim_direct(); > @@ -1101,6 +1126,13 @@ static long sgx_encl_remove_pages(struct sgx_encl > *encl, > secinfo.flags = SGX_SECINFO_R | SGX_SECINFO_W | SGX_SECINFO_X; > > for (c = 0 ; c < params->length; c += PAGE_SIZE) { > + if (sgx_check_signal_and_resched()) { > + if (!c) > + ret = -ERESTARTSYS; > + > + goto out; > + } > + > addr = encl->base + params->offset + c; > > sgx_reclaim_direct(); I think Dave's suggestions make sense, so unfortunately needs yet another spin. BR, Jarkko
Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Fri Apr 26, 2024 at 5:28 PM EEST, Dave Hansen wrote: > On 4/16/24 07:15, Jarkko Sakkinen wrote: > > On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote: > > Yes, exactly. I'd take one week break and cycle the kselftest part > > internally a bit as I said my previous response. I'm sure that there > > is experise inside Intel how to implement it properly. I.e. take some > > time to find the right person, and wait as long as that person has a > > bit of bandwidth to go through the test and suggest modifications. > > Folks, I worry that this series is getting bogged down in the selftests. > Yes, selftests are important. But getting _some_ tests in the kernel > is substantially more important than getting perfect tests. > > I don't think Haitao needs to "cycle" this back inside Intel. The problem with the tests was that they are hard to run anything else than Ubuntu (and perhaps Debian). It is hopefully now taken care of. Selftests do not have to be perfect but at minimum they need to be runnable. I need ret-test the latest series because it is possible that I did not have right flags (I was travelling few days thus have not done it yet). BR, Jarkko
Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Wed Apr 24, 2024 at 10:42 PM EEST, Haitao Huang wrote: > Hi Jarkko > On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen > wrote: > > > On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote: > >> I did declare the configs in the config file but I missed it in my patch > >> as stated earlier. IIUC, that would not cause this error though. > >> > >> Maybe I should exit with the skip code if no CGROUP_MISC (no more > >> CGROUP_SGX_EPC) is configured? > > OK, so I wanted to do a distro kernel test here, and used the default > > OpenSUSE kernel config. I need to check if it has CGROUP_MISC set. > > I couldn't figure out why this failure you have encountered. I think > OpenSUSE kernel most likely config CGROUP_MISC. > > Also if CGROUP_MISC not set, then there should be error happen earlier on > echoing "+misc" to cgroup.subtree_control at line 20. But your log > indicates only error on echoing "sgx_epc ..." to > /sys/fs/cgroup/...//misc.max. > > I can only speculate that can could happen (if sgx epc cgroup was compiled > in) when the cgroup-fs subdirectories in question already have populated > config that is conflicting with the scripts. > > Could you double check or start from a clean environment? > Thanks > Haitao I can re-check next week once I'm back from Estonia. I'm travelling now to https://lu.ma/uncloud. BR, Jarkko
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Wed Apr 24, 2024 at 8:44 PM EEST, Jarkko Sakkinen wrote: > On Wed Apr 24, 2024 at 2:50 PM EEST, Bojun Zhu wrote: > > I still have some questions: > > > > It seems that the variable "ret" is set to 0 if there is **some** EPC pages > > have been > > added when interrupted by signal(Supposed that sgx_encl_add_page() > > always returns successfully). > > Ah, ok. > > Returning zero is right thing to do because it also returns count of > pages successfully added. I.e. the function does not guarantee that > all pages are processsed but it does guarantee that the system is in > predictable state. > > It could be that e.g. sgx_alloc_epc_page() calls fails. > > So, it is a bit like how read system call works. Good that you asked that had rethink for a while. What I would suggest that you just put out 2nd verson out and then we see where it is going. For sgx_ioc_encl_add_pages() it is important to maintain exact semantics as we need to maintain backwards compatibility. BR, Jarkko
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Wed Apr 24, 2024 at 2:50 PM EEST, Bojun Zhu wrote: > I still have some questions: > > It seems that the variable "ret" is set to 0 if there is **some** EPC pages > have been > added when interrupted by signal(Supposed that sgx_encl_add_page() > always returns successfully). Ah, ok. Returning zero is right thing to do because it also returns count of pages successfully added. I.e. the function does not guarantee that all pages are processsed but it does guarantee that the system is in predictable state. It could be that e.g. sgx_alloc_epc_page() calls fails. So, it is a bit like how read system call works. BR, Jarkko
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Wed Apr 24, 2024 at 10:02 AM EEST, Jarkko Sakkinen wrote: > On Wed Apr 24, 2024 at 9:46 AM EEST, Bojun Zhu wrote: > > Based on the the discussion among you, Jarkko and Reinette, > > I will keep the need_resched() and wrap the logic in using sgx_resched(), > > as suggested by Jarkko. > > Sounds like a plan :-) In sgx_ioc_enclave_add_pages() "if (!c)" check might cause possibly some confusion. Reason for it is that in "transaction sense" the operation can be only meaningfully restarted when no pages have not been added as MRENCLAVE checksum cannot be reset. BR, Jarkko
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Wed Apr 24, 2024 at 9:46 AM EEST, Bojun Zhu wrote: > Based on the the discussion among you, Jarkko and Reinette, > I will keep the need_resched() and wrap the logic in using sgx_resched(), > as suggested by Jarkko. Sounds like a plan :-) > Regards, > > Bojun BR, Jarkko
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Tue Apr 23, 2024 at 8:08 PM EEST, Reinette Chatre wrote: > Hi Kai, > > On 4/23/2024 4:50 AM, Huang, Kai wrote: > >> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c > >> b/arch/x86/kernel/cpu/sgx/ioctl.c > >> index b65ab214bdf5..2340a82fa796 100644 > >> --- a/arch/x86/kernel/cpu/sgx/ioctl.c > >> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > >> @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > >>} > >> > >>mutex_unlock(>lock); > >> + > >> + if (need_resched()) > >> + cond_resched(); > >>} > >> > >>ret = 0; > >> @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl > >> *encl, > >>entry->type = page_type; > >> > >>mutex_unlock(>lock); > >> + > >> + if (need_resched()) > >> + cond_resched(); > >>} > >> > >>ret = 0; > >> @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl > >> *encl, > >>kfree(entry); > >> > >>mutex_unlock(>lock); > >> + > >> + if (need_resched()) > >> + cond_resched(); > >>} > >> > > > > You can remove the need_reshced() in all 3 places above but just call > > cond_resched() directly. > > > > This change will call cond_resched() after dealing with each page in a > potentially large page range (cover mentions 30GB but we have also had to > make optimizations for enclaves larger than this). Adding a cond_resched() > here will surely placate the soft lockup detector, but we need to take care > how changes like this impact the performance of the system and having actions > on these page ranges take much longer than necessary. > For reference, please see 7b72c823ddf8 ("x86/sgx: Reduce delay and > interference > of enclave release") that turned frequent cond_resched() into batches > to address performance issues. > > It looks to me like the need_resched() may be a quick check that can be used > to improve performance? I am not familiar with all use cases that need to be > considered to determine if a batching solution may be needed. Ya, well no matter it is the reasoning will need to be documented because this should have symmetry with sgx_ioc_enclave_add_pages() (see my response to Kai). I because this makes dealing with need_resched() a change in code even if it is left out as a side-effect, I'd support of not removing which means adding need_resched() as a side-effect. >From this follows that *if* need_resched() needs to be removed then that is not really part of the bug fix, so in all cases the bug fix itself must include need_resched() :-) phew, hope you got my logic here, i think it reasonable... BR, Jarkko
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Tue Apr 23, 2024 at 2:50 PM EEST, Huang, Kai wrote: > On Tue, 2024-04-23 at 17:25 +0800, æ±äŒŻć(æ°é) wrote: > > EDMM's ioctl()s support batch operations, which may be > > time-consuming. Try to explicitly give up the CPU at > > the every end of "for loop" in > > sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > > to give other tasks a chance to run, and avoid softlockup warning. > > > > The following has been observed on Linux v6.9-rc5 with kernel > > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > > is requested to restrict page permissions of a large number of EPC pages. > > > > [ cut here ] > > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905] > > ... > > CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7 > > ... > > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > > Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e > > f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf 00 00 00 > > 40 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef > > RSP: 0018:b55a6591fa80 EFLAGS: 0202 > > RAX: RBX: b55a6591fac0 RCX: b581e7384000 > > RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000 > > RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0 > > R10: 006e R11: 0002 R12: 00072052d000 > > R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020 > > FS: 7fe10dbda740() GS:92163e48() > > knlGS: > > CS: 0010 DS: ES: CR0: 80050033 > > CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0 > > DR0: DR1: DR2: > > DR3: DR6: fffe07f0 DR7: 0400 > > PKRU: 5554 > > Call Trace: > > > > ? show_regs+0x67/0x70 > > ? watchdog_timer_fn+0x1f3/0x280 > > ? __pfx_watchdog_timer_fn+0x10/0x10 > > ? __hrtimer_run_queues+0xc8/0x220 > > ? hrtimer_interrupt+0x10c/0x250 > > ? __sysvec_apic_timer_interrupt+0x53/0x130 > > ? sysvec_apic_timer_interrupt+0x7b/0x90 > > > > > > ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > > ? sgx_enclave_restrict_permissions+0xba/0x1f0 > > ? __pte_offset_map_lock+0x94/0x110 > > ? sgx_encl_test_and_clear_young_cb+0x40/0x60 > > sgx_ioctl+0x1ab/0x900 > > ? do_syscall_64+0x79/0x110 > > ? apply_to_page_range+0x14/0x20 > > ? sgx_encl_test_and_clear_young+0x6c/0x80 > > ? sgx_vma_fault+0x132/0x4f0 > > __x64_sys_ioctl+0x95/0xd0 > > x64_sys_call+0x1209/0x20c0 > > do_syscall_64+0x6d/0x110 > > ? do_syscall_64+0x79/0x110 > > ? do_pte_missing+0x2e8/0xcc0 > > ? __pte_offset_map+0x1c/0x190 > > ? __handle_mm_fault+0x7b9/0xe60 > > ? __count_memcg_events+0x70/0x100 > > ? handle_mm_fault+0x256/0x360 > > ? do_user_addr_fault+0x3c1/0x860 > > ? irqentry_exit_to_user_mode+0x67/0x190 > > ? irqentry_exit+0x3b/0x50 > > ? exc_page_fault+0x89/0x180 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > RIP: 0033:0x7fe10e2ee5cb > > Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff > > ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 > > ff ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48 > > RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010 > > RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb > > RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005 > > RBP: 0005 R08: R09: 7fffb2c75594 > > R10: 7fffb2c755c8 R11: 0246 R12: c028a405 > > R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980 > > > > [ end trace ] > > Could you trim down the trace to only include the relevant part? > > E.g., please at least remove the two register dumps at the beginning and > end of the trace. > > Please refer to "Backtraces in commit messages" section in > Documentation/process/submitting-patches.rst. > > > > > Signed-off-by: Bojun Zhu > > --- > > arch/x86/kernel/cpu/sgx/ioctl.c | 9 + > > 1 file changed, 9 insertions(+) > > > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c > > b/arch/x86/kernel/cpu/sgx/ioctl.c > > index b65ab214bdf5..2340a82fa796 100644 > > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > > @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > > } > > > > mutex_unlock(>lock); > > + > > + if (need_resched()) > > + cond_resched(); > > } > > > > ret = 0; > > @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl > > *encl, > >
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Wed Apr 24, 2024 at 12:10 AM EEST, Jarkko Sakkinen wrote: > On Tue Apr 23, 2024 at 12:25 PM EEST, =?UTF-8?B?5pyx5Lyv5ZCbKOadsOmTrSk=?= > wrote: > > EDMM's ioctl()s support batch operations, which may be > > time-consuming. Try to explicitly give up the CPU at > > the every end of "for loop" in > > sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > > to give other tasks a chance to run, and avoid softlockup warning. > > > > The following has been observed on Linux v6.9-rc5 with kernel > > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > > is requested to restrict page permissions of a large number of EPC pages. > > > > [ cut here ] > > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905] > > ... > > CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7 > > ... > > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > > Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e > > f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf 00 00 00 > > 40 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef > > RSP: 0018:b55a6591fa80 EFLAGS: 0202 > > RAX: RBX: b55a6591fac0 RCX: b581e7384000 > > RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000 > > RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0 > > R10: 006e R11: 0002 R12: 00072052d000 > > R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020 > > FS: 7fe10dbda740() GS:92163e48() > > knlGS: > > CS: 0010 DS: ES: CR0: 80050033 > > CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0 > > DR0: DR1: DR2: > > DR3: DR6: fffe07f0 DR7: 0400 > > PKRU: 5554 > > Call Trace: > > > > ? show_regs+0x67/0x70 > > ? watchdog_timer_fn+0x1f3/0x280 > > ? __pfx_watchdog_timer_fn+0x10/0x10 > > ? __hrtimer_run_queues+0xc8/0x220 > > ? hrtimer_interrupt+0x10c/0x250 > > ? __sysvec_apic_timer_interrupt+0x53/0x130 > > ? sysvec_apic_timer_interrupt+0x7b/0x90 > > > > > > ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > > ? sgx_enclave_restrict_permissions+0xba/0x1f0 > > ? __pte_offset_map_lock+0x94/0x110 > > ? sgx_encl_test_and_clear_young_cb+0x40/0x60 > > sgx_ioctl+0x1ab/0x900 > > ? do_syscall_64+0x79/0x110 > > ? apply_to_page_range+0x14/0x20 > > ? sgx_encl_test_and_clear_young+0x6c/0x80 > > ? sgx_vma_fault+0x132/0x4f0 > > __x64_sys_ioctl+0x95/0xd0 > > x64_sys_call+0x1209/0x20c0 > > do_syscall_64+0x6d/0x110 > > ? do_syscall_64+0x79/0x110 > > ? do_pte_missing+0x2e8/0xcc0 > > ? __pte_offset_map+0x1c/0x190 > > ? __handle_mm_fault+0x7b9/0xe60 > > ? __count_memcg_events+0x70/0x100 > > ? handle_mm_fault+0x256/0x360 > > ? do_user_addr_fault+0x3c1/0x860 > > ? irqentry_exit_to_user_mode+0x67/0x190 > > ? irqentry_exit+0x3b/0x50 > > ? exc_page_fault+0x89/0x180 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > RIP: 0033:0x7fe10e2ee5cb > > Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff > > ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 > > ff ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48 > > RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010 > > RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb > > RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005 > > RBP: 0005 R08: R09: 7fffb2c75594 > > R10: 7fffb2c755c8 R11: 0246 R12: c028a405 > > R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980 > > > > [ end trace ] > > > > Signed-off-by: Bojun Zhu > > Can you also fixup this as your "firstname lastname" in your emails > from field? This matters so that author field in git log matches your > sob. > > > --- > > arch/x86/kernel/cpu/sgx/ioctl.c | 9 + > > 1 file changed, 9 insertions(+) > > > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c > > b/arch/x86/kernel
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Tue Apr 23, 2024 at 12:25 PM EEST, =?UTF-8?B?5pyx5Lyv5ZCbKOadsOmTrSk=?= wrote: > EDMM's ioctl()s support batch operations, which may be > time-consuming. Try to explicitly give up the CPU at > the every end of "for loop" in > sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > to give other tasks a chance to run, and avoid softlockup warning. > > The following has been observed on Linux v6.9-rc5 with kernel > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > is requested to restrict page permissions of a large number of EPC pages. > > [ cut here ] > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905] > ... > CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7 > ... > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e > f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf 00 00 00 40 > 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef > RSP: 0018:b55a6591fa80 EFLAGS: 0202 > RAX: RBX: b55a6591fac0 RCX: b581e7384000 > RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000 > RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0 > R10: 006e R11: 0002 R12: 00072052d000 > R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020 > FS: 7fe10dbda740() GS:92163e48() > knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0 > DR0: DR1: DR2: > DR3: DR6: fffe07f0 DR7: 0400 > PKRU: 5554 > Call Trace: > > ? show_regs+0x67/0x70 > ? watchdog_timer_fn+0x1f3/0x280 > ? __pfx_watchdog_timer_fn+0x10/0x10 > ? __hrtimer_run_queues+0xc8/0x220 > ? hrtimer_interrupt+0x10c/0x250 > ? __sysvec_apic_timer_interrupt+0x53/0x130 > ? sysvec_apic_timer_interrupt+0x7b/0x90 > > > ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > ? sgx_enclave_restrict_permissions+0xba/0x1f0 > ? __pte_offset_map_lock+0x94/0x110 > ? sgx_encl_test_and_clear_young_cb+0x40/0x60 > sgx_ioctl+0x1ab/0x900 > ? do_syscall_64+0x79/0x110 > ? apply_to_page_range+0x14/0x20 > ? sgx_encl_test_and_clear_young+0x6c/0x80 > ? sgx_vma_fault+0x132/0x4f0 > __x64_sys_ioctl+0x95/0xd0 > x64_sys_call+0x1209/0x20c0 > do_syscall_64+0x6d/0x110 > ? do_syscall_64+0x79/0x110 > ? do_pte_missing+0x2e8/0xcc0 > ? __pte_offset_map+0x1c/0x190 > ? __handle_mm_fault+0x7b9/0xe60 > ? __count_memcg_events+0x70/0x100 > ? handle_mm_fault+0x256/0x360 > ? do_user_addr_fault+0x3c1/0x860 > ? irqentry_exit_to_user_mode+0x67/0x190 > ? irqentry_exit+0x3b/0x50 > ? exc_page_fault+0x89/0x180 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x7fe10e2ee5cb > Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff > ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff > ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48 > RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010 > RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb > RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005 > RBP: 0005 R08: R09: 7fffb2c75594 > R10: 7fffb2c755c8 R11: 0246 R12: c028a405 > R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980 > > [ end trace ] > > Signed-off-by: Bojun Zhu Can you also fixup this as your "firstname lastname" in your emails from field? This matters so that author field in git log matches your sob. > --- > arch/x86/kernel/cpu/sgx/ioctl.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c > index b65ab214bdf5..2340a82fa796 100644 > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > } > > mutex_unlock(>lock); > + > + if (need_resched()) > + cond_resched(); > } > > ret = 0; > @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl > *encl, > entry->type = page_type; > > mutex_unlock(>lock); > + > + if (need_resched()) > + cond_resched(); > } > > ret = 0; > @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl, > kfree(entry); > > mutex_unlock(>lock); > + > + if
timeout triggered probe
Hi I wonder if there is a way to attach a probe to a timeout in process context? This spun off from https://social.kernel.org/notice/Ah18GN3aezWLdcxcAK Also relevant: https://social.kernel.org/notice/Ah2O2qXO32lPRPQkk4 Maybe there is a way already but being able to do this would have practical use for sure if it does not. BR, Jarkko
Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Wed Apr 17, 2024 at 1:04 AM EEST, Haitao Huang wrote: > On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen > wrote: > > > On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote: > >> I did declare the configs in the config file but I missed it in my patch > >> as stated earlier. IIUC, that would not cause this error though. > >> > >> Maybe I should exit with the skip code if no CGROUP_MISC (no more > >> CGROUP_SGX_EPC) is configured? > > > > OK, so I wanted to do a distro kernel test here, and used the default > > OpenSUSE kernel config. I need to check if it has CGROUP_MISC set. > > > >> tools/testing/selftests$ find . -name README > >> ./futex/README > >> ./tc-testing/README > >> ./net/forwarding/README > >> ./powerpc/nx-gzip/README > >> ./ftrace/README > >> ./arm64/signal/README > >> ./arm64/fp/README > >> ./arm64/README > >> ./zram/README > >> ./livepatch/README > >> ./resctrl/README > > > > So is there a README because of override timeout parameter? Maybe it > > should be just set to a high enough value? > > > > BR, Jarkko > > > > > From the docs, I think we are supposed to use override. > See: https://docs.kernel.org/dev-tools/kselftest.html#timeout-for-selftests OK, fair enough :-) I did not know this. BR, Jarkko
Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote: > I did declare the configs in the config file but I missed it in my patch > as stated earlier. IIUC, that would not cause this error though. > > Maybe I should exit with the skip code if no CGROUP_MISC (no more > CGROUP_SGX_EPC) is configured? OK, so I wanted to do a distro kernel test here, and used the default OpenSUSE kernel config. I need to check if it has CGROUP_MISC set. > tools/testing/selftests$ find . -name README > ./futex/README > ./tc-testing/README > ./net/forwarding/README > ./powerpc/nx-gzip/README > ./ftrace/README > ./arm64/signal/README > ./arm64/fp/README > ./arm64/README > ./zram/README > ./livepatch/README > ./resctrl/README So is there a README because of override timeout parameter? Maybe it should be just set to a high enough value? BR, Jarkko
Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote: > > > > I'll send a fixup for this patch or another version of the series if more > > changes are needed. > > Hi Haitao, > > I don't like to say but in general I think you are sending too frequently. > The > last version was sent April, 11th (my time), so considering the weekend it has > only been 3 or at most 4 days. Â > > Please slow down a little bit to give people more time. > > More information please also see: > > https://www.kernel.org/doc/html/next/process/submitting-patches.html#resend-reminders +1 Yes, exactly. I'd take one week break and cycle the kselftest part internally a bit as I said my previous response. I'm sure that there is experise inside Intel how to implement it properly. I.e. take some time to find the right person, and wait as long as that person has a bit of bandwidth to go through the test and suggest modifications. Cannot blame, as I've done the same mistake a few times in past but yeah this would be the best possible corrective action to take. BR, Jarkko
Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Tue Apr 16, 2024 at 5:05 PM EEST, Jarkko Sakkinen wrote: > On Tue Apr 16, 2024 at 6:20 AM EEST, Haitao Huang wrote: > > With different cgroups, the script starts one or multiple concurrent SGX > > selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed > > test case, which loads an enclave of EPC size equal to the EPC capacity > > available on the platform. The script checks results against the > > expectation set for each cgroup and reports success or failure. > > > > The script creates 3 different cgroups at the beginning with following > > expectations: > > > > 1) SMALL - intentionally small enough to fail the test loading an > > enclave of size equal to the capacity. > > 2) LARGE - large enough to run up to 4 concurrent tests but fail some if > > more than 4 concurrent tests are run. The script starts 4 expecting at > > least one test to pass, and then starts 5 expecting at least one test > > to fail. > > 3) LARGER - limit is the same as the capacity, large enough to run lots of > > concurrent tests. The script starts 8 of them and expects all pass. > > Then it reruns the same test with one process randomly killed and > > usage checked to be zero after all processes exit. > > > > The script also includes a test with low mem_cg limit and LARGE sgx_epc > > limit to verify that the RAM used for per-cgroup reclamation is charged > > to a proper mem_cg. For this test, it turns off swapping before start, > > and turns swapping back on afterwards. > > > > Add README to document how to run the tests. > > > > Signed-off-by: Haitao Huang > > jarkko@mustatorvisieni:~/linux-tpmdd> sudo make -C > tools/testing/selftests/sgx run_tests > make: Entering directory > '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx' > gcc -Wall -Werror -g > -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include > -fPIC -c main.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o > gcc -Wall -Werror -g > -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include > -fPIC -c load.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o > gcc -Wall -Werror -g > -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include > -fPIC -c sigstruct.c -o > /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o > gcc -Wall -Werror -g > -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include > -fPIC -c call.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o > gcc -Wall -Werror -g > -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include > -fPIC -c sign_key.S -o > /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o > gcc -Wall -Werror -g > -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include > -fPIC -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_sgx > /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o > /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o > /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o > /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o > /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o -z > noexecstack -lcrypto > gcc -Wall -Werror -static-pie -nostdlib -ffreestanding -fPIE > -fno-stack-protector -mrdrnd > -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include > test_encl.c test_encl_bootstrap.S -o > /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_encl.elf > -Wl,-T,test_encl.lds,--build-id=none > /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: > warning: /tmp/ccqvDJVg.o: missing .note.GNU-stack section implies executable > stack > /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: > NOTE: This behaviour is deprecated and will be removed in a future version of > the linker > TAP version 13 > 1..2 > # timeout set to 45 > # selftests: sgx: test_sgx > # TAP version 13 > # 1..16 > # # Starting 16 tests from 1 test cases. > # # RUN enclave.unclobbered_vdso ... > # #OK enclave.unclobbered_vdso > # ok 1 enclave.unclobbered_vdso > # # RUN enclave.unclobbered_vdso_oversubscribed ... > # #OK enclave.unclobbered_vdso_oversubscribed > # ok 2 enclave.unclobbered_vdso_oversubscribed > # # RUN enclave.unclobbered_vdso_oversubscribed_remove ... > # # main.c:402:unclobbered_vdso_oversubscribed_remove:Creating an enclave > with 98566144 bytes heap may take a while ... > # # main.c:457:unclobbered_vdso_oversubscribed_remove:Changing type of > 98566144 bytes to trimmed m
Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Tue Apr 16, 2024 at 6:20 AM EEST, Haitao Huang wrote: > With different cgroups, the script starts one or multiple concurrent SGX > selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed > test case, which loads an enclave of EPC size equal to the EPC capacity > available on the platform. The script checks results against the > expectation set for each cgroup and reports success or failure. > > The script creates 3 different cgroups at the beginning with following > expectations: > > 1) SMALL - intentionally small enough to fail the test loading an > enclave of size equal to the capacity. > 2) LARGE - large enough to run up to 4 concurrent tests but fail some if > more than 4 concurrent tests are run. The script starts 4 expecting at > least one test to pass, and then starts 5 expecting at least one test > to fail. > 3) LARGER - limit is the same as the capacity, large enough to run lots of > concurrent tests. The script starts 8 of them and expects all pass. > Then it reruns the same test with one process randomly killed and > usage checked to be zero after all processes exit. > > The script also includes a test with low mem_cg limit and LARGE sgx_epc > limit to verify that the RAM used for per-cgroup reclamation is charged > to a proper mem_cg. For this test, it turns off swapping before start, > and turns swapping back on afterwards. > > Add README to document how to run the tests. > > Signed-off-by: Haitao Huang jarkko@mustatorvisieni:~/linux-tpmdd> sudo make -C tools/testing/selftests/sgx run_tests make: Entering directory '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx' gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC -c main.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC -c load.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC -c sigstruct.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC -c call.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC -c sign_key.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_sgx /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o -z noexecstack -lcrypto gcc -Wall -Werror -static-pie -nostdlib -ffreestanding -fPIE -fno-stack-protector -mrdrnd -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include test_encl.c test_encl_bootstrap.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_encl.elf -Wl,-T,test_encl.lds,--build-id=none /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: warning: /tmp/ccqvDJVg.o: missing .note.GNU-stack section implies executable stack /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker TAP version 13 1..2 # timeout set to 45 # selftests: sgx: test_sgx # TAP version 13 # 1..16 # # Starting 16 tests from 1 test cases. # # RUN enclave.unclobbered_vdso ... # #OK enclave.unclobbered_vdso # ok 1 enclave.unclobbered_vdso # # RUN enclave.unclobbered_vdso_oversubscribed ... # #OK enclave.unclobbered_vdso_oversubscribed # ok 2 enclave.unclobbered_vdso_oversubscribed # # RUN enclave.unclobbered_vdso_oversubscribed_remove ... # # main.c:402:unclobbered_vdso_oversubscribed_remove:Creating an enclave with 98566144 bytes heap may take a while ... # # main.c:457:unclobbered_vdso_oversubscribed_remove:Changing type of 98566144 bytes to trimmed may take a while ... # # main.c:473:unclobbered_vdso_oversubscribed_remove:Entering enclave to run EACCEPT for each page of 98566144 bytes may take a while ... # # main.c:494:unclobbered_vdso_oversubscribed_remove:Removing 98566144 bytes from enclave may take a while ... # #OK enclave.unclobbered_vdso_oversubscribed_remove # ok 3 enclave.unclobbered_vdso_oversubscribed_remove # # RUN enclave.clobbered_vdso ... # #OK enclave.clobbered_vdso # ok 4 enclave.clobbered_vdso # # RUN enclave.clobbered_vdso_and_user_function ... # #OK
Re: [PATCH v11 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Mon Apr 15, 2024 at 8:32 PM EEST, Haitao Huang wrote: > On Sat, 13 Apr 2024 16:34:17 -0500, Jarkko Sakkinen > wrote: > > > On Wed Apr 10, 2024 at 9:25 PM EEST, Haitao Huang wrote: > >> To run selftests for EPC cgroup: > >> > >> sudo ./run_epc_cg_selftests.sh > >> > >> To watch misc cgroup 'current' changes during testing, run this in a > >> separate terminal: > >> > >> ./watch_misc_for_tests.sh current > >> > >> With different cgroups, the script starts one or multiple concurrent SGX > >> selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed > >> test case, which loads an enclave of EPC size equal to the EPC capacity > >> available on the platform. The script checks results against the > >> expectation set for each cgroup and reports success or failure. > >> > >> The script creates 3 different cgroups at the beginning with following > >> expectations: > >> > >> 1) SMALL - intentionally small enough to fail the test loading an > >> enclave of size equal to the capacity. > >> 2) LARGE - large enough to run up to 4 concurrent tests but fail some if > >> more than 4 concurrent tests are run. The script starts 4 expecting at > >> least one test to pass, and then starts 5 expecting at least one test > >> to fail. > >> 3) LARGER - limit is the same as the capacity, large enough to run lots > >> of > >> concurrent tests. The script starts 8 of them and expects all pass. > >> Then it reruns the same test with one process randomly killed and > >> usage checked to be zero after all processes exit. > >> > >> The script also includes a test with low mem_cg limit and LARGE sgx_epc > >> limit to verify that the RAM used for per-cgroup reclamation is charged > >> to a proper mem_cg. For this test, it turns off swapping before start, > >> and turns swapping back on afterwards. > >> > >> Signed-off-by: Haitao Huang > >> --- > >> V11: > >> - Remove cgroups-tools dependency and make scripts ash compatible. > >> (Jarkko) > >> - Drop support for cgroup v1 and simplify. (Michal, Jarkko) > >> - Add documentation for functions. (Jarkko) > >> - Turn off swapping before memcontrol tests and back on after > >> - Format and style fixes, name for hard coded values > >> > >> V7: > >> - Added memcontrol test. > >> > >> V5: > >> - Added script with automatic results checking, remove the interactive > >> script. > >> - The script can run independent from the series below. > >> --- > >> tools/testing/selftests/sgx/ash_cgexec.sh | 16 + > >> .../selftests/sgx/run_epc_cg_selftests.sh | 275 ++ > >> .../selftests/sgx/watch_misc_for_tests.sh | 11 + > >> 3 files changed, 302 insertions(+) > >> create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh > >> create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh > >> > >> diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh > >> b/tools/testing/selftests/sgx/ash_cgexec.sh > >> new file mode 100755 > >> index ..cfa5d2b0e795 > >> --- /dev/null > >> +++ b/tools/testing/selftests/sgx/ash_cgexec.sh > >> @@ -0,0 +1,16 @@ > >> +#!/usr/bin/env sh > >> +# SPDX-License-Identifier: GPL-2.0 > >> +# Copyright(c) 2024 Intel Corporation. > >> + > >> +# Start a program in a given cgroup. > >> +# Supports V2 cgroup paths, relative to /sys/fs/cgroup > >> +if [ "$#" -lt 2 ]; then > >> +echo "Usage: $0 [args...]" > >> +exit 1 > >> +fi > >> +# Move this shell to the cgroup. > >> +echo 0 >/sys/fs/cgroup/$1/cgroup.procs > >> +shift > >> +# Execute the command within the cgroup > >> +exec "$@" > >> + > >> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> new file mode 100755 > >> index ..dd56273056fc > >> --- /dev/null > >> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> @@ -0,0 +1,275 @@ > >> +#!/usr/bin/env sh > >> +# SPDX-License-Identifier: GPL-2.0 > >> +# Copyright(c) 2023, 2024 Intel Corporation. > >> + > >> +TEST_ROOT_CG=selftest >
Re: [PATCH v11 01/14] x86/sgx: Replace boolean parameters with enums
On Mon Apr 15, 2024 at 4:22 PM EEST, Huang, Kai wrote: > On Wed, 2024-04-10 at 11:25 -0700, Haitao Huang wrote: > > Replace boolean parameters for 'reclaim' in the function > > sgx_alloc_epc_page() and its callers with an enum. > > > > Also opportunistically remove non-static declaration of > > __sgx_alloc_epc_page() and a typo > > > > Signed-off-by: Haitao Huang > > Suggested-by: Jarkko Sakkinen > > Suggested-by: Dave Hansen > > > > Reviewed-by: Kai Huang Hmm... missing my reviewed-by so: Reviewed-by: Jarkko Sakkinen BR, Jarkko
Re: [PATCH v11 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Mon Apr 15, 2024 at 6:13 AM EEST, Haitao Huang wrote: > On Sun, 14 Apr 2024 10:01:03 -0500, Jarkko Sakkinen > wrote: > > > On Wed Apr 10, 2024 at 9:25 PM EEST, Haitao Huang wrote: > >> To run selftests for EPC cgroup: > >> > >> sudo ./run_epc_cg_selftests.sh > >> > >> To watch misc cgroup 'current' changes during testing, run this in a > >> separate terminal: > >> > >> ./watch_misc_for_tests.sh current > >> > >> With different cgroups, the script starts one or multiple concurrent SGX > >> selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed > >> test case, which loads an enclave of EPC size equal to the EPC capacity > >> available on the platform. The script checks results against the > >> expectation set for each cgroup and reports success or failure. > >> > >> The script creates 3 different cgroups at the beginning with following > >> expectations: > >> > >> 1) SMALL - intentionally small enough to fail the test loading an > >> enclave of size equal to the capacity. > >> 2) LARGE - large enough to run up to 4 concurrent tests but fail some if > >> more than 4 concurrent tests are run. The script starts 4 expecting at > >> least one test to pass, and then starts 5 expecting at least one test > >> to fail. > >> 3) LARGER - limit is the same as the capacity, large enough to run lots > >> of > >> concurrent tests. The script starts 8 of them and expects all pass. > >> Then it reruns the same test with one process randomly killed and > >> usage checked to be zero after all processes exit. > >> > >> The script also includes a test with low mem_cg limit and LARGE sgx_epc > >> limit to verify that the RAM used for per-cgroup reclamation is charged > >> to a proper mem_cg. For this test, it turns off swapping before start, > >> and turns swapping back on afterwards. > >> > >> Signed-off-by: Haitao Huang > >> --- > >> V11: > >> - Remove cgroups-tools dependency and make scripts ash compatible. > >> (Jarkko) > >> - Drop support for cgroup v1 and simplify. (Michal, Jarkko) > >> - Add documentation for functions. (Jarkko) > >> - Turn off swapping before memcontrol tests and back on after > >> - Format and style fixes, name for hard coded values > >> > >> V7: > >> - Added memcontrol test. > >> > >> V5: > >> - Added script with automatic results checking, remove the interactive > >> script. > >> - The script can run independent from the series below. > >> --- > >> tools/testing/selftests/sgx/ash_cgexec.sh | 16 + > >> .../selftests/sgx/run_epc_cg_selftests.sh | 275 ++ > >> .../selftests/sgx/watch_misc_for_tests.sh | 11 + > >> 3 files changed, 302 insertions(+) > >> create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh > >> create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh > >> > >> diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh > >> b/tools/testing/selftests/sgx/ash_cgexec.sh > >> new file mode 100755 > >> index ..cfa5d2b0e795 > >> --- /dev/null > >> +++ b/tools/testing/selftests/sgx/ash_cgexec.sh > >> @@ -0,0 +1,16 @@ > >> +#!/usr/bin/env sh > >> +# SPDX-License-Identifier: GPL-2.0 > >> +# Copyright(c) 2024 Intel Corporation. > >> + > >> +# Start a program in a given cgroup. > >> +# Supports V2 cgroup paths, relative to /sys/fs/cgroup > >> +if [ "$#" -lt 2 ]; then > >> +echo "Usage: $0 [args...]" > >> +exit 1 > >> +fi > >> +# Move this shell to the cgroup. > >> +echo 0 >/sys/fs/cgroup/$1/cgroup.procs > >> +shift > >> +# Execute the command within the cgroup > >> +exec "$@" > >> + > >> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> new file mode 100755 > >> index ..dd56273056fc > >> --- /dev/null > >> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> @@ -0,0 +1,275 @@ > >> +#!/usr/bin/env sh > >> +# SPDX-License-Identifier: GPL-2.0 > >> +# Copyright(c) 2023, 2024 Intel Corporation. > >> + > >> +TEST_ROOT_CG=selftest >
Re: [PATCH v11 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Wed Apr 10, 2024 at 9:25 PM EEST, Haitao Huang wrote: > To run selftests for EPC cgroup: > > sudo ./run_epc_cg_selftests.sh > > To watch misc cgroup 'current' changes during testing, run this in a > separate terminal: > > ./watch_misc_for_tests.sh current > > With different cgroups, the script starts one or multiple concurrent SGX > selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed > test case, which loads an enclave of EPC size equal to the EPC capacity > available on the platform. The script checks results against the > expectation set for each cgroup and reports success or failure. > > The script creates 3 different cgroups at the beginning with following > expectations: > > 1) SMALL - intentionally small enough to fail the test loading an > enclave of size equal to the capacity. > 2) LARGE - large enough to run up to 4 concurrent tests but fail some if > more than 4 concurrent tests are run. The script starts 4 expecting at > least one test to pass, and then starts 5 expecting at least one test > to fail. > 3) LARGER - limit is the same as the capacity, large enough to run lots of > concurrent tests. The script starts 8 of them and expects all pass. > Then it reruns the same test with one process randomly killed and > usage checked to be zero after all processes exit. > > The script also includes a test with low mem_cg limit and LARGE sgx_epc > limit to verify that the RAM used for per-cgroup reclamation is charged > to a proper mem_cg. For this test, it turns off swapping before start, > and turns swapping back on afterwards. > > Signed-off-by: Haitao Huang > --- > V11: > - Remove cgroups-tools dependency and make scripts ash compatible. (Jarkko) > - Drop support for cgroup v1 and simplify. (Michal, Jarkko) > - Add documentation for functions. (Jarkko) > - Turn off swapping before memcontrol tests and back on after > - Format and style fixes, name for hard coded values > > V7: > - Added memcontrol test. > > V5: > - Added script with automatic results checking, remove the interactive > script. > - The script can run independent from the series below. > --- > tools/testing/selftests/sgx/ash_cgexec.sh | 16 + > .../selftests/sgx/run_epc_cg_selftests.sh | 275 ++ > .../selftests/sgx/watch_misc_for_tests.sh | 11 + > 3 files changed, 302 insertions(+) > create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh > create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh > create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh > > diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh > b/tools/testing/selftests/sgx/ash_cgexec.sh > new file mode 100755 > index ..cfa5d2b0e795 > --- /dev/null > +++ b/tools/testing/selftests/sgx/ash_cgexec.sh > @@ -0,0 +1,16 @@ > +#!/usr/bin/env sh > +# SPDX-License-Identifier: GPL-2.0 > +# Copyright(c) 2024 Intel Corporation. > + > +# Start a program in a given cgroup. > +# Supports V2 cgroup paths, relative to /sys/fs/cgroup > +if [ "$#" -lt 2 ]; then > +echo "Usage: $0 [args...]" > +exit 1 > +fi > +# Move this shell to the cgroup. > +echo 0 >/sys/fs/cgroup/$1/cgroup.procs > +shift > +# Execute the command within the cgroup > +exec "$@" > + > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > new file mode 100755 > index ..dd56273056fc > --- /dev/null > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > @@ -0,0 +1,275 @@ > +#!/usr/bin/env sh > +# SPDX-License-Identifier: GPL-2.0 > +# Copyright(c) 2023, 2024 Intel Corporation. > + > +TEST_ROOT_CG=selftest > +TEST_CG_SUB1=$TEST_ROOT_CG/test1 > +TEST_CG_SUB2=$TEST_ROOT_CG/test2 > +# We will only set limit in test1 and run tests in test3 > +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3 > +TEST_CG_SUB4=$TEST_ROOT_CG/test4 > + > +# Cgroup v2 only > +CG_ROOT=/sys/fs/cgroup > +mkdir -p $CG_ROOT/$TEST_CG_SUB1 > +mkdir -p $CG_ROOT/$TEST_CG_SUB2 > +mkdir -p $CG_ROOT/$TEST_CG_SUB3 > +mkdir -p $CG_ROOT/$TEST_CG_SUB4 > + > +# Turn on misc and memory controller in non-leaf nodes > +echo "+misc" > $CG_ROOT/cgroup.subtree_control && \ > +echo "+memory" > $CG_ROOT/cgroup.subtree_control && \ > +echo "+misc" > $CG_ROOT/$TEST_ROOT_CG/cgroup.subtree_control && \ > +echo "+memory" > $CG_ROOT/$TEST_ROOT_CG/cgroup.subtree_control && \ > +echo "+misc" > $CG_ROOT/$TEST_CG_SUB1/cgroup.subtree_control > +if [ $? -ne 0 ]; then > +echo "# Failed setting up cgroups, make sure misc and memory cgroups are > enabled." > +exit 1 > +fi > + > +CAPACITY=$(grep "sgx_epc" "$CG_ROOT/misc.capacity" | awk '{print $2}') > +# This is below number of VA pages needed for enclave of capacity size. So > +# should fail oversubscribed cases > +SMALL=$(( CAPACITY / 512 )) > + > +# At least load one enclave of capacity size successfully, maybe up to 4. > +# But some may fail if we run more than 4 concurrent enclaves of capacity > size. > +LARGE=$(( SMALL * 4 )) > + >
Re: [PATCH v11 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Wed Apr 10, 2024 at 9:25 PM EEST, Haitao Huang wrote: > To run selftests for EPC cgroup: > > sudo ./run_epc_cg_selftests.sh > > To watch misc cgroup 'current' changes during testing, run this in a > separate terminal: > > ./watch_misc_for_tests.sh current > > With different cgroups, the script starts one or multiple concurrent SGX > selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed > test case, which loads an enclave of EPC size equal to the EPC capacity > available on the platform. The script checks results against the > expectation set for each cgroup and reports success or failure. > > The script creates 3 different cgroups at the beginning with following > expectations: > > 1) SMALL - intentionally small enough to fail the test loading an > enclave of size equal to the capacity. > 2) LARGE - large enough to run up to 4 concurrent tests but fail some if > more than 4 concurrent tests are run. The script starts 4 expecting at > least one test to pass, and then starts 5 expecting at least one test > to fail. > 3) LARGER - limit is the same as the capacity, large enough to run lots of > concurrent tests. The script starts 8 of them and expects all pass. > Then it reruns the same test with one process randomly killed and > usage checked to be zero after all processes exit. > > The script also includes a test with low mem_cg limit and LARGE sgx_epc > limit to verify that the RAM used for per-cgroup reclamation is charged > to a proper mem_cg. For this test, it turns off swapping before start, > and turns swapping back on afterwards. > > Signed-off-by: Haitao Huang > --- > V11: > - Remove cgroups-tools dependency and make scripts ash compatible. (Jarkko) > - Drop support for cgroup v1 and simplify. (Michal, Jarkko) > - Add documentation for functions. (Jarkko) > - Turn off swapping before memcontrol tests and back on after > - Format and style fixes, name for hard coded values > > V7: > - Added memcontrol test. > > V5: > - Added script with automatic results checking, remove the interactive > script. > - The script can run independent from the series below. > --- > tools/testing/selftests/sgx/ash_cgexec.sh | 16 + > .../selftests/sgx/run_epc_cg_selftests.sh | 275 ++ > .../selftests/sgx/watch_misc_for_tests.sh | 11 + > 3 files changed, 302 insertions(+) > create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh > create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh > create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh > > diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh > b/tools/testing/selftests/sgx/ash_cgexec.sh > new file mode 100755 > index ..cfa5d2b0e795 > --- /dev/null > +++ b/tools/testing/selftests/sgx/ash_cgexec.sh > @@ -0,0 +1,16 @@ > +#!/usr/bin/env sh > +# SPDX-License-Identifier: GPL-2.0 > +# Copyright(c) 2024 Intel Corporation. > + > +# Start a program in a given cgroup. > +# Supports V2 cgroup paths, relative to /sys/fs/cgroup > +if [ "$#" -lt 2 ]; then > +echo "Usage: $0 [args...]" > +exit 1 > +fi > +# Move this shell to the cgroup. > +echo 0 >/sys/fs/cgroup/$1/cgroup.procs > +shift > +# Execute the command within the cgroup > +exec "$@" > + > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > new file mode 100755 > index ..dd56273056fc > --- /dev/null > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > @@ -0,0 +1,275 @@ > +#!/usr/bin/env sh > +# SPDX-License-Identifier: GPL-2.0 > +# Copyright(c) 2023, 2024 Intel Corporation. > + > +TEST_ROOT_CG=selftest > +TEST_CG_SUB1=$TEST_ROOT_CG/test1 > +TEST_CG_SUB2=$TEST_ROOT_CG/test2 > +# We will only set limit in test1 and run tests in test3 > +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3 > +TEST_CG_SUB4=$TEST_ROOT_CG/test4 > + > +# Cgroup v2 only > +CG_ROOT=/sys/fs/cgroup > +mkdir -p $CG_ROOT/$TEST_CG_SUB1 > +mkdir -p $CG_ROOT/$TEST_CG_SUB2 > +mkdir -p $CG_ROOT/$TEST_CG_SUB3 > +mkdir -p $CG_ROOT/$TEST_CG_SUB4 > + > +# Turn on misc and memory controller in non-leaf nodes > +echo "+misc" > $CG_ROOT/cgroup.subtree_control && \ > +echo "+memory" > $CG_ROOT/cgroup.subtree_control && \ > +echo "+misc" > $CG_ROOT/$TEST_ROOT_CG/cgroup.subtree_control && \ > +echo "+memory" > $CG_ROOT/$TEST_ROOT_CG/cgroup.subtree_control && \ > +echo "+misc" > $CG_ROOT/$TEST_CG_SUB1/cgroup.subtree_control > +if [ $? -ne 0 ]; then > +echo "# Failed setting up cgroups, make sure misc and memory cgroups are > enabled." > +exit 1 > +fi > + > +CAPACITY=$(grep "sgx_epc" "$CG_ROOT/misc.capacity" | awk '{print $2}') > +# This is below number of VA pages needed for enclave of capacity size. So > +# should fail oversubscribed cases > +SMALL=$(( CAPACITY / 512 )) > + > +# At least load one enclave of capacity size successfully, maybe up to 4. > +# But some may fail if we run more than 4 concurrent enclaves of capacity > size. > +LARGE=$(( SMALL * 4 )) > + >
Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup
On Fri Apr 5, 2024 at 6:07 AM EEST, Huang, Kai wrote: > On Thu, 2024-04-04 at 12:05 -0500, Haitao Huang wrote: > > > > -static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg) > > > > +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, > > > > enum sgx_reclaim r) > > > > > > Is the @r here intentional for shorter typing? > > > > > > > yes :-) > > Will speel out to make it consistent if that's the concern. > > I kinda prefer the full name to match the CONFIG_CGROUP_SGX_EPC on case. You > can put the 'enum sgx_reclaim reclaim' parameter into the new line if needed: Why don't cgroups for SGX get always enabled when SGX and cgroups support are enabled? BR, Jarkko
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Tue Apr 2, 2024 at 8:40 PM EEST, Michal KoutnĂœ wrote: > On Tue, Apr 02, 2024 at 11:20:21AM -0500, Haitao Huang > wrote: > > Do we really want to have it implemented in c? > > I only pointed to the available C boilerplate. > > > There are much fewer lines of > > code in shell scripts. Note we are not really testing basic cgroup stuff. > > All we needed were creating/deleting cgroups and set limits which I think > > have been demonstrated feasible in the ash scripts now. > > I assume you refer to > Message-Id: <20240331174442.51019-1-haitao.hu...@linux.intel.com> > right? > > Could it be even simpler if you didn't stick to cgtools APIs and v1 > compatibility? > > Reducing ash_cgexec.sh to something like > echo 0 >$R/$1/cgroup.procs > shift > exec "$@" > (with some small builerplate for $R and previous mkdirs) I already asked about necessity of v1 in some response, and fully support this idea. Then cgexec can simply be a function wrapping along the lines what you proposed. BR, Jarkko
Re: [PATCH v2] selftests/sgx: Improve cgroup test scripts
On Tue Apr 2, 2024 at 8:31 PM EEST, Haitao Huang wrote: > On Tue, 02 Apr 2024 02:43:25 -0500, Jarkko Sakkinen > wrote: > > > On Tue Apr 2, 2024 at 4:42 AM EEST, Haitao Huang wrote: > >> Make cgroup test scripts ash compatible. > >> Remove cg-tools dependency. > >> Add documentation for functions. > >> > >> Tested with busybox on Ubuntu. > >> > >> Signed-off-by: Haitao Huang > >> --- > >> v2: > >> - Fixes for v2 cgroup > >> - Turn off swapping before memcontrol tests and back on after > >> - Add comments and reformat > >> --- > >> tools/testing/selftests/sgx/ash_cgexec.sh | 57 ++ > >> .../selftests/sgx/run_epc_cg_selftests.sh | 187 +++--- > >> .../selftests/sgx/watch_misc_for_tests.sh | 13 +- > >> 3 files changed, 179 insertions(+), 78 deletions(-) > >> create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh > >> > >> diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh > >> b/tools/testing/selftests/sgx/ash_cgexec.sh > >> new file mode 100755 > >> index ..9607784378df > >> --- /dev/null > >> +++ b/tools/testing/selftests/sgx/ash_cgexec.sh > >> @@ -0,0 +1,57 @@ > >> +#!/usr/bin/env sh > >> +# SPDX-License-Identifier: GPL-2.0 > >> +# Copyright(c) 2024 Intel Corporation. > >> + > >> +# Move the current shell process to the specified cgroup > >> +# Arguments: > >> +# $1 - The cgroup controller name, e.g., misc, memory. > >> +# $2 - The path of the cgroup, > >> +# relative to /sys/fs/cgroup for cgroup v2, > >> +# relative to /sys/fs/cgroup/$1 for v1. > >> +move_to_cgroup() { > >> +controllers="$1" > >> +path="$2" > >> + > >> +# Check if cgroup v2 is in use > >> +if [ ! -d "/sys/fs/cgroup/misc" ]; then > >> +# Cgroup v2 logic > >> +cgroup_full_path="/sys/fs/cgroup/${path}" > >> +echo $$ > "${cgroup_full_path}/cgroup.procs" > >> +else > >> +# Cgroup v1 logic > >> +OLD_IFS="$IFS" > >> +IFS=',' > >> +for controller in $controllers; do > >> +cgroup_full_path="/sys/fs/cgroup/${controller}/${path}" > >> +echo $$ > "${cgroup_full_path}/tasks" > >> +done > >> +IFS="$OLD_IFS" > >> +fi > > > > I think that if you could point me to git v10 and all this I could > > then quite easily create test image and see what I get from that. > > > > I will code review the whole thing but this is definitely good > > enough to start testing this series properly! Thanks for the > > effort with this. The payback from this comes after the feature > > is mainline. We have now sort of reference of the usage patterns > > and less layers when we need to debug any possible (likely) bugs > > in the future. > > > > This is definitely to the right direction. I'm just wondering do > > we want to support v1 cgroups or would it make sense support only > > v2? > > BR, Jarkko > > > I can drop v1. I think most distro now support v2. > Created this branch to host these changes so far: > https://github.com/haitaohuang/linux/tree/sgx_cg_upstream_v10_plus Thanks! I'll point my build to that, make a test image and report the results. BR, Jarkko
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Tue Apr 2, 2024 at 7:20 PM EEST, Haitao Huang wrote: > On Tue, 02 Apr 2024 06:58:40 -0500, Jarkko Sakkinen > wrote: > > > On Tue Apr 2, 2024 at 2:23 PM EEST, Michal KoutnĂœ wrote: > >> Hello. > >> > >> On Sat, Mar 30, 2024 at 01:26:08PM +0200, Jarkko Sakkinen > >> wrote: > >> > > > It'd be more complicated and less readable to do all the stuff > >> without the > >> > > > cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only > >> depends > >> > > > on libc so I hope this would not cause too much inconvenience. > >> > > > >> > > As per cgroup-tools, please prove this. It makes the job for more > >> > > complicated *for you* and you are making the job more complicated > >> > > to every possible person in the planet running any kernel QA. > >> > > > >> > > I weight the latter more than the former. And it is exactly the > >> > > reason why we did custom user space kselftest in the first place. > >> > > Let's keep the tradition. All I can say is that kselftest is > >> > > unfinished in its current form. > >> > > > >> > > What is "esp cgexec"? > >> > > >> > Also in kselftest we don't drive ultimate simplicity, we drive > >> > efficient CI/QA. By open coding something like subset of > >> > cgroup-tools needed to run the test you also help us later > >> > on to backtrack the kernel changes. With cgroups-tools you > >> > would have to use strace to get the same info. > >> > >> FWIW, see also functions in > >> tools/testing/selftests/cgroup/cgroup_util.{h,c}. > >> They likely cover what you need already -- if the tests are in C. > >> > >> (I admit that stuff in tools/testing/selftests/cgroup/ is best > >> understood with strace.) > > > > Thanks! > > > > My conclusions are that: > > > > 1. We probably cannot move the test part of cgroup test itself > >given the enclave payload dependency. > > 2. I think it makes sense to still follow the same pattern as > >other cgroups test and re-use cgroup_util.[ch] functionaltiy. > > > > So yeah I guess we need two test programs instead of one. > > > > Something along the lines: > > > > 1. main.[ch] -> test_sgx.[ch] > > 2. introduce test_sgx_cgroup.c > > > > And test_sgx_cgroup.c would be implement similar test as the shell > > script and would follow the structure of existing cgroups tests. > > > >> > >> HTH, > >> Michal > > > > BR, Jarkko > > > Do we really want to have it implemented in c? There are much fewer lines > of code in shell scripts. Note we are not really testing basic cgroup > stuff. All we needed were creating/deleting cgroups and set limits which I > think have been demonstrated feasible in the ash scripts now. > > Given Dave's comments, and test scripts being working and cover the cases > needed IMHO, I don't see much need to move to c code. I can add more cases > if needed and fall back a c implementation later if any case can't be > implemented in scripts. How about that? We can settle to: ash + no dependencies. I guess you have for that all the work done already. BR, Jarkko
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Tue Apr 2, 2024 at 6:42 PM EEST, Dave Hansen wrote: > On 3/30/24 04:23, Jarkko Sakkinen wrote: > >>> I also wonder is cgroup-tools dependency absolutely required or could > >>> you just have a function that would interact with sysfs? > >> I should have checked email before hit the send button for v10 ï. > >> > >> It'd be more complicated and less readable to do all the stuff without the > >> > >> cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only depends > >> on libc so I hope this would not cause too much inconvenience. > > As per cgroup-tools, please prove this. It makes the job for more > > complicated *for you* and you are making the job more complicated > > to every possible person in the planet running any kernel QA. > > I don't see any other use of cgroup-tools in testing/selftests. > > I *DO* see a ton of /bin/bash use though. I wouldn't go to much trouble > to make the thing ash-compatible. > > That said, the most important thing is to get some selftests in place. > If using cgroup-tools means we get actual, runnable tests in place, > that's a heck of a lot more important than making them perfect. > Remember, almost nobody uses SGX. It's available on *VERY* few systems > from one CPU vendor and only in very specific hardware configurations. Ash-compatible is good enough for me, so let's draw the line there. Ash-compatibility does not cause any major hurdle as can we seen from Haitao's patch. Earlier version was not even POSIX-compatible, given that it used hard-coded path. Most of the added stuff come open coding the tools but in the test code that is not the big deal, and helps with debugging in the future. Even right now it helps reviewing kernel patches because it documents exactly how the feature is seen from user space. BR, Jarkko
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Tue Apr 2, 2024 at 2:23 PM EEST, Michal KoutnĂœ wrote: > Hello. > > On Sat, Mar 30, 2024 at 01:26:08PM +0200, Jarkko Sakkinen > wrote: > > > > It'd be more complicated and less readable to do all the stuff without > > > > the > > > > cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only > > > > depends > > > > on libc so I hope this would not cause too much inconvenience. > > > > > > As per cgroup-tools, please prove this. It makes the job for more > > > complicated *for you* and you are making the job more complicated > > > to every possible person in the planet running any kernel QA. > > > > > > I weight the latter more than the former. And it is exactly the > > > reason why we did custom user space kselftest in the first place. > > > Let's keep the tradition. All I can say is that kselftest is > > > unfinished in its current form. > > > > > > What is "esp cgexec"? > > > > Also in kselftest we don't drive ultimate simplicity, we drive > > efficient CI/QA. By open coding something like subset of > > cgroup-tools needed to run the test you also help us later > > on to backtrack the kernel changes. With cgroups-tools you > > would have to use strace to get the same info. > > FWIW, see also functions in > tools/testing/selftests/cgroup/cgroup_util.{h,c}. > They likely cover what you need already -- if the tests are in C. > > (I admit that stuff in tools/testing/selftests/cgroup/ is best > understood with strace.) Thanks! My conclusions are that: 1. We probably cannot move the test part of cgroup test itself given the enclave payload dependency. 2. I think it makes sense to still follow the same pattern as other cgroups test and re-use cgroup_util.[ch] functionaltiy. So yeah I guess we need two test programs instead of one. Something along the lines: 1. main.[ch] -> test_sgx.[ch] 2. introduce test_sgx_cgroup.c And test_sgx_cgroup.c would be implement similar test as the shell script and would follow the structure of existing cgroups tests. > > HTH, > Michal BR, Jarkko
Re: [PATCH v2] selftests/sgx: Improve cgroup test scripts
On Tue Apr 2, 2024 at 4:42 AM EEST, Haitao Huang wrote: > Make cgroup test scripts ash compatible. > Remove cg-tools dependency. > Add documentation for functions. > > Tested with busybox on Ubuntu. > > Signed-off-by: Haitao Huang > --- > v2: > - Fixes for v2 cgroup > - Turn off swapping before memcontrol tests and back on after > - Add comments and reformat > --- > tools/testing/selftests/sgx/ash_cgexec.sh | 57 ++ > .../selftests/sgx/run_epc_cg_selftests.sh | 187 +++--- > .../selftests/sgx/watch_misc_for_tests.sh | 13 +- > 3 files changed, 179 insertions(+), 78 deletions(-) > create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh > > diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh > b/tools/testing/selftests/sgx/ash_cgexec.sh > new file mode 100755 > index ..9607784378df > --- /dev/null > +++ b/tools/testing/selftests/sgx/ash_cgexec.sh > @@ -0,0 +1,57 @@ > +#!/usr/bin/env sh > +# SPDX-License-Identifier: GPL-2.0 > +# Copyright(c) 2024 Intel Corporation. > + > +# Move the current shell process to the specified cgroup > +# Arguments: > +#$1 - The cgroup controller name, e.g., misc, memory. > +#$2 - The path of the cgroup, > +#relative to /sys/fs/cgroup for cgroup v2, > +#relative to /sys/fs/cgroup/$1 for v1. > +move_to_cgroup() { > +controllers="$1" > +path="$2" > + > +# Check if cgroup v2 is in use > +if [ ! -d "/sys/fs/cgroup/misc" ]; then > +# Cgroup v2 logic > +cgroup_full_path="/sys/fs/cgroup/${path}" > +echo $$ > "${cgroup_full_path}/cgroup.procs" > +else > +# Cgroup v1 logic > +OLD_IFS="$IFS" > +IFS=',' > +for controller in $controllers; do > +cgroup_full_path="/sys/fs/cgroup/${controller}/${path}" > +echo $$ > "${cgroup_full_path}/tasks" > +done > +IFS="$OLD_IFS" > +fi I think that if you could point me to git v10 and all this I could then quite easily create test image and see what I get from that. I will code review the whole thing but this is definitely good enough to start testing this series properly! Thanks for the effort with this. The payback from this comes after the feature is mainline. We have now sort of reference of the usage patterns and less layers when we need to debug any possible (likely) bugs in the future. This is definitely to the right direction. I'm just wondering do we want to support v1 cgroups or would it make sense support only v2? BR, Jarkko
Re: [PATCH] selftests/sgx: Improve cgroup test scripts
On Tue Apr 2, 2024 at 1:55 AM EEST, Haitao Huang wrote: > On Mon, 01 Apr 2024 09:22:21 -0500, Jarkko Sakkinen > wrote: > > > On Sun Mar 31, 2024 at 8:44 PM EEST, Haitao Huang wrote: > >> Make cgroup test scripts ash compatible. > >> Remove cg-tools dependency. > >> Add documentation for functions. > >> > >> Tested with busybox on Ubuntu. > >> > >> Signed-off-by: Haitao Huang > > > > I'll run this next week on good old NUC7. Thank you. > > > > I really wish that either (hopefully both) Intel or AMD would bring up > > for developers home use meant platform to develop on TDX and SNP. It is > > a shame that the latest and greatest is from 2018. > > > > BR, Jarkko > > > > Argh, missed a few changes for v2 cgroup: > > --- a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > @@ -15,6 +15,8 @@ CG_MEM_ROOT=/sys/fs/cgroup > CG_V1=0 > if [ ! -d "/sys/fs/cgroup/misc" ]; then > echo "# cgroup V2 is in use." > +echo "+misc" > $CG_MISC_ROOT/cgroup.subtree_control > +echo "+memory" > $CG_MEM_ROOT/cgroup.subtree_control > else > echo "# cgroup V1 is in use." > CG_MISC_ROOT=/sys/fs/cgroup/misc > @@ -26,6 +28,11 @@ mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB2 > mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB3 > mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB4 > > +if [ $CG_V1 -eq 0 ]; then > +echo "+misc" > $CG_MISC_ROOT/$TEST_ROOT_CG/cgroup.subtree_control > +echo "+misc" > $CG_MISC_ROOT/$TEST_CG_SUB1/cgroup.subtree_control > +fi Maybe it would be most convenient just to +1 the kselftest patch? Alternatively you could point out to a Git branch with the series and the updated patch. BR, Jarkko
Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality
On Mon Apr 1, 2024 at 12:29 PM EEST, Huang, Kai wrote: > On Sat, 2024-03-30 at 13:17 +0200, Jarkko Sakkinen wrote: > > On Thu Mar 28, 2024 at 2:53 PM EET, Huang, Kai wrote: > > > > > > > --- /dev/null > > > > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > > > > @@ -0,0 +1,74 @@ > > > > +// SPDX-License-Identifier: GPL-2.0 > > > > +// Copyright(c) 2022 Intel Corporation. > > > > > > It's 2024 now. > > > > > > And looks you need to use C style comment for /* Copyright ... */, after > > > looking > > > at some other C files. > > > > To be fair, this happens *all the time* to everyone :-) > > > > I've proposed this few times in SGX context and going to say it now. > > Given the nature of Git copyrights would anyway need to be sorted by > > the Git log not possibly incorrect copyright platters in the header > > and source files. > > > > Sure fine to me either way. Thanks for pointing out. > > I have some vague memory that we should update the year but I guess I was > wrong. I think updating year makes sense! I'd be fine not having copyright platter at all since the commit is from Intel domain anyway but if it is kept then the year needs to be corrected. I mean Git commit stores all the data, including exact date. BR, Jarkko
Re: [PATCH] selftests/sgx: Improve cgroup test scripts
On Sun Mar 31, 2024 at 8:44 PM EEST, Haitao Huang wrote: > Make cgroup test scripts ash compatible. > Remove cg-tools dependency. > Add documentation for functions. > > Tested with busybox on Ubuntu. > > Signed-off-by: Haitao Huang I'll run this next week on good old NUC7. Thank you. I really wish that either (hopefully both) Intel or AMD would bring up for developers home use meant platform to develop on TDX and SNP. It is a shame that the latest and greatest is from 2018. BR, Jarkko
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Sun Mar 31, 2024 at 8:35 PM EEST, Haitao Huang wrote: > On Sun, 31 Mar 2024 11:19:04 -0500, Jarkko Sakkinen > wrote: > > > On Sat Mar 30, 2024 at 5:32 PM EET, Haitao Huang wrote: > >> On Sat, 30 Mar 2024 06:15:14 -0500, Jarkko Sakkinen > >> wrote: > >> > >> > On Thu Mar 28, 2024 at 5:54 AM EET, Haitao Huang wrote: > >> >> On Wed, 27 Mar 2024 07:55:34 -0500, Jarkko Sakkinen > >> > >> >> wrote: > >> >> > >> >> > On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: > >> >> >> The scripts rely on cgroup-tools package from libcgroup [1]. > >> >> >> > >> >> >> To run selftests for epc cgroup: > >> >> >> > >> >> >> sudo ./run_epc_cg_selftests.sh > >> >> >> > >> >> >> To watch misc cgroup 'current' changes during testing, run this > >> in a > >> >> >> separate terminal: > >> >> >> > >> >> >> ./watch_misc_for_tests.sh current > >> >> >> > >> >> >> With different cgroups, the script starts one or multiple > >> concurrent > >> >> >> SGX > >> >> >> selftests, each to run one unclobbered_vdso_oversubscribed > >> test.Each > >> >> >> of such test tries to load an enclave of EPC size equal to the EPC > >> >> >> capacity available on the platform. The script checks results > >> against > >> >> >> the expectation set for each cgroup and reports success or > >> failure. > >> >> >> > >> >> >> The script creates 3 different cgroups at the beginning with > >> >> >> following > >> >> >> expectations: > >> >> >> > >> >> >> 1) SMALL - intentionally small enough to fail the test loading an > >> >> >> enclave of size equal to the capacity. > >> >> >> 2) LARGE - large enough to run up to 4 concurrent tests but fail > >> some > >> >> >> if > >> >> >> more than 4 concurrent tests are run. The script starts 4 > >> expecting > >> >> >> at > >> >> >> least one test to pass, and then starts 5 expecting at least one > >> test > >> >> >> to fail. > >> >> >> 3) LARGER - limit is the same as the capacity, large enough to run > >> >> >> lots of > >> >> >> concurrent tests. The script starts 8 of them and expects all > >> pass. > >> >> >> Then it reruns the same test with one process randomly killed and > >> >> >> usage checked to be zero after all process exit. > >> >> >> > >> >> >> The script also includes a test with low mem_cg limit and LARGE > >> >> >> sgx_epc > >> >> >> limit to verify that the RAM used for per-cgroup reclamation is > >> >> >> charged > >> >> >> to a proper mem_cg. > >> >> >> > >> >> >> [1] https://github.com/libcgroup/libcgroup/blob/main/README > >> >> >> > >> >> >> Signed-off-by: Haitao Huang > >> >> >> --- > >> >> >> V7: > >> >> >> - Added memcontrol test. > >> >> >> > >> >> >> V5: > >> >> >> - Added script with automatic results checking, remove the > >> >> >> interactive > >> >> >> script. > >> >> >> - The script can run independent from the series below. > >> >> >> --- > >> >> >> .../selftests/sgx/run_epc_cg_selftests.sh | 246 > >> >> >> ++ > >> >> >> .../selftests/sgx/watch_misc_for_tests.sh | 13 + > >> >> >> 2 files changed, 259 insertions(+) > >> >> >> create mode 100755 > >> >> >> tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> >> >> create mode 100755 > >> >> >> tools/testing/selftests/sgx/watch_misc_for_tests.sh > >> >> >> > >> >> >> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> >> >> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Sat Mar 30, 2024 at 5:32 PM EET, Haitao Huang wrote: > On Sat, 30 Mar 2024 06:15:14 -0500, Jarkko Sakkinen > wrote: > > > On Thu Mar 28, 2024 at 5:54 AM EET, Haitao Huang wrote: > >> On Wed, 27 Mar 2024 07:55:34 -0500, Jarkko Sakkinen > >> wrote: > >> > >> > On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: > >> >> The scripts rely on cgroup-tools package from libcgroup [1]. > >> >> > >> >> To run selftests for epc cgroup: > >> >> > >> >> sudo ./run_epc_cg_selftests.sh > >> >> > >> >> To watch misc cgroup 'current' changes during testing, run this in a > >> >> separate terminal: > >> >> > >> >> ./watch_misc_for_tests.sh current > >> >> > >> >> With different cgroups, the script starts one or multiple concurrent > >> >> SGX > >> >> selftests, each to run one unclobbered_vdso_oversubscribed test.Each > >> >> of such test tries to load an enclave of EPC size equal to the EPC > >> >> capacity available on the platform. The script checks results against > >> >> the expectation set for each cgroup and reports success or failure. > >> >> > >> >> The script creates 3 different cgroups at the beginning with > >> >> following > >> >> expectations: > >> >> > >> >> 1) SMALL - intentionally small enough to fail the test loading an > >> >> enclave of size equal to the capacity. > >> >> 2) LARGE - large enough to run up to 4 concurrent tests but fail some > >> >> if > >> >> more than 4 concurrent tests are run. The script starts 4 expecting > >> >> at > >> >> least one test to pass, and then starts 5 expecting at least one test > >> >> to fail. > >> >> 3) LARGER - limit is the same as the capacity, large enough to run > >> >> lots of > >> >> concurrent tests. The script starts 8 of them and expects all pass. > >> >> Then it reruns the same test with one process randomly killed and > >> >> usage checked to be zero after all process exit. > >> >> > >> >> The script also includes a test with low mem_cg limit and LARGE > >> >> sgx_epc > >> >> limit to verify that the RAM used for per-cgroup reclamation is > >> >> charged > >> >> to a proper mem_cg. > >> >> > >> >> [1] https://github.com/libcgroup/libcgroup/blob/main/README > >> >> > >> >> Signed-off-by: Haitao Huang > >> >> --- > >> >> V7: > >> >> - Added memcontrol test. > >> >> > >> >> V5: > >> >> - Added script with automatic results checking, remove the > >> >> interactive > >> >> script. > >> >> - The script can run independent from the series below. > >> >> --- > >> >> .../selftests/sgx/run_epc_cg_selftests.sh | 246 > >> >> ++ > >> >> .../selftests/sgx/watch_misc_for_tests.sh | 13 + > >> >> 2 files changed, 259 insertions(+) > >> >> create mode 100755 > >> >> tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> >> create mode 100755 > >> >> tools/testing/selftests/sgx/watch_misc_for_tests.sh > >> >> > >> >> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> >> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> >> new file mode 100755 > >> >> index ..e027bf39f005 > >> >> --- /dev/null > >> >> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> >> @@ -0,0 +1,246 @@ > >> >> +#!/bin/bash > >> > > >> > This is not portable and neither does hold in the wild. > >> > > >> > It does not even often hold as it is not uncommon to place bash > >> > to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has > >> > a path that is neither of those two. > >> > > >> > Should be #!/usr/bin/env bash > >> > > >> > That is POSIX compatible form. > >> > > >> > >> Sure > >> > >> > Just got around trying to test this in NUC7 so looking into this in > >> > more detail. > >> > >
Re: [PATCH v10 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Thu Mar 28, 2024 at 2:22 AM EET, Haitao Huang wrote: > The scripts rely on cgroup-tools package from libcgroup [1]. > > To run selftests for epc cgroup: > > sudo ./run_epc_cg_selftests.sh > > To watch misc cgroup 'current' changes during testing, run this in a > separate terminal: > > ./watch_misc_for_tests.sh current > > With different cgroups, the script starts one or multiple concurrent SGX > selftests, each to run one unclobbered_vdso_oversubscribed test. Each > of such test tries to load an enclave of EPC size equal to the EPC > capacity available on the platform. The script checks results against > the expectation set for each cgroup and reports success or failure. > > The script creates 3 different cgroups at the beginning with following > expectations: > > 1) SMALL - intentionally small enough to fail the test loading an > enclave of size equal to the capacity. > 2) LARGE - large enough to run up to 4 concurrent tests but fail some if > more than 4 concurrent tests are run. The script starts 4 expecting at > least one test to pass, and then starts 5 expecting at least one test > to fail. > 3) LARGER - limit is the same as the capacity, large enough to run lots of > concurrent tests. The script starts 8 of them and expects all pass. > Then it reruns the same test with one process randomly killed and > usage checked to be zero after all process exit. > > The script also includes a test with low mem_cg limit and LARGE sgx_epc > limit to verify that the RAM used for per-cgroup reclamation is charged > to a proper mem_cg. > > [1] https://github.com/libcgroup/libcgroup/blob/main/README > > Signed-off-by: Haitao Huang My previous comments and you have two undocumented dependencies for your selftest (I searched for cgexec and cgroups-tools as keywords). BR, Jarkko
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Sat Mar 30, 2024 at 1:23 PM EET, Jarkko Sakkinen wrote: > On Thu Mar 28, 2024 at 2:57 AM EET, Haitao Huang wrote: > > On Wed, 27 Mar 2024 11:56:35 -0500, Jarkko Sakkinen > > wrote: > > > > > On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote: > > >> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: > > >> > The scripts rely on cgroup-tools package from libcgroup [1]. > > >> > > > >> > To run selftests for epc cgroup: > > >> > > > >> > sudo ./run_epc_cg_selftests.sh > > >> > > > >> > To watch misc cgroup 'current' changes during testing, run this in a > > >> > separate terminal: > > >> > > > >> > ./watch_misc_for_tests.sh current > > >> > > > >> > With different cgroups, the script starts one or multiple concurrent > > >> > SGX > > >> > selftests, each to run one unclobbered_vdso_oversubscribed test.> Each > > >> > of such test tries to load an enclave of EPC size equal to the EPC > > >> > capacity available on the platform. The script checks results against > > >> > the expectation set for each cgroup and reports success or failure. > > >> > > > >> > The script creates 3 different cgroups at the beginning with > > >> > following > > >> > expectations: > > >> > > > >> > 1) SMALL - intentionally small enough to fail the test loading an > > >> > enclave of size equal to the capacity. > > >> > 2) LARGE - large enough to run up to 4 concurrent tests but fail some > > >> > if > > >> > more than 4 concurrent tests are run. The script starts 4 expecting > > >> > at > > >> > least one test to pass, and then starts 5 expecting at least one test > > >> > to fail. > > >> > 3) LARGER - limit is the same as the capacity, large enough to run > > >> > lots of > > >> > concurrent tests. The script starts 8 of them and expects all pass. > > >> > Then it reruns the same test with one process randomly killed and > > >> > usage checked to be zero after all process exit. > > >> > > > >> > The script also includes a test with low mem_cg limit and LARGE > > >> > sgx_epc > > >> > limit to verify that the RAM used for per-cgroup reclamation is > > >> > charged > > >> > to a proper mem_cg. > > >> > > > >> > [1] https://github.com/libcgroup/libcgroup/blob/main/README > > >> > > > >> > Signed-off-by: Haitao Huang > > >> > --- > > >> > V7: > > >> > - Added memcontrol test. > > >> > > > >> > V5: > > >> > - Added script with automatic results checking, remove the > > >> > interactive > > >> > script. > > >> > - The script can run independent from the series below. > > >> > --- > > >> > .../selftests/sgx/run_epc_cg_selftests.sh | 246 > > >> > ++ > > >> > .../selftests/sgx/watch_misc_for_tests.sh | 13 + > > >> > 2 files changed, 259 insertions(+) > > >> > create mode 100755 > > >> > tools/testing/selftests/sgx/run_epc_cg_selftests.sh > > >> > create mode 100755 > > >> > tools/testing/selftests/sgx/watch_misc_for_tests.sh > > >> > > > >> > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > > >> > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > > >> > new file mode 100755 > > >> > index ..e027bf39f005 > > >> > --- /dev/null > > >> > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > > >> > @@ -0,0 +1,246 @@ > > >> > +#!/bin/bash > > >> > > >> This is not portable and neither does hold in the wild. > > >> > > >> It does not even often hold as it is not uncommon to place bash > > >> to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has > > >> a path that is neither of those two. > > >> > > >> Should be #!/usr/bin/env bash > > >> > > >> That is POSIX compatible form. > > >> > > >> Just got around trying to test this in NUC7 so looking into this in > > >> more detail. &
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Thu Mar 28, 2024 at 2:57 AM EET, Haitao Huang wrote: > On Wed, 27 Mar 2024 11:56:35 -0500, Jarkko Sakkinen > wrote: > > > On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote: > >> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: > >> > The scripts rely on cgroup-tools package from libcgroup [1]. > >> > > >> > To run selftests for epc cgroup: > >> > > >> > sudo ./run_epc_cg_selftests.sh > >> > > >> > To watch misc cgroup 'current' changes during testing, run this in a > >> > separate terminal: > >> > > >> > ./watch_misc_for_tests.sh current > >> > > >> > With different cgroups, the script starts one or multiple concurrent > >> > SGX > >> > selftests, each to run one unclobbered_vdso_oversubscribed test.> Each > >> > of such test tries to load an enclave of EPC size equal to the EPC > >> > capacity available on the platform. The script checks results against > >> > the expectation set for each cgroup and reports success or failure. > >> > > >> > The script creates 3 different cgroups at the beginning with > >> > following > >> > expectations: > >> > > >> > 1) SMALL - intentionally small enough to fail the test loading an > >> > enclave of size equal to the capacity. > >> > 2) LARGE - large enough to run up to 4 concurrent tests but fail some > >> > if > >> > more than 4 concurrent tests are run. The script starts 4 expecting > >> > at > >> > least one test to pass, and then starts 5 expecting at least one test > >> > to fail. > >> > 3) LARGER - limit is the same as the capacity, large enough to run > >> > lots of > >> > concurrent tests. The script starts 8 of them and expects all pass. > >> > Then it reruns the same test with one process randomly killed and > >> > usage checked to be zero after all process exit. > >> > > >> > The script also includes a test with low mem_cg limit and LARGE > >> > sgx_epc > >> > limit to verify that the RAM used for per-cgroup reclamation is > >> > charged > >> > to a proper mem_cg. > >> > > >> > [1] https://github.com/libcgroup/libcgroup/blob/main/README > >> > > >> > Signed-off-by: Haitao Huang > >> > --- > >> > V7: > >> > - Added memcontrol test. > >> > > >> > V5: > >> > - Added script with automatic results checking, remove the > >> > interactive > >> > script. > >> > - The script can run independent from the series below. > >> > --- > >> > .../selftests/sgx/run_epc_cg_selftests.sh | 246 > >> > ++ > >> > .../selftests/sgx/watch_misc_for_tests.sh | 13 + > >> > 2 files changed, 259 insertions(+) > >> > create mode 100755 > >> > tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> > create mode 100755 > >> > tools/testing/selftests/sgx/watch_misc_for_tests.sh > >> > > >> > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> > new file mode 100755 > >> > index ..e027bf39f005 > >> > --- /dev/null > >> > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> > @@ -0,0 +1,246 @@ > >> > +#!/bin/bash > >> > >> This is not portable and neither does hold in the wild. > >> > >> It does not even often hold as it is not uncommon to place bash > >> to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has > >> a path that is neither of those two. > >> > >> Should be #!/usr/bin/env bash > >> > >> That is POSIX compatible form. > >> > >> Just got around trying to test this in NUC7 so looking into this in > >> more detail. > >> > >> That said can you make the script work with just "#!/usr/bin/env sh" > >> and make sure that it is busybox ash compatible? > >> > >> I don't see any necessity to make this bash only and it adds to the > >> compilation time of the image. Otherwise lot of this could be tested > >> just with qemu+bzImage+busybox(inside initramfs). > >> > >> Now you are adding fully glibc shenanigans for the sake of syntax &g
Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality
On Thu Mar 28, 2024 at 2:53 PM EET, Huang, Kai wrote: > > > --- /dev/null > > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > > @@ -0,0 +1,74 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +// Copyright(c) 2022 Intel Corporation. > > It's 2024 now. > > And looks you need to use C style comment for /* Copyright ... */, after > looking > at some other C files. To be fair, this happens *all the time* to everyone :-) I've proposed this few times in SGX context and going to say it now. Given the nature of Git copyrights would anyway need to be sorted by the Git log not possibly incorrect copyright platters in the header and source files. BR, Jarkko
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Thu Mar 28, 2024 at 5:54 AM EET, Haitao Huang wrote: > On Wed, 27 Mar 2024 07:55:34 -0500, Jarkko Sakkinen > wrote: > > > On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: > >> The scripts rely on cgroup-tools package from libcgroup [1]. > >> > >> To run selftests for epc cgroup: > >> > >> sudo ./run_epc_cg_selftests.sh > >> > >> To watch misc cgroup 'current' changes during testing, run this in a > >> separate terminal: > >> > >> ./watch_misc_for_tests.sh current > >> > >> With different cgroups, the script starts one or multiple concurrent > >> SGX > >> selftests, each to run one unclobbered_vdso_oversubscribed test.Each > >> of such test tries to load an enclave of EPC size equal to the EPC > >> capacity available on the platform. The script checks results against > >> the expectation set for each cgroup and reports success or failure. > >> > >> The script creates 3 different cgroups at the beginning with > >> following > >> expectations: > >> > >> 1) SMALL - intentionally small enough to fail the test loading an > >> enclave of size equal to the capacity. > >> 2) LARGE - large enough to run up to 4 concurrent tests but fail some > >> if > >> more than 4 concurrent tests are run. The script starts 4 expecting > >> at > >> least one test to pass, and then starts 5 expecting at least one test > >> to fail. > >> 3) LARGER - limit is the same as the capacity, large enough to run > >> lots of > >> concurrent tests. The script starts 8 of them and expects all pass. > >> Then it reruns the same test with one process randomly killed and > >> usage checked to be zero after all process exit. > >> > >> The script also includes a test with low mem_cg limit and LARGE > >> sgx_epc > >> limit to verify that the RAM used for per-cgroup reclamation is > >> charged > >> to a proper mem_cg. > >> > >> [1] https://github.com/libcgroup/libcgroup/blob/main/README > >> > >> Signed-off-by: Haitao Huang > >> --- > >> V7: > >> - Added memcontrol test. > >> > >> V5: > >> - Added script with automatic results checking, remove the > >> interactive > >> script. > >> - The script can run independent from the series below. > >> --- > >> .../selftests/sgx/run_epc_cg_selftests.sh | 246 > >> ++ > >> .../selftests/sgx/watch_misc_for_tests.sh | 13 + > >> 2 files changed, 259 insertions(+) > >> create mode 100755 > >> tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> create mode 100755 > >> tools/testing/selftests/sgx/watch_misc_for_tests.sh > >> > >> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> new file mode 100755 > >> index ..e027bf39f005 > >> --- /dev/null > >> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> @@ -0,0 +1,246 @@ > >> +#!/bin/bash > > > > This is not portable and neither does hold in the wild. > > > > It does not even often hold as it is not uncommon to place bash > > to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has > > a path that is neither of those two. > > > > Should be #!/usr/bin/env bash > > > > That is POSIX compatible form. > > > > Sure > > > Just got around trying to test this in NUC7 so looking into this in > > more detail. > > Thanks. Could you please check if this version works for you? > > https://github.com/haitaohuang/linux/commit/3c424b841cf3cf66b085a424f4b537fbc3bbff6f > > > > > That said can you make the script work with just "#!/usr/bin/env sh" > > and make sure that it is busybox ash compatible? > > Yes. > > > > > I don't see any necessity to make this bash only and it adds to the > > compilation time of the image. Otherwise lot of this could be tested > > just with qemu+bzImage+busybox(inside initramfs). > > > > will still need cgroup-tools as you pointed out later. Compiling from its > upstream code OK? Can you explain why you need it? What is the thing you cannot do without it? BR, Jarkko
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote: > On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: > > The scripts rely on cgroup-tools package from libcgroup [1]. > > > > To run selftests for epc cgroup: > > > > sudo ./run_epc_cg_selftests.sh > > > > To watch misc cgroup 'current' changes during testing, run this in a > > separate terminal: > > > > ./watch_misc_for_tests.sh current > > > > With different cgroups, the script starts one or multiple concurrent > > SGX > > selftests, each to run one unclobbered_vdso_oversubscribed test. > > Each > > of such test tries to load an enclave of EPC size equal to the EPC > > capacity available on the platform. The script checks results against > > the expectation set for each cgroup and reports success or failure. > > > > The script creates 3 different cgroups at the beginning with > > following > > expectations: > > > > 1) SMALL - intentionally small enough to fail the test loading an > > enclave of size equal to the capacity. > > 2) LARGE - large enough to run up to 4 concurrent tests but fail some > > if > > more than 4 concurrent tests are run. The script starts 4 expecting > > at > > least one test to pass, and then starts 5 expecting at least one test > > to fail. > > 3) LARGER - limit is the same as the capacity, large enough to run > > lots of > > concurrent tests. The script starts 8 of them and expects all pass. > > Then it reruns the same test with one process randomly killed and > > usage checked to be zero after all process exit. > > > > The script also includes a test with low mem_cg limit and LARGE > > sgx_epc > > limit to verify that the RAM used for per-cgroup reclamation is > > charged > > to a proper mem_cg. > > > > [1] https://github.com/libcgroup/libcgroup/blob/main/README > > > > Signed-off-by: Haitao Huang > > --- > > V7: > > - Added memcontrol test. > > > > V5: > > - Added script with automatic results checking, remove the > > interactive > > script. > > - The script can run independent from the series below. > > --- > >  .../selftests/sgx/run_epc_cg_selftests.sh | 246 > > ++ > >  .../selftests/sgx/watch_misc_for_tests.sh | 13 + > >  2 files changed, 259 insertions(+) > >  create mode 100755 > > tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >  create mode 100755 > > tools/testing/selftests/sgx/watch_misc_for_tests.sh > > > > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > > new file mode 100755 > > index ..e027bf39f005 > > --- /dev/null > > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > > @@ -0,0 +1,246 @@ > > +#!/bin/bash > > This is not portable and neither does hold in the wild. > > It does not even often hold as it is not uncommon to place bash > to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has > a path that is neither of those two. > > Should be #!/usr/bin/env bash > > That is POSIX compatible form. > > Just got around trying to test this in NUC7 so looking into this in > more detail. > > That said can you make the script work with just "#!/usr/bin/env sh" > and make sure that it is busybox ash compatible? > > I don't see any necessity to make this bash only and it adds to the > compilation time of the image. Otherwise lot of this could be tested > just with qemu+bzImage+busybox(inside initramfs). > > Now you are adding fully glibc shenanigans for the sake of syntax > sugar. > > > +# SPDX-License-Identifier: GPL-2.0 > > +# Copyright(c) 2023 Intel Corporation. > > + > > +TEST_ROOT_CG=selftest > > +cgcreate -g misc:$TEST_ROOT_CG > > How do you know that cgcreate exists? It is used a lot in the script > with no check for the existence. Please fix e.g. with "command -v > cgreate". > > > +if [ $? -ne 0 ]; then > > +   echo "# Please make sure cgroup-tools is installed, and misc > > cgroup is mounted." > > +   exit 1 > > +fi > > And please do not do it this way. Also, please remove the advice for > "cgroups-tool". This is not meant to be debian only. Better would be > to e.g. point out the URL of the upstream project. > > And yeah the whole message should be based on "command -v", not like > this. > > > +TEST_CG_SUB1=$TEST_ROOT_CG/test1 > > +TEST_CG_SUB2=$TEST_ROOT_CG/test2 > > +# We will only
Re: [PATCH v7 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue Mar 26, 2024 at 8:42 PM EET, Conor Dooley wrote: > On Tue, Mar 26, 2024 at 03:46:16PM +0200, Jarkko Sakkinen wrote: > > Tacing with kprobes while running a monolithic kernel is currently > > impossible due the kernel module allocator dependency. > > > > Address the issue by implementing textmem API for RISC-V. > > This doesn't compile for nommu: > /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:10:46: error: > 'MODULES_VADDR' undeclared (first use in this function) > /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:11:37: error: > 'MODULES_END' undeclared (first use in this function) > /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:14:1: error: control > reaches end of non-void function [-Werror=return-type] > Clang builds also report: > ../arch/riscv/kernel/execmem.c:8:56: warning: omitting the parameter name in > a function definition is a C2x extension [-Wc2x-extensions] > > > > > Link: https://www.sochub.fi # for power on testing new SoC's with a minimal > > stack > > Link: > > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # > > continuation > > Signed-off-by: Jarkko Sakkinen > > --- > > v5-v7: > > - No changes. > > v4: > > - Include linux/execmem.h. > > v3: > > - Architecture independent parts have been split to separate patches. > > - Do not change arch/riscv/kernel/module.c as it is out of scope for > > this patch set now. > > Meta comment. I dunno when v1 was sent, but versions can you please > relax with submitting new versions of your patches? There's conversations > ongoing on v5 at the moment, while this is a more recent version. v2 > seems to have been sent on the 23rd and there's been 5 versions in the > last day: > https://patchwork.kernel.org/project/linux-riscv/list/?submitter=195059=* > > Could you please also try and use a cover letter for patchsets, ideally > with a consistent subject? Otherwise I have to manually mark stuff as > superseded. Point taken but the work has been taken over by Mark and relevant changes are now sucked into that patch set. > Thanks, > Conor. BR, Jarkko
Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag
On Wed, 2024-03-27 at 02:42 +, Edgecombe, Rick P wrote: > On Tue, 2024-03-26 at 13:57 +0200, Jarkko Sakkinen wrote: > > In which conditions which path is used during the initialization of > > mm > > and why is this the case? It is an open claim in the current form. > > There is an arch_pick_mmap_layout() that arch's can have their own > rules for. There is also a > generic one. It gets called during exec. > > > > > That would be nice to have documented for the sake of being > > complete > > description. I have zero doubts of the claim being untrue. > > ...being untrue? > I mean I believe the change itself makes sense, it is just not fully documented in the commit message. BR, Jarkko
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: > The scripts rely on cgroup-tools package from libcgroup [1]. > > To run selftests for epc cgroup: > > sudo ./run_epc_cg_selftests.sh > > To watch misc cgroup 'current' changes during testing, run this in a > separate terminal: > > ./watch_misc_for_tests.sh current > > With different cgroups, the script starts one or multiple concurrent > SGX > selftests, each to run one unclobbered_vdso_oversubscribed test. > Each > of such test tries to load an enclave of EPC size equal to the EPC > capacity available on the platform. The script checks results against > the expectation set for each cgroup and reports success or failure. > > The script creates 3 different cgroups at the beginning with > following > expectations: > > 1) SMALL - intentionally small enough to fail the test loading an > enclave of size equal to the capacity. > 2) LARGE - large enough to run up to 4 concurrent tests but fail some > if > more than 4 concurrent tests are run. The script starts 4 expecting > at > least one test to pass, and then starts 5 expecting at least one test > to fail. > 3) LARGER - limit is the same as the capacity, large enough to run > lots of > concurrent tests. The script starts 8 of them and expects all pass. > Then it reruns the same test with one process randomly killed and > usage checked to be zero after all process exit. > > The script also includes a test with low mem_cg limit and LARGE > sgx_epc > limit to verify that the RAM used for per-cgroup reclamation is > charged > to a proper mem_cg. > > [1] https://github.com/libcgroup/libcgroup/blob/main/README > > Signed-off-by: Haitao Huang > --- > V7: > - Added memcontrol test. > > V5: > - Added script with automatic results checking, remove the > interactive > script. > - The script can run independent from the series below. > --- >  .../selftests/sgx/run_epc_cg_selftests.sh | 246 > ++ >  .../selftests/sgx/watch_misc_for_tests.sh | 13 + >  2 files changed, 259 insertions(+) >  create mode 100755 > tools/testing/selftests/sgx/run_epc_cg_selftests.sh >  create mode 100755 > tools/testing/selftests/sgx/watch_misc_for_tests.sh > > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > new file mode 100755 > index ..e027bf39f005 > --- /dev/null > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > @@ -0,0 +1,246 @@ > +#!/bin/bash This is not portable and neither does hold in the wild. It does not even often hold as it is not uncommon to place bash to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has a path that is neither of those two. Should be #!/usr/bin/env bash That is POSIX compatible form. Just got around trying to test this in NUC7 so looking into this in more detail. That said can you make the script work with just "#!/usr/bin/env sh" and make sure that it is busybox ash compatible? I don't see any necessity to make this bash only and it adds to the compilation time of the image. Otherwise lot of this could be tested just with qemu+bzImage+busybox(inside initramfs). Now you are adding fully glibc shenanigans for the sake of syntax sugar. > +# SPDX-License-Identifier: GPL-2.0 > +# Copyright(c) 2023 Intel Corporation. > + > +TEST_ROOT_CG=selftest > +cgcreate -g misc:$TEST_ROOT_CG How do you know that cgcreate exists? It is used a lot in the script with no check for the existence. Please fix e.g. with "command -v cgreate". > +if [ $? -ne 0 ]; then > +   echo "# Please make sure cgroup-tools is installed, and misc > cgroup is mounted." > +   exit 1 > +fi And please do not do it this way. Also, please remove the advice for "cgroups-tool". This is not meant to be debian only. Better would be to e.g. point out the URL of the upstream project. And yeah the whole message should be based on "command -v", not like this. > +TEST_CG_SUB1=$TEST_ROOT_CG/test1 > +TEST_CG_SUB2=$TEST_ROOT_CG/test2 > +# We will only set limit in test1 and run tests in test3 > +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3 > +TEST_CG_SUB4=$TEST_ROOT_CG/test4 > + > +cgcreate -g misc:$TEST_CG_SUB1 > +cgcreate -g misc:$TEST_CG_SUB2 > +cgcreate -g misc:$TEST_CG_SUB3 > +cgcreate -g misc:$TEST_CG_SUB4 > + > +# Default to V2 > +CG_MISC_ROOT=/sys/fs/cgroup > +CG_MEM_ROOT=/sys/fs/cgroup > +CG_V1=0 > +if [ ! -d "/sys/fs/cgroup/misc" ]; then > +   echo "# cgroup V2 is in use." > +else > +   echo "# cgroup V1 is in use." Is "#" prefix a standard for kselftest? I don't know this, thus asking. > +   CG_MISC_ROOT=/sys/fs/cgroup/misc > +   CG_MEM_ROOT=/sys/fs/cgroup/memory > +   CG_V1=1 Have you checked what is the indentation policy for bash scripts inside kernel tree. I don't know what it is. That's why I'm asking. > +fi > + > +CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk > '{print $2}') > +# This is below number of VA pages needed for enclave of capacity > size. So > +# should
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue Mar 26, 2024 at 6:38 PM EET, Mark Rutland wrote: > On Wed, Mar 27, 2024 at 12:24:03AM +0900, Masami Hiramatsu wrote: > > On Tue, 26 Mar 2024 14:46:10 + > > Mark Rutland wrote: > > > > > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote: > > > > I think, we'd better to introduce `alloc_execmem()`, > > > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first > > > > > > > > config HAVE_ALLOC_EXECMEM > > > > bool > > > > > > > > config ALLOC_EXECMEM > > > > bool "Executable trampline memory allocation" > > > > depends on MODULES || HAVE_ALLOC_EXECMEM > > > > > > > > And define fallback macro to module_alloc() like this. > > > > > > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM > > > > #define alloc_execmem(size, gfp)module_alloc(size) > > > > #endif > > > > > > Please can we *not* do this? I think this is abstracting at the wrong > > > level (as > > > I mentioned on the prior execmem proposals). > > > > > > Different exectuable allocations can have different requirements. For > > > example, > > > on arm64 modules need to be within 2G of the kernel image, but the > > > kprobes XOL > > > areas can be anywhere in the kernel VA space. > > > > > > Forcing those behind the same interface makes things *harder* for > > > architectures > > > and/or makes the common code more complicated (if that ends up having to > > > track > > > all those different requirements). From my PoV it'd be much better to have > > > separate kprobes_alloc_*() functions for kprobes which an architecture > > > can then > > > choose to implement using a common library if it wants to. > > > > > > I took a look at doing that using the core ifdeffery fixups from Jarkko's > > > v6, > > > and it looks pretty clean to me (and works in testing on arm64): > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules > > > > > > Could we please start with that approach, with kprobe-specific alloc/free > > > code > > > provided by the architecture? > > > > OK, as far as I can read the code, this method also works and neat! > > (and minimum intrusion). I actually found that exposing CONFIG_ALLOC_EXECMEM > > to user does not help, it should be an internal change. So hiding this > > change > > from user is better choice. Then there is no reason to introduce the new > > alloc_execmem, but just expand kprobe_alloc_insn_page() is reasonable. > > > > Mark, can you send this series here, so that others can review/test it? > > I've written up a cover letter and sent that out: > > https://lore.kernel.org/lkml/20240326163624.3253157-1-mark.rutl...@arm.com/ > > Mark. Ya, saw it thanks! BR, Jarkko
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue Mar 26, 2024 at 6:15 PM EET, Calvin Owens wrote: > On Wednesday 03/27 at 00:24 +0900, Masami Hiramatsu wrote: > > On Tue, 26 Mar 2024 14:46:10 + > > Mark Rutland wrote: > > > > > Hi Masami, > > > > > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote: > > > > Hi Jarkko, > > > > > > > > On Sun, 24 Mar 2024 01:29:08 +0200 > > > > Jarkko Sakkinen wrote: > > > > > > > > > Tracing with kprobes while running a monolithic kernel is currently > > > > > impossible due the kernel module allocator dependency. > > > > > > > > > > Address the issue by allowing architectures to implement > > > > > module_alloc() > > > > > and module_memfree() independent of the module subsystem. An arch tree > > > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file. > > > > > > > > > > Realize the feature on RISC-V by separating allocator to > > > > > module_alloc.c > > > > > and implementing module_memfree(). > > > > > > > > Even though, this involves changes in arch-independent part. So it > > > > should > > > > be solved by generic way. Did you checked Calvin's thread? > > > > > > > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/ > > > > > > > > I think, we'd better to introduce `alloc_execmem()`, > > > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first > > > > > > > > config HAVE_ALLOC_EXECMEM > > > > bool > > > > > > > > config ALLOC_EXECMEM > > > > bool "Executable trampline memory allocation" > > > > depends on MODULES || HAVE_ALLOC_EXECMEM > > > > > > > > And define fallback macro to module_alloc() like this. > > > > > > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM > > > > #define alloc_execmem(size, gfp)module_alloc(size) > > > > #endif > > > > > > Please can we *not* do this? I think this is abstracting at the wrong > > > level (as > > > I mentioned on the prior execmem proposals). > > > > > > Different exectuable allocations can have different requirements. For > > > example, > > > on arm64 modules need to be within 2G of the kernel image, but the > > > kprobes XOL > > > areas can be anywhere in the kernel VA space. > > > > > > Forcing those behind the same interface makes things *harder* for > > > architectures > > > and/or makes the common code more complicated (if that ends up having to > > > track > > > all those different requirements). From my PoV it'd be much better to have > > > separate kprobes_alloc_*() functions for kprobes which an architecture > > > can then > > > choose to implement using a common library if it wants to. > > > > > > I took a look at doing that using the core ifdeffery fixups from Jarkko's > > > v6, > > > and it looks pretty clean to me (and works in testing on arm64): > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules > > > > > > Could we please start with that approach, with kprobe-specific alloc/free > > > code > > > provided by the architecture? > > Heh, I also noticed that dead !RWX branch in arm64 patch_map(), I was > about to send a patch to remove it. > > > OK, as far as I can read the code, this method also works and neat! > > (and minimum intrusion). I actually found that exposing CONFIG_ALLOC_EXECMEM > > to user does not help, it should be an internal change. So hiding this > > change > > from user is better choice. Then there is no reason to introduce the new > > alloc_execmem, but just expand kprobe_alloc_insn_page() is reasonable. > > I'm happy with this, it solves the first half of my problem. But I want > eBPF to work in the !MODULES case too. > > I think Mark's approach can work for bpf as well, without needing to > touch module_alloc() at all? So I might be able to drop that first patch > entirely. Yeah, I think we're aligned. Later on, if/when you send the bpf series please also cc me and I might possibly test those patches too. BR, Jarkko
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue Mar 26, 2024 at 5:24 PM EET, Masami Hiramatsu (Google) wrote: > On Tue, 26 Mar 2024 14:46:10 + > Mark Rutland wrote: > > > Hi Masami, > > > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote: > > > Hi Jarkko, > > > > > > On Sun, 24 Mar 2024 01:29:08 +0200 > > > Jarkko Sakkinen wrote: > > > > > > > Tracing with kprobes while running a monolithic kernel is currently > > > > impossible due the kernel module allocator dependency. > > > > > > > > Address the issue by allowing architectures to implement module_alloc() > > > > and module_memfree() independent of the module subsystem. An arch tree > > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file. > > > > > > > > Realize the feature on RISC-V by separating allocator to module_alloc.c > > > > and implementing module_memfree(). > > > > > > Even though, this involves changes in arch-independent part. So it should > > > be solved by generic way. Did you checked Calvin's thread? > > > > > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/ > > > > > > I think, we'd better to introduce `alloc_execmem()`, > > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first > > > > > > config HAVE_ALLOC_EXECMEM > > > bool > > > > > > config ALLOC_EXECMEM > > > bool "Executable trampline memory allocation" > > > depends on MODULES || HAVE_ALLOC_EXECMEM > > > > > > And define fallback macro to module_alloc() like this. > > > > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM > > > #define alloc_execmem(size, gfp) module_alloc(size) > > > #endif > > > > Please can we *not* do this? I think this is abstracting at the wrong level > > (as > > I mentioned on the prior execmem proposals). > > > > Different exectuable allocations can have different requirements. For > > example, > > on arm64 modules need to be within 2G of the kernel image, but the kprobes > > XOL > > areas can be anywhere in the kernel VA space. > > > > Forcing those behind the same interface makes things *harder* for > > architectures > > and/or makes the common code more complicated (if that ends up having to > > track > > all those different requirements). From my PoV it'd be much better to have > > separate kprobes_alloc_*() functions for kprobes which an architecture can > > then > > choose to implement using a common library if it wants to. > > > > I took a look at doing that using the core ifdeffery fixups from Jarkko's > > v6, > > and it looks pretty clean to me (and works in testing on arm64): > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules > > > > Could we please start with that approach, with kprobe-specific alloc/free > > code > > provided by the architecture? > > OK, as far as I can read the code, this method also works and neat! > (and minimum intrusion). I actually found that exposing CONFIG_ALLOC_EXECMEM > to user does not help, it should be an internal change. So hiding this change > from user is better choice. Then there is no reason to introduce the new > alloc_execmem, but just expand kprobe_alloc_insn_page() is reasonable. > > Mark, can you send this series here, so that others can review/test it? I'm totally fine with this but yeah best would be if it could carry the riscv part. Mark, even if you have only possibility to compile test that part I can check that it works. BR, Jarkko
Re: [PATCH v5 1/2] kprobes: textmem API
On Tue Mar 26, 2024 at 5:05 PM EET, Masami Hiramatsu (Google) wrote: > > According to kconfig-language.txt: > > > > "select should be used with care. select will force a symbol to a value > > without visiting the dependencies." > > > > So the problem here lies in KPROBES config entry using select statement > > to pick ALLOC_EXECMEM. It will not take the depends on statement into > > account and thus will allow to select kprobes without any allocator in > > place. > > OK, in that case "depend on" is good. Yeah, did not remember this at all. Only recalled when I started to get linking errors when compiling just the first patch... It's a bit uninituitive twist in kconfig :-) BR, Jarkko
Re: [PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue Mar 26, 2024 at 6:49 PM EET, Jarkko Sakkinen wrote: > On Tue Mar 26, 2024 at 3:57 PM EET, Alexandre Ghiti wrote: > > Hi Jarkko, > > > > On 25/03/2024 22:55, Jarkko Sakkinen wrote: > > > Tacing with kprobes while running a monolithic kernel is currently > > > impossible due the kernel module allocator dependency. > > > > > > Address the issue by implementing textmem API for RISC-V. > > > > > > Link: https://www.sochub.fi # for power on testing new SoC's with a > > > minimal stack > > > Link: > > > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ > > > # continuation > > > Signed-off-by: Jarkko Sakkinen > > > --- > > > v5: > > > - No changes, expect removing alloc_execmem() call which should have > > >been part of the previous patch. > > > v4: > > > - Include linux/execmem.h. > > > v3: > > > - Architecture independent parts have been split to separate patches. > > > - Do not change arch/riscv/kernel/module.c as it is out of scope for > > >this patch set now. > > > v2: > > > - Better late than never right? :-) > > > - Focus only to RISC-V for now to make the patch more digestable. This > > >is the arch where I use the patch on a daily basis to help with QA. > > > - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration. > > > --- > > > arch/riscv/Kconfig | 1 + > > > arch/riscv/kernel/Makefile | 3 +++ > > > arch/riscv/kernel/execmem.c | 22 ++ > > > 3 files changed, 26 insertions(+) > > > create mode 100644 arch/riscv/kernel/execmem.c > > > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig > > > index e3142ce531a0..499512fb17ff 100644 > > > --- a/arch/riscv/Kconfig > > > +++ b/arch/riscv/Kconfig > > > @@ -132,6 +132,7 @@ config RISCV > > > select HAVE_KPROBES if !XIP_KERNEL > > > select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL > > > select HAVE_KRETPROBES if !XIP_KERNEL > > > + select HAVE_ALLOC_EXECMEM if !XIP_KERNEL > > > # https://github.com/ClangBuiltLinux/linux/issues/1881 > > > select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD > > > select HAVE_MOVE_PMD > > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile > > > index 604d6bf7e476..337797f10d3e 100644 > > > --- a/arch/riscv/kernel/Makefile > > > +++ b/arch/riscv/kernel/Makefile > > > @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o > > > > > > obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o > > > obj-$(CONFIG_MODULES) += module.o > > > +ifeq ($(CONFIG_ALLOC_EXECMEM),y) > > > +obj-y+= execmem.o > > > +endif > > > obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o > > > > > > obj-$(CONFIG_CPU_PM)+= suspend_entry.o suspend.o > > > diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c > > > new file mode 100644 > > > index ..3e52522ead32 > > > --- /dev/null > > > +++ b/arch/riscv/kernel/execmem.c > > > @@ -0,0 +1,22 @@ > > > +// SPDX-License-Identifier: GPL-2.0-or-later > > > + > > > +#include > > > +#include > > > +#include > > > +#include > > > + > > > +void *alloc_execmem(unsigned long size, gfp_t /* gfp */) > > Need to have the parameter name here. I guess this could just as well > pass through gfp to vmalloc from the caller as kprobes does call > module_alloc() with GFP_KERNEL set in RISC-V. > > > > +{ > > > + return __vmalloc_node_range(size, 1, MODULES_VADDR, > > > + MODULES_END, GFP_KERNEL, > > > + PAGE_KERNEL, 0, NUMA_NO_NODE, > > > + __builtin_return_address(0)); > > > +} > > > > > > The __vmalloc_node_range() line ^^ must be from an old kernel since we > > added VM_FLUSH_RESET_PERMS in 6.8, see 749b94b08005 ("riscv: Fix > > module_alloc() that did not reset the linear mapping permissions"). > > > > In addition, I guess module_alloc() should now use alloc_execmem() right? > > Ack for the first comment. For the 2nd it is up to arch/ to choose > whether to have shared or separate allocators. > > So if you want I can change it that way but did not want
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue Mar 26, 2024 at 4:46 PM EET, Mark Rutland wrote: > Hi Masami, > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote: > > Hi Jarkko, > > > > On Sun, 24 Mar 2024 01:29:08 +0200 > > Jarkko Sakkinen wrote: > > > > > Tracing with kprobes while running a monolithic kernel is currently > > > impossible due the kernel module allocator dependency. > > > > > > Address the issue by allowing architectures to implement module_alloc() > > > and module_memfree() independent of the module subsystem. An arch tree > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file. > > > > > > Realize the feature on RISC-V by separating allocator to module_alloc.c > > > and implementing module_memfree(). > > > > Even though, this involves changes in arch-independent part. So it should > > be solved by generic way. Did you checked Calvin's thread? > > > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/ > > > > I think, we'd better to introduce `alloc_execmem()`, > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first > > > > config HAVE_ALLOC_EXECMEM > > bool > > > > config ALLOC_EXECMEM > > bool "Executable trampline memory allocation" > > depends on MODULES || HAVE_ALLOC_EXECMEM > > > > And define fallback macro to module_alloc() like this. > > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM > > #define alloc_execmem(size, gfp)module_alloc(size) > > #endif > > Please can we *not* do this? I think this is abstracting at the wrong level > (as > I mentioned on the prior execmem proposals). > > Different exectuable allocations can have different requirements. For example, > on arm64 modules need to be within 2G of the kernel image, but the kprobes XOL > areas can be anywhere in the kernel VA space. > > Forcing those behind the same interface makes things *harder* for > architectures > and/or makes the common code more complicated (if that ends up having to track > all those different requirements). From my PoV it'd be much better to have > separate kprobes_alloc_*() functions for kprobes which an architecture can > then > choose to implement using a common library if it wants to. > > I took a look at doing that using the core ifdeffery fixups from Jarkko's v6, > and it looks pretty clean to me (and works in testing on arm64): > > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules > > Could we please start with that approach, with kprobe-specific alloc/free code > provided by the architecture? How should we move forward? I'm fine with someone picking the pieces of my work as long as also the riscv side is included. Can also continue rotating this, whatever works. > > Thanks, > Mark. BR, Jarkko
Re: [PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue Mar 26, 2024 at 3:57 PM EET, Alexandre Ghiti wrote: > Hi Jarkko, > > On 25/03/2024 22:55, Jarkko Sakkinen wrote: > > Tacing with kprobes while running a monolithic kernel is currently > > impossible due the kernel module allocator dependency. > > > > Address the issue by implementing textmem API for RISC-V. > > > > Link: https://www.sochub.fi # for power on testing new SoC's with a minimal > > stack > > Link: > > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # > > continuation > > Signed-off-by: Jarkko Sakkinen > > --- > > v5: > > - No changes, expect removing alloc_execmem() call which should have > >been part of the previous patch. > > v4: > > - Include linux/execmem.h. > > v3: > > - Architecture independent parts have been split to separate patches. > > - Do not change arch/riscv/kernel/module.c as it is out of scope for > >this patch set now. > > v2: > > - Better late than never right? :-) > > - Focus only to RISC-V for now to make the patch more digestable. This > >is the arch where I use the patch on a daily basis to help with QA. > > - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration. > > --- > > arch/riscv/Kconfig | 1 + > > arch/riscv/kernel/Makefile | 3 +++ > > arch/riscv/kernel/execmem.c | 22 ++ > > 3 files changed, 26 insertions(+) > > create mode 100644 arch/riscv/kernel/execmem.c > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig > > index e3142ce531a0..499512fb17ff 100644 > > --- a/arch/riscv/Kconfig > > +++ b/arch/riscv/Kconfig > > @@ -132,6 +132,7 @@ config RISCV > > select HAVE_KPROBES if !XIP_KERNEL > > select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL > > select HAVE_KRETPROBES if !XIP_KERNEL > > + select HAVE_ALLOC_EXECMEM if !XIP_KERNEL > > # https://github.com/ClangBuiltLinux/linux/issues/1881 > > select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD > > select HAVE_MOVE_PMD > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile > > index 604d6bf7e476..337797f10d3e 100644 > > --- a/arch/riscv/kernel/Makefile > > +++ b/arch/riscv/kernel/Makefile > > @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o > > > > obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o > > obj-$(CONFIG_MODULES) += module.o > > +ifeq ($(CONFIG_ALLOC_EXECMEM),y) > > +obj-y += execmem.o > > +endif > > obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o > > > > obj-$(CONFIG_CPU_PM) += suspend_entry.o suspend.o > > diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c > > new file mode 100644 > > index ..3e52522ead32 > > --- /dev/null > > +++ b/arch/riscv/kernel/execmem.c > > @@ -0,0 +1,22 @@ > > +// SPDX-License-Identifier: GPL-2.0-or-later > > + > > +#include > > +#include > > +#include > > +#include > > + > > +void *alloc_execmem(unsigned long size, gfp_t /* gfp */) Need to have the parameter name here. I guess this could just as well pass through gfp to vmalloc from the caller as kprobes does call module_alloc() with GFP_KERNEL set in RISC-V. > > +{ > > + return __vmalloc_node_range(size, 1, MODULES_VADDR, > > + MODULES_END, GFP_KERNEL, > > + PAGE_KERNEL, 0, NUMA_NO_NODE, > > + __builtin_return_address(0)); > > +} > > > The __vmalloc_node_range() line ^^ must be from an old kernel since we > added VM_FLUSH_RESET_PERMS in 6.8, see 749b94b08005 ("riscv: Fix > module_alloc() that did not reset the linear mapping permissions"). > > In addition, I guess module_alloc() should now use alloc_execmem() right? Ack for the first comment. For the 2nd it is up to arch/ to choose whether to have shared or separate allocators. So if you want I can change it that way but did not want to make the call myself. > > > > + > > +void free_execmem(void *region) > > +{ > > + if (in_interrupt()) > > + pr_warn("In interrupt context: vmalloc may not work.\n"); > > + > > + vfree(region); > > +} > > > I remember Mike Rapoport sent a patchset to introduce an API for > executable memory allocation > (https://lore.kernel.org/linux-mm/20230918072955.2507221-1-r...@kernel.org/), > how does this intersect with your work? I don't know the status of his > patchset though
[PATCH v7 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
Tacing with kprobes while running a monolithic kernel is currently impossible due the kernel module allocator dependency. Address the issue by implementing textmem API for RISC-V. Link: https://www.sochub.fi # for power on testing new SoC's with a minimal stack Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # continuation Signed-off-by: Jarkko Sakkinen --- v5-v7: - No changes. v4: - Include linux/execmem.h. v3: - Architecture independent parts have been split to separate patches. - Do not change arch/riscv/kernel/module.c as it is out of scope for this patch set now. v2: - Better late than never right? :-) - Focus only to RISC-V for now to make the patch more digestable. This is the arch where I use the patch on a daily basis to help with QA. - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration. --- arch/riscv/Kconfig | 1 + arch/riscv/kernel/Makefile | 3 +++ arch/riscv/kernel/execmem.c | 22 ++ 3 files changed, 26 insertions(+) create mode 100644 arch/riscv/kernel/execmem.c diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index e3142ce531a0..499512fb17ff 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -132,6 +132,7 @@ config RISCV select HAVE_KPROBES if !XIP_KERNEL select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL select HAVE_KRETPROBES if !XIP_KERNEL + select HAVE_ALLOC_EXECMEM if !XIP_KERNEL # https://github.com/ClangBuiltLinux/linux/issues/1881 select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD select HAVE_MOVE_PMD diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile index 604d6bf7e476..337797f10d3e 100644 --- a/arch/riscv/kernel/Makefile +++ b/arch/riscv/kernel/Makefile @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o obj-$(CONFIG_MODULES) += module.o +ifeq ($(CONFIG_ALLOC_EXECMEM),y) +obj-y += execmem.o +endif obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o obj-$(CONFIG_CPU_PM) += suspend_entry.o suspend.o diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c new file mode 100644 index ..3e52522ead32 --- /dev/null +++ b/arch/riscv/kernel/execmem.c @@ -0,0 +1,22 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include +#include + +void *alloc_execmem(unsigned long size, gfp_t /* gfp */) +{ + return __vmalloc_node_range(size, 1, MODULES_VADDR, + MODULES_END, GFP_KERNEL, + PAGE_KERNEL, 0, NUMA_NO_NODE, + __builtin_return_address(0)); +} + +void free_execmem(void *region) +{ + if (in_interrupt()) + pr_warn("In interrupt context: vmalloc may not work.\n"); + + vfree(region); +} -- 2.44.0
[PATCH v7 1/2] kprobes: Implement trampoline memory allocator for tracing
Tracing with kprobes while running a monolithic kernel is currently impossible because CONFIG_KPROBES depends on CONFIG_MODULES. Introduce alloc_execmem() and free_execmem() for allocating executable memory. If an arch implements these functions, it can mark this up with the HAVE_ALLOC_EXECMEM kconfig flag. The second new kconfig flag is ALLOC_EXECMEM, which can be selected if either MODULES is selected or HAVE_ALLOC_EXECMEM is support by the arch. If HAVE_ALLOC_EXECMEM is not supported by an arch, module_alloc() and module_memfree() are used as a fallback, thus retaining backwards compatibility to earlier kernel versions. This will allow architecture to enable kprobes traces without requiring to enable module. The support can be implemented with four easy steps: 1. Implement alloc_execmem(). 2. Implement free_execmem(). 3. Edit arch//Makefile. 4. Set HAVE_ALLOC_EXECMEM in arch//Kconfig. Link: https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/ Suggested-by: Masami Hiramatsu Signed-off-by: Jarkko Sakkinen --- v7: - Use "depends on" for ALLOC_EXECMEM instead of "select" - Reduced and narrowed CONFIG_MODULES checks further in kprobes.c. v6: - Use null pointer for notifiers and register the module notifier only if IS_ENABLED(CONFIG_MODULES) is set. - Fixed typo in the commit message and wrote more verbose description of the feature. v5: - alloc_execmem() was missing GFP_KERNEL parameter. The patch set did compile because 2/2 had the fixup (leaked there when rebasing the patch set). v4: - Squashed a couple of unrequired CONFIG_MODULES checks. - See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/ v3: - A new patch added. - For IS_DEFINED() I need advice as I could not really find that many locations where it would be applicable. --- arch/Kconfig| 17 +++- include/linux/execmem.h | 13 + kernel/kprobes.c| 53 ++--- kernel/trace/trace_kprobe.c | 15 +-- 4 files changed, 73 insertions(+), 25 deletions(-) create mode 100644 include/linux/execmem.h diff --git a/arch/Kconfig b/arch/Kconfig index a5af0edd3eb8..5e9735f60f3c 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -52,8 +52,8 @@ config GENERIC_ENTRY config KPROBES bool "Kprobes" - depends on MODULES depends on HAVE_KPROBES + depends on ALLOC_EXECMEM select KALLSYMS select TASKS_RCU if PREEMPTION help @@ -215,6 +215,21 @@ config HAVE_OPTPROBES config HAVE_KPROBES_ON_FTRACE bool +config HAVE_ALLOC_EXECMEM + bool + help + Architectures that select this option are capable of allocating trampoline + executable memory for tracing subsystems, indepedently of the kernel module + subsystem. + +config ALLOC_EXECMEM + bool "Executable (trampoline) memory allocation" + default y + depends on MODULES || HAVE_ALLOC_EXECMEM + help + Select this for executable (trampoline) memory. Can be enabled when either + module allocator or arch-specific allocator is available. + config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE bool help diff --git a/include/linux/execmem.h b/include/linux/execmem.h new file mode 100644 index ..ae2ff151523a --- /dev/null +++ b/include/linux/execmem.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_EXECMEM_H +#define _LINUX_EXECMEM_H + +#ifdef CONFIG_HAVE_ALLOC_EXECMEM +void *alloc_execmem(unsigned long size, gfp_t gfp); +void free_execmem(void *region); +#else +#define alloc_execmem(size, gfp) module_alloc(size) +#define free_execmem(region) module_memfree(region) +#endif + +#endif /* _LINUX_EXECMEM_H */ diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 9d9095e81792..13bef5de315c 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -44,6 +44,7 @@ #include #include #include +#include #define KPROBE_HASH_BITS 6 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS) @@ -113,17 +114,17 @@ enum kprobe_slot_state { void __weak *alloc_insn_page(void) { /* -* Use module_alloc() so this page is within +/- 2GB of where the +* Use alloc_execmem() so this page is within +/- 2GB of where the * kernel image and loaded module images reside. This is required * for most of the architectures. * (e.g. x86-64 needs this to handle the %rip-relative fixups.) */ - return module_alloc(PAGE_SIZE); + return alloc_execmem(PAGE_SIZE, GFP_KERNEL); } static void free_insn_page(void *page) { - module_memfree(page); + free_execmem(page); } struct kprobe_insn_cache kprobe_insn_slots = { @@ -1592,6 +1593,7 @@ static int check_kprobe_address_safe(struct kprobe *p, goto out; } +#ifdef CONFIG_MODULES /* *
Re: [PATCH v5 1/2] kprobes: textmem API
On Tue Mar 26, 2024 at 4:01 AM EET, Jarkko Sakkinen wrote: > On Tue Mar 26, 2024 at 3:31 AM EET, Jarkko Sakkinen wrote: > > > > +#endif /* _LINUX_EXECMEM_H */ > > > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c > > > > index 9d9095e81792..87fd8c14a938 100644 > > > > --- a/kernel/kprobes.c > > > > +++ b/kernel/kprobes.c > > > > @@ -44,6 +44,7 @@ > > > > #include > > > > #include > > > > #include > > > > +#include > > > > > > > > #define KPROBE_HASH_BITS 6 > > > > #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS) > > > > @@ -113,17 +114,17 @@ enum kprobe_slot_state { > > > > void __weak *alloc_insn_page(void) > > > > { > > > > /* > > > > -* Use module_alloc() so this page is within +/- 2GB of where > > > > the > > > > +* Use alloc_execmem() so this page is within +/- 2GB of where > > > > the > > > > * kernel image and loaded module images reside. This is > > > > required > > > > * for most of the architectures. > > > > * (e.g. x86-64 needs this to handle the %rip-relative fixups.) > > > > */ > > > > - return module_alloc(PAGE_SIZE); > > > > + return alloc_execmem(PAGE_SIZE, GFP_KERNEL); > > > > } > > > > > > > > static void free_insn_page(void *page) > > > > { > > > > - module_memfree(page); > > > > + free_execmem(page); > > > > } > > > > > > > > struct kprobe_insn_cache kprobe_insn_slots = { > > > > @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct > > > > kprobe *p, > > > > goto out; > > > > } > > > > > > > > +#ifdef CONFIG_MODULES > > > > > > You don't need this block, because these APIs have dummy functions. > > > > Hmm... I'll verify this tomorrow. > > It depends on having struct module available given "(*probed_mod)->state". > > It is non-existent unless CONFIG_MODULES is set given how things are > flagged in include/linux/module.h. Hey, noticed kconfig issue. According to kconfig-language.txt: "select should be used with care. select will force a symbol to a value without visiting the dependencies." So the problem here lies in KPROBES config entry using select statement to pick ALLOC_EXECMEM. It will not take the depends on statement into account and thus will allow to select kprobes without any allocator in place. So to address this I'd suggest to use depends on statement also for describing relation between KPROBES and ALLOC_EXECMEM. It does not make life worse than before for anyone because even with the current kernel you have to select MODULES before you can move forward with kprobes. BR, Jarkko
Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag
On Tue Mar 26, 2024 at 4:16 AM EET, Rick Edgecombe wrote: > The mm_struct contains a function pointer *get_unmapped_area(), which > is set to either arch_get_unmapped_area() or > arch_get_unmapped_area_topdown() during the initialization of the mm. In which conditions which path is used during the initialization of mm and why is this the case? It is an open claim in the current form. That would be nice to have documented for the sake of being complete description. I have zero doubts of the claim being untrue. BR, Jarkko
Re: [PATCH v5 1/2] kprobes: textmem API
On Tue Mar 26, 2024 at 3:31 AM EET, Jarkko Sakkinen wrote: > > > +#endif /* _LINUX_EXECMEM_H */ > > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c > > > index 9d9095e81792..87fd8c14a938 100644 > > > --- a/kernel/kprobes.c > > > +++ b/kernel/kprobes.c > > > @@ -44,6 +44,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > > > > #define KPROBE_HASH_BITS 6 > > > #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS) > > > @@ -113,17 +114,17 @@ enum kprobe_slot_state { > > > void __weak *alloc_insn_page(void) > > > { > > > /* > > > - * Use module_alloc() so this page is within +/- 2GB of where the > > > + * Use alloc_execmem() so this page is within +/- 2GB of where the > > >* kernel image and loaded module images reside. This is required > > >* for most of the architectures. > > >* (e.g. x86-64 needs this to handle the %rip-relative fixups.) > > >*/ > > > - return module_alloc(PAGE_SIZE); > > > + return alloc_execmem(PAGE_SIZE, GFP_KERNEL); > > > } > > > > > > static void free_insn_page(void *page) > > > { > > > - module_memfree(page); > > > + free_execmem(page); > > > } > > > > > > struct kprobe_insn_cache kprobe_insn_slots = { > > > @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe > > > *p, > > > goto out; > > > } > > > > > > +#ifdef CONFIG_MODULES > > > > You don't need this block, because these APIs have dummy functions. > > Hmm... I'll verify this tomorrow. It depends on having struct module available given "(*probed_mod)->state". It is non-existent unless CONFIG_MODULES is set given how things are flagged in include/linux/module.h. BR, Jarkko
Re: [PATCH v5 1/2] kprobes: textmem API
On Tue Mar 26, 2024 at 2:58 AM EET, Masami Hiramatsu (Google) wrote: > On Mon, 25 Mar 2024 23:55:01 +0200 > Jarkko Sakkinen wrote: > > > Tracing with kprobes while running a monolithic kernel is currently > > impossible because CONFIG_KPROBES depends on CONFIG_MODULES because it uses > > the kernel module allocator. > > > > Introduce alloc_textmem() and free_textmem() for allocating executable > > memory. If an arch implements these functions, it can mark this up with > > the HAVE_ALLOC_EXECMEM kconfig flag. > > > > At first this feature will be used for enabling kprobes without > > modules support for arch/riscv. > > > > Link: > > https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/ > > Suggested-by: Masami Hiramatsu > > Signed-off-by: Jarkko Sakkinen > > --- > > v5: > > - alloc_execmem() was missing GFP_KERNEL parameter. The patch set did > > compile because 2/2 had the fixup (leaked there when rebasing the > > patch set). > > v4: > > - Squashed a couple of unrequired CONFIG_MODULES checks. > > - See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/ > > v3: > > - A new patch added. > > - For IS_DEFINED() I need advice as I could not really find that many > > locations where it would be applicable. > > --- > > arch/Kconfig| 16 +++- > > include/linux/execmem.h | 13 + > > kernel/kprobes.c| 17 ++--- > > kernel/trace/trace_kprobe.c | 8 > > 4 files changed, 50 insertions(+), 4 deletions(-) > > create mode 100644 include/linux/execmem.h > > > > diff --git a/arch/Kconfig b/arch/Kconfig > > index a5af0edd3eb8..33ba68b7168f 100644 > > --- a/arch/Kconfig > > +++ b/arch/Kconfig > > @@ -52,8 +52,8 @@ config GENERIC_ENTRY > > > > config KPROBES > > bool "Kprobes" > > - depends on MODULES > > depends on HAVE_KPROBES > > + select ALLOC_EXECMEM > > select KALLSYMS > > select TASKS_RCU if PREEMPTION > > help > > @@ -215,6 +215,20 @@ config HAVE_OPTPROBES > > config HAVE_KPROBES_ON_FTRACE > > bool > > > > +config HAVE_ALLOC_EXECMEM > > + bool > > + help > > + Architectures that select this option are capable of allocating > > executable > > + memory, which can be used by subsystems but is not dependent of any > > of its > > + clients. > > + > > +config ALLOC_EXECMEM > > + bool "Executable (trampoline) memory allocation" > > + depends on MODULES || HAVE_ALLOC_EXECMEM > > + help > > + Select this for executable (trampoline) memory. Can be enabled when > > either > > + module allocator or arch-specific allocator is available. > > + > > config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE > > bool > > help > > diff --git a/include/linux/execmem.h b/include/linux/execmem.h > > new file mode 100644 > > index ..ae2ff151523a > > --- /dev/null > > +++ b/include/linux/execmem.h > > @@ -0,0 +1,13 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +#ifndef _LINUX_EXECMEM_H > > +#define _LINUX_EXECMEM_H > > + > > +#ifdef CONFIG_HAVE_ALLOC_EXECMEM > > +void *alloc_execmem(unsigned long size, gfp_t gfp); > > +void free_execmem(void *region); > > +#else > > +#define alloc_execmem(size, gfp) module_alloc(size) > > +#define free_execmem(region) module_memfree(region) > > +#endif > > + > > +#endif /* _LINUX_EXECMEM_H */ > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c > > index 9d9095e81792..87fd8c14a938 100644 > > --- a/kernel/kprobes.c > > +++ b/kernel/kprobes.c > > @@ -44,6 +44,7 @@ > > #include > > #include > > #include > > +#include > > > > #define KPROBE_HASH_BITS 6 > > #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS) > > @@ -113,17 +114,17 @@ enum kprobe_slot_state { > > void __weak *alloc_insn_page(void) > > { > > /* > > -* Use module_alloc() so this page is within +/- 2GB of where the > > +* Use alloc_execmem() so this page is within +/- 2GB of where the > > * kernel image and loaded module images reside. This is required > > * for most of the architectures. > > * (e.g. x86-64 needs this to handle the %rip-relative fixups.) > > */ > > - return module_alloc(PAGE_SIZE); > > + return alloc_execmem(PAGE_SIZE, GFP_KERNEL); > >
[PATCH v6 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
Tacing with kprobes while running a monolithic kernel is currently impossible due the kernel module allocator dependency. Address the issue by implementing textmem API for RISC-V. Link: https://www.sochub.fi # for power on testing new SoC's with a minimal stack Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # continuation Signed-off-by: Jarkko Sakkinen --- v4: - Include linux/execmem.h. v3: - Architecture independent parts have been split to separate patches. - Do not change arch/riscv/kernel/module.c as it is out of scope for this patch set now. v2: - Better late than never right? :-) - Focus only to RISC-V for now to make the patch more digestable. This is the arch where I use the patch on a daily basis to help with QA. - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration. --- arch/riscv/Kconfig | 1 + arch/riscv/kernel/Makefile | 3 +++ arch/riscv/kernel/execmem.c | 22 ++ 3 files changed, 26 insertions(+) create mode 100644 arch/riscv/kernel/execmem.c diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index e3142ce531a0..499512fb17ff 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -132,6 +132,7 @@ config RISCV select HAVE_KPROBES if !XIP_KERNEL select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL select HAVE_KRETPROBES if !XIP_KERNEL + select HAVE_ALLOC_EXECMEM if !XIP_KERNEL # https://github.com/ClangBuiltLinux/linux/issues/1881 select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD select HAVE_MOVE_PMD diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile index 604d6bf7e476..337797f10d3e 100644 --- a/arch/riscv/kernel/Makefile +++ b/arch/riscv/kernel/Makefile @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o obj-$(CONFIG_MODULES) += module.o +ifeq ($(CONFIG_ALLOC_EXECMEM),y) +obj-y += execmem.o +endif obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o obj-$(CONFIG_CPU_PM) += suspend_entry.o suspend.o diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c new file mode 100644 index ..3e52522ead32 --- /dev/null +++ b/arch/riscv/kernel/execmem.c @@ -0,0 +1,22 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include +#include + +void *alloc_execmem(unsigned long size, gfp_t /* gfp */) +{ + return __vmalloc_node_range(size, 1, MODULES_VADDR, + MODULES_END, GFP_KERNEL, + PAGE_KERNEL, 0, NUMA_NO_NODE, + __builtin_return_address(0)); +} + +void free_execmem(void *region) +{ + if (in_interrupt()) + pr_warn("In interrupt context: vmalloc may not work.\n"); + + vfree(region); +} -- 2.44.0
[PATCH v6 1/2] kprobes: implemente trampoline memory allocator
Tracing with kprobes while running a monolithic kernel is currently impossible because CONFIG_KPROBES depends on CONFIG_MODULES. Introduce alloc_execmem() and free_execmem() for allocating executable memory. If an arch implements these functions, it can mark this up with the HAVE_ALLOC_EXECMEM kconfig flag. The second new kconfig flag is ALLOC_EXECMEM, which can be selected if either MODULES is selected or HAVE_ALLOC_EXECMEM is support by the arch. If HAVE_ALLOC_EXECMEM is not supported by an arch, module_alloc() and module_memfree() are used as a fallback, thus retaining backwards compatibility to earlier kernel versions. This will allow architecture to enable kprobes traces without requiring to enable module. The support can be implemented with four easy steps: 1. Implement alloc_execmem(). 2. Implement free_execmem(). 3. Edit arch//Makefile. 4. Set HAVE_ALLOC_EXECMEM in arch//Kconfig. Link: https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/ Suggested-by: Masami Hiramatsu Signed-off-by: Jarkko Sakkinen --- v6: - Use null pointer for notifiers and register the module notifier only if IS_ENABLED(CONFIG_MODULES) is set. - Fixed typo in the commit message and wrote more verbose description of the feature. v5: - alloc_execmem() was missing GFP_KERNEL parameter. The patch set did compile because 2/2 had the fixup (leaked there when rebasing the patch set). v4: - Squashed a couple of unrequired CONFIG_MODULES checks. - See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/ v3: - A new patch added. - For IS_DEFINED() I need advice as I could not really find that many locations where it would be applicable. --- arch/Kconfig| 16 +++- include/linux/execmem.h | 13 + kernel/kprobes.c| 19 +++ kernel/trace/trace_kprobe.c | 15 +-- 4 files changed, 56 insertions(+), 7 deletions(-) create mode 100644 include/linux/execmem.h diff --git a/arch/Kconfig b/arch/Kconfig index a5af0edd3eb8..33ba68b7168f 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -52,8 +52,8 @@ config GENERIC_ENTRY config KPROBES bool "Kprobes" - depends on MODULES depends on HAVE_KPROBES + select ALLOC_EXECMEM select KALLSYMS select TASKS_RCU if PREEMPTION help @@ -215,6 +215,20 @@ config HAVE_OPTPROBES config HAVE_KPROBES_ON_FTRACE bool +config HAVE_ALLOC_EXECMEM + bool + help + Architectures that select this option are capable of allocating executable + memory, which can be used by subsystems but is not dependent of any of its + clients. + +config ALLOC_EXECMEM + bool "Executable (trampoline) memory allocation" + depends on MODULES || HAVE_ALLOC_EXECMEM + help + Select this for executable (trampoline) memory. Can be enabled when either + module allocator or arch-specific allocator is available. + config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE bool help diff --git a/include/linux/execmem.h b/include/linux/execmem.h new file mode 100644 index ..ae2ff151523a --- /dev/null +++ b/include/linux/execmem.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_EXECMEM_H +#define _LINUX_EXECMEM_H + +#ifdef CONFIG_HAVE_ALLOC_EXECMEM +void *alloc_execmem(unsigned long size, gfp_t gfp); +void free_execmem(void *region); +#else +#define alloc_execmem(size, gfp) module_alloc(size) +#define free_execmem(region) module_memfree(region) +#endif + +#endif /* _LINUX_EXECMEM_H */ diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 9d9095e81792..9ed07a4bf9e3 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -44,6 +44,7 @@ #include #include #include +#include #define KPROBE_HASH_BITS 6 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS) @@ -113,17 +114,17 @@ enum kprobe_slot_state { void __weak *alloc_insn_page(void) { /* -* Use module_alloc() so this page is within +/- 2GB of where the +* Use alloc_execmem() so this page is within +/- 2GB of where the * kernel image and loaded module images reside. This is required * for most of the architectures. * (e.g. x86-64 needs this to handle the %rip-relative fixups.) */ - return module_alloc(PAGE_SIZE); + return alloc_execmem(PAGE_SIZE, GFP_KERNEL); } static void free_insn_page(void *page) { - module_memfree(page); + free_execmem(page); } struct kprobe_insn_cache kprobe_insn_slots = { @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p, goto out; } +#ifdef CONFIG_MODULES /* Check if 'p' is probing a module. */ *probed_mod = __module_text_address((unsigned long) p->addr); if (*probed_mod) { @@ -1603,6 +1605,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
Re: [PATCH v3 1/2] kprobes: textmem API
On Mon Mar 25, 2024 at 10:37 PM EET, Jarkko Sakkinen wrote: > - if (ret == -ENOENT && !trace_kprobe_module_exist(tk)) { > +#ifdef CONFIG_MODULES > + if (ret == -ENOENT && trace_kprobe_module_exist(tk)) > + ret = 0; > +#endif /* CONFIG_MODULES */ For this we could have #ifndef CONFIG_MODULES #define trace_kprobe_module_exist(tk) false #endif That would clean up at least two locations requiring no changes. Should I go forward this or not? BR, Jarkko
Re: [PATCH v5 1/2] kprobes: textmem API
On Tue Mar 26, 2024 at 1:50 AM EET, Masami Hiramatsu (Google) wrote: > On Tue, 26 Mar 2024 00:09:42 +0200 > "Jarkko Sakkinen" wrote: > > > On Mon Mar 25, 2024 at 11:55 PM EET, Jarkko Sakkinen wrote: > > > +#ifdef CONFIG_MODULES > > > if (register_module_notifier(_kprobe_module_nb)) > > > return -EINVAL; > > > +#endif /* CONFIG_MODULES */ > > > > register_module_notifier() does have "dummy" version but what > > would I pass to it. It makes more mess than it cleans to declare > > also a "dummy" version of trace_kprobe_module_nb. > > That is better than having #ifdef in the function. > > > > > The callback itself has too tight module subsystem bindings so > > that they could be simply flagged with IS_DEFINED() (or correct > > if I'm mistaken, this the conclusion I've ended up with). > > Please try this. > > - > diff --git a/kernel/kprobes.c b/kernel/kprobes.c > index 70dc6179086e..bc98db14927f 100644 > --- a/kernel/kprobes.c > +++ b/kernel/kprobes.c > @@ -2625,6 +2625,7 @@ static void remove_module_kprobe_blacklist(struct > module *mod) > } > } > > +#ifdef CONFIG_MODULES > /* Module notifier call back, checking kprobes on the module */ > static int kprobes_module_callback(struct notifier_block *nb, > unsigned long val, void *data) > @@ -2675,6 +2676,9 @@ static int kprobes_module_callback(struct > notifier_block *nb, > mutex_unlock(_mutex); > return NOTIFY_DONE; > } > +#else > +#define kprobes_module_callback (NULL) > +#endif > > static struct notifier_block kprobe_module_nb = { > .notifier_call = kprobes_module_callback, > @@ -2739,7 +2743,7 @@ static int __init init_kprobes(void) > err = arch_init_kprobes(); > if (!err) > err = register_die_notifier(_exceptions_nb); > - if (!err) > + if (!err && IS_ENABLED(CONFIG_MODULES)) > err = register_module_notifier(_module_nb); > > kprobes_initialized = (err == 0); OK, thanks for the suggestion WFM. I'll give this also a spin with VisionFive2 RISC-V SBC before sending v6. BR, Jarkko
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Mon Mar 25, 2024 at 4:56 AM EET, Masami Hiramatsu (Google) wrote: > Hi Jarkko, > > On Sun, 24 Mar 2024 01:29:08 +0200 > Jarkko Sakkinen wrote: > > > Tracing with kprobes while running a monolithic kernel is currently > > impossible due the kernel module allocator dependency. > > > > Address the issue by allowing architectures to implement module_alloc() > > and module_memfree() independent of the module subsystem. An arch tree > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file. > > > > Realize the feature on RISC-V by separating allocator to module_alloc.c > > and implementing module_memfree(). > > Even though, this involves changes in arch-independent part. So it should > be solved by generic way. Did you checked Calvin's thread? > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/ Nope, has not been in my radar but for sure will check it. I don't mind making this more generic. The point of this version was to put focus on single architecture and do as little as possible how the code works right now so that it is easier to give feedback on direction. > I think, we'd better to introduce `alloc_execmem()`, > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first > > config HAVE_ALLOC_EXECMEM > bool > > config ALLOC_EXECMEM > bool "Executable trampline memory allocation" > depends on MODULES || HAVE_ALLOC_EXECMEM Right so this is logically the same as I have just with ALLOC_EXECMEM added to cover both MODULES and HAVE_ALLOC_EXECMEM (which is essentially the same as HAVE_ALLOC_KPROBES just with a different name). Not at all against this. I think this factor more understandable structuring, just "peer checking" that I understand what I'm reading :-) > And define fallback macro to module_alloc() like this. > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM > #define alloc_execmem(size, gfp) module_alloc(size) > #endif > > Then, introduce a new dependency to kprobes > > config KPROBES > bool "Kprobes" > select ALLOC_EXECMEM OK I think this is good but now I see actually two logical chunks of work becuse this select changes how KPROBES kconfig option works. It has previously required to manually select MODULES first. So I'll "select MODULES" as separate patch to keep all logical changes transparent... > > and update kprobes to use alloc_execmem and remove module related > code from it. > > You also should consider using IS_ENABLED(CONFIG_MODULE) in the code to > avoid using #ifdefs. > > Finally, you can add RISCV implementation patch of HAVE_ALLOC_EXECMEM in the > next patch. OK, I think the suggestions are sane and not that much drift what I have now so works for me. > Thank you, BR, Jarkko
Re: [RFC][PATCH 0/4] Make bpf_jit and kprobes work with CONFIG_MODULES=n
On Wed Mar 6, 2024 at 10:05 PM EET, Calvin Owens wrote: > Hello all, > > This patchset makes it possible to use bpftrace with kprobes on kernels > built without loadable module support. > > On a Raspberry Pi 4b, this saves about 700KB of memory where BPF is > needed but loadable module support is not. These two kernels had > identical configurations, except CONFIG_MODULE was off in the second: > >- Linux version 6.8.0-rc7 >- Memory: 3330672K/4050944K available (16576K kernel code, 2390K rwdata, >- 12364K rodata, 5632K init, 675K bss, 195984K reserved, 524288K > cma-reserved) >+ Linux version 6.8.0-rc7-3-g2af01251ca21 >+ Memory: 3331400K/4050944K available (16512K kernel code, 2384K rwdata, >+ 11728K rodata, 5632K init, 673K bss, 195256K reserved, 524288K > cma-reserved) > > I don't intend to present an exhaustive list of !MODULES usecases, since > I'm sure there are many I'm not aware of. Performance is a common one, > the primary justification being that static text is mapped on hugepages > and module text is not. Security is another, since rootkits are much > harder to implement without modules. > > The first patch is the interesting one: it moves module_alloc() into its > own file with its own Kconfig option, so it can be utilized even when > loadable module support is disabled. I got the idea from an unmerged > patch from a few years ago I found on lkml (see [1/4] for details). I > think this also has value in its own right, since I suspect there are > potential users beyond bpf, hopefully we will hear from some. > > Patches 2-3 are proofs of concept to demonstrate the first patch is > sufficient to achieve my goal (full ebpf functionality without modules). > > Patch 4 adds a new "-n" argument to vmtest.sh to run the BPF selftests > without modules, so the prior three patches can be rigorously tested. > > If something like the first patch were to eventually be merged, the rest > could go through the normal bpf-next process as I clean them up: I've > only based them on Linus' tree and combined them into a series here to > introduce the idea. > > If you prefer to fetch the patches via git: > > [1/4] https://github.com/jcalvinowens/linux.git work/module-alloc > +[2/4]+[3/4] https://github.com/jcalvinowens/linux.git work/nomodule-bpf > +[4/4] https://github.com/jcalvinowens/linux.git testing/nomodule-bpf-ci > > In addition to the automated BPF selftests, I've lightly tested this on > my laptop (x86_64), a Raspberry Pi 4b (arm64), and a Raspberry Pi Zero W > (arm). The other architectures have only been compile tested. > > I didn't want to spam all the arch maintainers with what I expect will > be a discussion mostly about modules and bpf, so I've left them off this > first submission. I will be sure to add them on future submissions of > the first patch. Of course, feedback on the arch bits is welcome here. > > In addition to feedback on the patches themselves, I'm interested in > hearing from anybody else who might find this functionality useful. > > Thanks, > Calvin > > > Calvin Owens (4): > module: mm: Make module_alloc() generally available > bpf: Allow BPF_JIT with CONFIG_MODULES=n > kprobes: Allow kprobes with CONFIG_MODULES=n > selftests/bpf: Support testing the !MODULES case > > arch/Kconfig | 4 +- > arch/arm/kernel/module.c | 35 - > arch/arm/mm/Makefile | 2 + > arch/arm/mm/module_alloc.c| 40 ++ > arch/arm64/kernel/module.c| 127 - > arch/arm64/mm/Makefile| 1 + > arch/arm64/mm/module_alloc.c | 130 ++ > arch/loongarch/kernel/module.c| 6 - > arch/loongarch/mm/Makefile| 2 + > arch/loongarch/mm/module_alloc.c | 10 ++ > arch/mips/kernel/module.c | 10 -- > arch/mips/mm/Makefile | 2 + > arch/mips/mm/module_alloc.c | 13 ++ > arch/nios2/kernel/module.c| 20 --- > arch/nios2/mm/Makefile| 2 + > arch/nios2/mm/module_alloc.c | 22 +++ > arch/parisc/kernel/module.c | 12 -- > arch/parisc/mm/Makefile | 1 + > arch/parisc/mm/module_alloc.c | 15 ++ > arch/powerpc/kernel/module.c | 36 - > arch/powerpc/mm/Makefile | 1 + > arch/powerpc/mm/module_alloc.c| 41 ++ > arch/riscv/kernel/module.c| 11 -- > arch/riscv/mm/Makefile| 1 + > arch/riscv/mm/module_alloc.c | 17 +++ > arch/s390/kernel/module.c | 37 - > arch/s390/mm/Makefile | 1 + > arch/s390/mm/module_alloc.c | 42 ++ > arch/sparc/kernel/module.c
Re: [PATCH v5 1/2] kprobes: textmem API
On Tue Mar 26, 2024 at 12:09 AM EET, Jarkko Sakkinen wrote: > On Mon Mar 25, 2024 at 11:55 PM EET, Jarkko Sakkinen wrote: > > +#ifdef CONFIG_MODULES > > if (register_module_notifier(_kprobe_module_nb)) > > return -EINVAL; > > +#endif /* CONFIG_MODULES */ > > register_module_notifier() does have "dummy" version but what > would I pass to it. It makes more mess than it cleans to declare > also a "dummy" version of trace_kprobe_module_nb. > > The callback itself has too tight module subsystem bindings so > that they could be simply flagged with IS_DEFINED() (or correct > if I'm mistaken, this the conclusion I've ended up with). One way to clean that up would be to create trace_kprobe_module.c and move kernel module specific code over there and then change kernel/trace/Makefile as follows: ifeq ($(CONFIG_PERF_EVENTS),y) obj-y += trace_kprobe.o obj-$(CONFIG_MODULES) += trace_kprobe_module.o endif and define trace_kprobe_module_init() or similar to do all the dance with notifiers etc. This crossed my mind but did not want to do it without feedback. BR, Jarkko
Re: [PATCH v5 1/2] kprobes: textmem API
On Mon Mar 25, 2024 at 11:55 PM EET, Jarkko Sakkinen wrote: > +#ifdef CONFIG_MODULES > if (register_module_notifier(_kprobe_module_nb)) > return -EINVAL; > +#endif /* CONFIG_MODULES */ register_module_notifier() does have "dummy" version but what would I pass to it. It makes more mess than it cleans to declare also a "dummy" version of trace_kprobe_module_nb. The callback itself has too tight module subsystem bindings so that they could be simply flagged with IS_DEFINED() (or correct if I'm mistaken, this the conclusion I've ended up with). BR, Jarkko
Re: [PATCH v5 1/2] kprobes: textmem API
s/textmem/execmem/ (also in long description) will hold sending a new version as not a functional issue, will fix after review. BR, Jarkko
[PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
Tacing with kprobes while running a monolithic kernel is currently impossible due the kernel module allocator dependency. Address the issue by implementing textmem API for RISC-V. Link: https://www.sochub.fi # for power on testing new SoC's with a minimal stack Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # continuation Signed-off-by: Jarkko Sakkinen --- v5: - No changes, expect removing alloc_execmem() call which should have been part of the previous patch. v4: - Include linux/execmem.h. v3: - Architecture independent parts have been split to separate patches. - Do not change arch/riscv/kernel/module.c as it is out of scope for this patch set now. v2: - Better late than never right? :-) - Focus only to RISC-V for now to make the patch more digestable. This is the arch where I use the patch on a daily basis to help with QA. - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration. --- arch/riscv/Kconfig | 1 + arch/riscv/kernel/Makefile | 3 +++ arch/riscv/kernel/execmem.c | 22 ++ 3 files changed, 26 insertions(+) create mode 100644 arch/riscv/kernel/execmem.c diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index e3142ce531a0..499512fb17ff 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -132,6 +132,7 @@ config RISCV select HAVE_KPROBES if !XIP_KERNEL select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL select HAVE_KRETPROBES if !XIP_KERNEL + select HAVE_ALLOC_EXECMEM if !XIP_KERNEL # https://github.com/ClangBuiltLinux/linux/issues/1881 select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD select HAVE_MOVE_PMD diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile index 604d6bf7e476..337797f10d3e 100644 --- a/arch/riscv/kernel/Makefile +++ b/arch/riscv/kernel/Makefile @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o obj-$(CONFIG_MODULES) += module.o +ifeq ($(CONFIG_ALLOC_EXECMEM),y) +obj-y += execmem.o +endif obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o obj-$(CONFIG_CPU_PM) += suspend_entry.o suspend.o diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c new file mode 100644 index ..3e52522ead32 --- /dev/null +++ b/arch/riscv/kernel/execmem.c @@ -0,0 +1,22 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include +#include + +void *alloc_execmem(unsigned long size, gfp_t /* gfp */) +{ + return __vmalloc_node_range(size, 1, MODULES_VADDR, + MODULES_END, GFP_KERNEL, + PAGE_KERNEL, 0, NUMA_NO_NODE, + __builtin_return_address(0)); +} + +void free_execmem(void *region) +{ + if (in_interrupt()) + pr_warn("In interrupt context: vmalloc may not work.\n"); + + vfree(region); +} -- 2.44.0
[PATCH v5 1/2] kprobes: textmem API
Tracing with kprobes while running a monolithic kernel is currently impossible because CONFIG_KPROBES depends on CONFIG_MODULES because it uses the kernel module allocator. Introduce alloc_textmem() and free_textmem() for allocating executable memory. If an arch implements these functions, it can mark this up with the HAVE_ALLOC_EXECMEM kconfig flag. At first this feature will be used for enabling kprobes without modules support for arch/riscv. Link: https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/ Suggested-by: Masami Hiramatsu Signed-off-by: Jarkko Sakkinen --- v5: - alloc_execmem() was missing GFP_KERNEL parameter. The patch set did compile because 2/2 had the fixup (leaked there when rebasing the patch set). v4: - Squashed a couple of unrequired CONFIG_MODULES checks. - See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/ v3: - A new patch added. - For IS_DEFINED() I need advice as I could not really find that many locations where it would be applicable. --- arch/Kconfig| 16 +++- include/linux/execmem.h | 13 + kernel/kprobes.c| 17 ++--- kernel/trace/trace_kprobe.c | 8 4 files changed, 50 insertions(+), 4 deletions(-) create mode 100644 include/linux/execmem.h diff --git a/arch/Kconfig b/arch/Kconfig index a5af0edd3eb8..33ba68b7168f 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -52,8 +52,8 @@ config GENERIC_ENTRY config KPROBES bool "Kprobes" - depends on MODULES depends on HAVE_KPROBES + select ALLOC_EXECMEM select KALLSYMS select TASKS_RCU if PREEMPTION help @@ -215,6 +215,20 @@ config HAVE_OPTPROBES config HAVE_KPROBES_ON_FTRACE bool +config HAVE_ALLOC_EXECMEM + bool + help + Architectures that select this option are capable of allocating executable + memory, which can be used by subsystems but is not dependent of any of its + clients. + +config ALLOC_EXECMEM + bool "Executable (trampoline) memory allocation" + depends on MODULES || HAVE_ALLOC_EXECMEM + help + Select this for executable (trampoline) memory. Can be enabled when either + module allocator or arch-specific allocator is available. + config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE bool help diff --git a/include/linux/execmem.h b/include/linux/execmem.h new file mode 100644 index ..ae2ff151523a --- /dev/null +++ b/include/linux/execmem.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_EXECMEM_H +#define _LINUX_EXECMEM_H + +#ifdef CONFIG_HAVE_ALLOC_EXECMEM +void *alloc_execmem(unsigned long size, gfp_t gfp); +void free_execmem(void *region); +#else +#define alloc_execmem(size, gfp) module_alloc(size) +#define free_execmem(region) module_memfree(region) +#endif + +#endif /* _LINUX_EXECMEM_H */ diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 9d9095e81792..87fd8c14a938 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -44,6 +44,7 @@ #include #include #include +#include #define KPROBE_HASH_BITS 6 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS) @@ -113,17 +114,17 @@ enum kprobe_slot_state { void __weak *alloc_insn_page(void) { /* -* Use module_alloc() so this page is within +/- 2GB of where the +* Use alloc_execmem() so this page is within +/- 2GB of where the * kernel image and loaded module images reside. This is required * for most of the architectures. * (e.g. x86-64 needs this to handle the %rip-relative fixups.) */ - return module_alloc(PAGE_SIZE); + return alloc_execmem(PAGE_SIZE, GFP_KERNEL); } static void free_insn_page(void *page) { - module_memfree(page); + free_execmem(page); } struct kprobe_insn_cache kprobe_insn_slots = { @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p, goto out; } +#ifdef CONFIG_MODULES /* Check if 'p' is probing a module. */ *probed_mod = __module_text_address((unsigned long) p->addr); if (*probed_mod) { @@ -1603,6 +1605,8 @@ static int check_kprobe_address_safe(struct kprobe *p, ret = -ENOENT; } } +#endif + out: preempt_enable(); jump_label_unlock(); @@ -2482,6 +2486,7 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end) return 0; } +#ifdef CONFIG_MODULES /* Remove all symbols in given area from kprobe blacklist */ static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end) { @@ -2499,6 +2504,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long entry) { kprobe_remove_area_blacklist(entry, entry + 1); } +#endif /* CONFIG_MODULES */ int __weak arch_kprobe_get_kallsym(unsigned int
[PATCH v3 1/2] kprobes: textmem API
Tracing with kprobes while running a monolithic kernel is currently impossible because CONFIG_KPROBES depends on CONFIG_MODULES because it uses the kernel module allocator. Introduce alloc_textmem() and free_textmem() for allocating executable memory. If an arch implements these functions, it can mark this up with the HAVE_ALLOC_EXECMEM kconfig flag. At first this feature will be used for enabling kprobes without modules support for arch/riscv. Link: https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/ Suggested-by: Masami Hiramatsu Signed-off-by: Jarkko Sakkinen --- v3: - A new patch added. - For IS_DEFINED() I need advice as I could not really find that many locations where it would be applicable. --- arch/Kconfig| 16 +++- include/linux/execmem.h | 13 + kernel/kprobes.c| 17 ++--- kernel/trace/trace_kprobe.c | 18 -- 4 files changed, 58 insertions(+), 6 deletions(-) create mode 100644 include/linux/execmem.h diff --git a/arch/Kconfig b/arch/Kconfig index a5af0edd3eb8..33ba68b7168f 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -52,8 +52,8 @@ config GENERIC_ENTRY config KPROBES bool "Kprobes" - depends on MODULES depends on HAVE_KPROBES + select ALLOC_EXECMEM select KALLSYMS select TASKS_RCU if PREEMPTION help @@ -215,6 +215,20 @@ config HAVE_OPTPROBES config HAVE_KPROBES_ON_FTRACE bool +config HAVE_ALLOC_EXECMEM + bool + help + Architectures that select this option are capable of allocating executable + memory, which can be used by subsystems but is not dependent of any of its + clients. + +config ALLOC_EXECMEM + bool "Executable (trampoline) memory allocation" + depends on MODULES || HAVE_ALLOC_EXECMEM + help + Select this for executable (trampoline) memory. Can be enabled when either + module allocator or arch-specific allocator is available. + config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE bool help diff --git a/include/linux/execmem.h b/include/linux/execmem.h new file mode 100644 index ..ae2ff151523a --- /dev/null +++ b/include/linux/execmem.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_EXECMEM_H +#define _LINUX_EXECMEM_H + +#ifdef CONFIG_HAVE_ALLOC_EXECMEM +void *alloc_execmem(unsigned long size, gfp_t gfp); +void free_execmem(void *region); +#else +#define alloc_execmem(size, gfp) module_alloc(size) +#define free_execmem(region) module_memfree(region) +#endif + +#endif /* _LINUX_EXECMEM_H */ diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 9d9095e81792..a1a547723c3c 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -44,6 +44,7 @@ #include #include #include +#include #define KPROBE_HASH_BITS 6 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS) @@ -113,17 +114,17 @@ enum kprobe_slot_state { void __weak *alloc_insn_page(void) { /* -* Use module_alloc() so this page is within +/- 2GB of where the +* Use alloc_execmem() so this page is within +/- 2GB of where the * kernel image and loaded module images reside. This is required * for most of the architectures. * (e.g. x86-64 needs this to handle the %rip-relative fixups.) */ - return module_alloc(PAGE_SIZE); + return alloc_execmem(PAGE_SIZE); } static void free_insn_page(void *page) { - module_memfree(page); + free_execmem(page); } struct kprobe_insn_cache kprobe_insn_slots = { @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p, goto out; } +#ifdef CONFIG_MODULES /* Check if 'p' is probing a module. */ *probed_mod = __module_text_address((unsigned long) p->addr); if (*probed_mod) { @@ -1603,6 +1605,8 @@ static int check_kprobe_address_safe(struct kprobe *p, ret = -ENOENT; } } +#endif + out: preempt_enable(); jump_label_unlock(); @@ -2482,6 +2486,7 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end) return 0; } +#ifdef CONFIG_MODULES /* Remove all symbols in given area from kprobe blacklist */ static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end) { @@ -2499,6 +2504,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long entry) { kprobe_remove_area_blacklist(entry, entry + 1); } +#endif /* CONFIG_MODULES */ int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value, char *type, char *sym) @@ -2564,6 +2570,7 @@ static int __init populate_kprobe_blacklist(unsigned long *start, return ret ? : arch_populate_kprobe_blacklist(); } +#ifdef CONFIG_MODULES static void add_module_kprobe_blacklist(struct mod
[PATCH v4 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
Tacing with kprobes while running a monolithic kernel is currently impossible due the kernel module allocator dependency. Address the issue by implementing textmem API for RISC-V. Link: https://www.sochub.fi # for power on testing new SoC's with a minimal stack Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # continuation Signed-off-by: Jarkko Sakkinen --- v4: - Include linux/execmem.h. v3: - Architecture independent parts have been split to separate patches. - Do not change arch/riscv/kernel/module.c as it is out of scope for this patch set now. v2: - Better late than never right? :-) - Focus only to RISC-V for now to make the patch more digestable. This is the arch where I use the patch on a daily basis to help with QA. - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration. --- arch/riscv/Kconfig | 1 + arch/riscv/kernel/Makefile | 3 +++ arch/riscv/kernel/execmem.c | 22 ++ kernel/kprobes.c| 2 +- 4 files changed, 27 insertions(+), 1 deletion(-) create mode 100644 arch/riscv/kernel/execmem.c diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index e3142ce531a0..499512fb17ff 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -132,6 +132,7 @@ config RISCV select HAVE_KPROBES if !XIP_KERNEL select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL select HAVE_KRETPROBES if !XIP_KERNEL + select HAVE_ALLOC_EXECMEM if !XIP_KERNEL # https://github.com/ClangBuiltLinux/linux/issues/1881 select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD select HAVE_MOVE_PMD diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile index 604d6bf7e476..337797f10d3e 100644 --- a/arch/riscv/kernel/Makefile +++ b/arch/riscv/kernel/Makefile @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o obj-$(CONFIG_MODULES) += module.o +ifeq ($(CONFIG_ALLOC_EXECMEM),y) +obj-y += execmem.o +endif obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o obj-$(CONFIG_CPU_PM) += suspend_entry.o suspend.o diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c new file mode 100644 index ..3e52522ead32 --- /dev/null +++ b/arch/riscv/kernel/execmem.c @@ -0,0 +1,22 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include +#include + +void *alloc_execmem(unsigned long size, gfp_t /* gfp */) +{ + return __vmalloc_node_range(size, 1, MODULES_VADDR, + MODULES_END, GFP_KERNEL, + PAGE_KERNEL, 0, NUMA_NO_NODE, + __builtin_return_address(0)); +} + +void free_execmem(void *region) +{ + if (in_interrupt()) + pr_warn("In interrupt context: vmalloc may not work.\n"); + + vfree(region); +} diff --git a/kernel/kprobes.c b/kernel/kprobes.c index a1a547723c3c..87fd8c14a938 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -119,7 +119,7 @@ void __weak *alloc_insn_page(void) * for most of the architectures. * (e.g. x86-64 needs this to handle the %rip-relative fixups.) */ - return alloc_execmem(PAGE_SIZE); + return alloc_execmem(PAGE_SIZE, GFP_KERNEL); } static void free_insn_page(void *page) -- 2.44.0
[PATCH v4 1/2] kprobes: textmem API
Tracing with kprobes while running a monolithic kernel is currently impossible because CONFIG_KPROBES depends on CONFIG_MODULES because it uses the kernel module allocator. Introduce alloc_textmem() and free_textmem() for allocating executable memory. If an arch implements these functions, it can mark this up with the HAVE_ALLOC_EXECMEM kconfig flag. At first this feature will be used for enabling kprobes without modules support for arch/riscv. Link: https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/ Suggested-by: Masami Hiramatsu Signed-off-by: Jarkko Sakkinen --- v4: - Squashed a couple of unrequired CONFIG_MODULES checks. - See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/ v3: - A new patch added. - For IS_DEFINED() I need advice as I could not really find that many locations where it would be applicable. --- arch/Kconfig| 16 +++- include/linux/execmem.h | 13 + kernel/kprobes.c| 17 ++--- kernel/trace/trace_kprobe.c | 8 4 files changed, 50 insertions(+), 4 deletions(-) create mode 100644 include/linux/execmem.h diff --git a/arch/Kconfig b/arch/Kconfig index a5af0edd3eb8..33ba68b7168f 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -52,8 +52,8 @@ config GENERIC_ENTRY config KPROBES bool "Kprobes" - depends on MODULES depends on HAVE_KPROBES + select ALLOC_EXECMEM select KALLSYMS select TASKS_RCU if PREEMPTION help @@ -215,6 +215,20 @@ config HAVE_OPTPROBES config HAVE_KPROBES_ON_FTRACE bool +config HAVE_ALLOC_EXECMEM + bool + help + Architectures that select this option are capable of allocating executable + memory, which can be used by subsystems but is not dependent of any of its + clients. + +config ALLOC_EXECMEM + bool "Executable (trampoline) memory allocation" + depends on MODULES || HAVE_ALLOC_EXECMEM + help + Select this for executable (trampoline) memory. Can be enabled when either + module allocator or arch-specific allocator is available. + config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE bool help diff --git a/include/linux/execmem.h b/include/linux/execmem.h new file mode 100644 index ..ae2ff151523a --- /dev/null +++ b/include/linux/execmem.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_EXECMEM_H +#define _LINUX_EXECMEM_H + +#ifdef CONFIG_HAVE_ALLOC_EXECMEM +void *alloc_execmem(unsigned long size, gfp_t gfp); +void free_execmem(void *region); +#else +#define alloc_execmem(size, gfp) module_alloc(size) +#define free_execmem(region) module_memfree(region) +#endif + +#endif /* _LINUX_EXECMEM_H */ diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 9d9095e81792..a1a547723c3c 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -44,6 +44,7 @@ #include #include #include +#include #define KPROBE_HASH_BITS 6 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS) @@ -113,17 +114,17 @@ enum kprobe_slot_state { void __weak *alloc_insn_page(void) { /* -* Use module_alloc() so this page is within +/- 2GB of where the +* Use alloc_execmem() so this page is within +/- 2GB of where the * kernel image and loaded module images reside. This is required * for most of the architectures. * (e.g. x86-64 needs this to handle the %rip-relative fixups.) */ - return module_alloc(PAGE_SIZE); + return alloc_execmem(PAGE_SIZE); } static void free_insn_page(void *page) { - module_memfree(page); + free_execmem(page); } struct kprobe_insn_cache kprobe_insn_slots = { @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p, goto out; } +#ifdef CONFIG_MODULES /* Check if 'p' is probing a module. */ *probed_mod = __module_text_address((unsigned long) p->addr); if (*probed_mod) { @@ -1603,6 +1605,8 @@ static int check_kprobe_address_safe(struct kprobe *p, ret = -ENOENT; } } +#endif + out: preempt_enable(); jump_label_unlock(); @@ -2482,6 +2486,7 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end) return 0; } +#ifdef CONFIG_MODULES /* Remove all symbols in given area from kprobe blacklist */ static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end) { @@ -2499,6 +2504,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long entry) { kprobe_remove_area_blacklist(entry, entry + 1); } +#endif /* CONFIG_MODULES */ int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value, char *type, char *sym) @@ -2564,6 +2570,7 @@ static int __init populate_kprobe_blacklist(unsigned lon
Re: [PATCH v3 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Mon Mar 25, 2024 at 10:37 PM EET, Jarkko Sakkinen wrote: > Tacing with kprobes while running a monolithic kernel is currently > impossible due the kernel module allocator dependency. > > Address the issue by implementing textmem API for RISC-V. > > Link: https://www.sochub.fi # for power on testing new SoC's with a minimal > stack > Link: > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # > continuation > Signed-off-by: Jarkko Sakkinen I think that for any use case it is best for overall good to realize it like this. I.e. only create patch sets related to the topic that change behavior for arch's that are in your heavy use. For me that mean x86 and RISC-V. That is why I shrinked this from to focus into more narrow scope. For microarch's more alien to one, it is just too easy to make sloppy mistakes, which could cause unwanted harm. E.g. it is for best of arch/sh that someone involved with that microarchitecture does later on the shenanigans. BR, Jarkko
[PATCH v3 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
Tacing with kprobes while running a monolithic kernel is currently impossible due the kernel module allocator dependency. Address the issue by implementing textmem API for RISC-V. Link: https://www.sochub.fi # for power on testing new SoC's with a minimal stack Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # continuation Signed-off-by: Jarkko Sakkinen --- v3: - Architecture independent parts have been split to separate patches. - Do not change arch/riscv/kernel/module.c as it is out of scope for this patch set now. v2: - Better late than never right? :-) - Focus only to RISC-V for now to make the patch more digestable. This is the arch where I use the patch on a daily basis to help with QA. - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration. --- arch/riscv/Kconfig | 1 + arch/riscv/kernel/Makefile | 3 +++ arch/riscv/kernel/execmem.c | 22 ++ kernel/kprobes.c| 2 +- 4 files changed, 27 insertions(+), 1 deletion(-) create mode 100644 arch/riscv/kernel/execmem.c diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index e3142ce531a0..499512fb17ff 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -132,6 +132,7 @@ config RISCV select HAVE_KPROBES if !XIP_KERNEL select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL select HAVE_KRETPROBES if !XIP_KERNEL + select HAVE_ALLOC_EXECMEM if !XIP_KERNEL # https://github.com/ClangBuiltLinux/linux/issues/1881 select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD select HAVE_MOVE_PMD diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile index 604d6bf7e476..337797f10d3e 100644 --- a/arch/riscv/kernel/Makefile +++ b/arch/riscv/kernel/Makefile @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o obj-$(CONFIG_MODULES) += module.o +ifeq ($(CONFIG_ALLOC_EXECMEM),y) +obj-y += execmem.o +endif obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o obj-$(CONFIG_CPU_PM) += suspend_entry.o suspend.o diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c new file mode 100644 index ..4191251476d0 --- /dev/null +++ b/arch/riscv/kernel/execmem.c @@ -0,0 +1,22 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include +#include + +void *alloc_execmem(unsigned long size, gfp_t /* gfp */) +{ + return __vmalloc_node_range(size, 1, MODULES_VADDR, + MODULES_END, GFP_KERNEL, + PAGE_KERNEL, 0, NUMA_NO_NODE, + __builtin_return_address(0)); +} + +void free_execmem(void *region) +{ + if (in_interrupt()) + pr_warn("In interrupt context: vmalloc may not work.\n"); + + vfree(region); +} diff --git a/kernel/kprobes.c b/kernel/kprobes.c index a1a547723c3c..87fd8c14a938 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -119,7 +119,7 @@ void __weak *alloc_insn_page(void) * for most of the architectures. * (e.g. x86-64 needs this to handle the %rip-relative fixups.) */ - return alloc_execmem(PAGE_SIZE); + return alloc_execmem(PAGE_SIZE, GFP_KERNEL); } static void free_insn_page(void *page) -- 2.44.0
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Mon Mar 25, 2024 at 9:11 PM EET, Jarkko Sakkinen wrote: > On Mon Mar 25, 2024 at 8:37 PM EET, Jarkko Sakkinen wrote: > > > You also should consider using IS_ENABLED(CONFIG_MODULE) in the code to > > > avoid using #ifdefs. > > Hmm... I need make a couple of remarks but open for feedback ofc. > > First, trace_kprobe_module_exist depends on find_module() > > Second, there is a notifier callback that heavily binds to the module > subsystem. > > In both cases using IS_ENABLED would emit a lot of compilation errors. Also I think adding 'gfp' makes sense exactly at the point as it has a use case, i.e. two call sites with differing flags. It makes sense but should be IMHO added exactly at that time. Leaving it from my patch set does not do any measurable harm but please correct if I'm missing something. BR, Jarkko
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Mon Mar 25, 2024 at 8:37 PM EET, Jarkko Sakkinen wrote: > > You also should consider using IS_ENABLED(CONFIG_MODULE) in the code to > > avoid using #ifdefs. Hmm... I need make a couple of remarks but open for feedback ofc. First, trace_kprobe_module_exist depends on find_module() Second, there is a notifier callback that heavily binds to the module subsystem. In both cases using IS_ENABLED would emit a lot of compilation errors. BR, Jarkko
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Sun Mar 24, 2024 at 2:37 AM EET, Randy Dunlap wrote: > > > On 3/23/24 16:29, Jarkko Sakkinen wrote: > > +config HAVE_KPROBES_ALLOC > > + bool > > + help > > + Architectures that select this option are capable of allocating memory > > + for kprobes withou the kernel module allocator. > > without Thanks, will fix. BR, Jarkko
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Sun Mar 24, 2024 at 1:29 AM EET, Jarkko Sakkinen wrote: > Tracing with kprobes while running a monolithic kernel is currently > impossible due the kernel module allocator dependency. > > Address the issue by allowing architectures to implement module_alloc() > and module_memfree() independent of the module subsystem. An arch tree > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file. > > Realize the feature on RISC-V by separating allocator to module_alloc.c > and implementing module_memfree(). > > Link: https://www.sochub.fi # for power on testing new SoC's with a minimal > stack > Link: > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # > continuation > Signed-off-by: Jarkko Sakkinen As for testing I tried the kprobes example for boottime tracing dcoumentation: https://www.kernel.org/doc/html/v5.7/trace/boottime-trace.html I.e. ftrace.event { kprobes.vfs_read { probes = "vfs_read $arg1 $arg2" filter = "common_pid < 100" enable } } kernel { console = hvc0 earlycon = sbi trace_options = sym-addr trace_event = "initcall:*" tp_printk dump_on_oops = 2 trace_buf_size = 1M } BR, Jarkko
[PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
Tracing with kprobes while running a monolithic kernel is currently impossible due the kernel module allocator dependency. Address the issue by allowing architectures to implement module_alloc() and module_memfree() independent of the module subsystem. An arch tree can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file. Realize the feature on RISC-V by separating allocator to module_alloc.c and implementing module_memfree(). Link: https://www.sochub.fi # for power on testing new SoC's with a minimal stack Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # continuation Signed-off-by: Jarkko Sakkinen --- v2: - Better late than never right? :-) - Focus only to RISC-V for now to make the patch more digestable. This is the arch where I use the patch on a daily basis to help with QA. - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration. --- arch/Kconfig | 8 +++- arch/riscv/Kconfig | 1 + arch/riscv/kernel/Makefile | 5 + arch/riscv/kernel/module.c | 11 --- arch/riscv/kernel/module_alloc.c | 28 kernel/kprobes.c | 10 ++ kernel/trace/trace_kprobe.c | 18 -- 7 files changed, 67 insertions(+), 14 deletions(-) create mode 100644 arch/riscv/kernel/module_alloc.c diff --git a/arch/Kconfig b/arch/Kconfig index a5af0edd3eb8..c931f1de98a7 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -52,7 +52,7 @@ config GENERIC_ENTRY config KPROBES bool "Kprobes" - depends on MODULES + depends on MODULES || HAVE_KPROBES_ALLOC depends on HAVE_KPROBES select KALLSYMS select TASKS_RCU if PREEMPTION @@ -215,6 +215,12 @@ config HAVE_OPTPROBES config HAVE_KPROBES_ON_FTRACE bool +config HAVE_KPROBES_ALLOC + bool + help + Architectures that select this option are capable of allocating memory + for kprobes withou the kernel module allocator. + config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE bool help diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index e3142ce531a0..4f1b925e83d8 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -132,6 +132,7 @@ config RISCV select HAVE_KPROBES if !XIP_KERNEL select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL select HAVE_KRETPROBES if !XIP_KERNEL + select HAVE_KPROBES_ALLOC if !XIP_KERNEL # https://github.com/ClangBuiltLinux/linux/issues/1881 select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD select HAVE_MOVE_PMD diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile index 604d6bf7e476..46318194bce1 100644 --- a/arch/riscv/kernel/Makefile +++ b/arch/riscv/kernel/Makefile @@ -73,6 +73,11 @@ obj-$(CONFIG_SMP)+= cpu_ops.o obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o obj-$(CONFIG_MODULES) += module.o +ifeq ($(CONFIG_MODULES),y) +obj-y += module_alloc.o +else +obj-$(CONFIG_KPROBES) += module_alloc.o +endif obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o obj-$(CONFIG_CPU_PM) += suspend_entry.o suspend.o diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c index 5e5a82644451..cc324b450f2e 100644 --- a/arch/riscv/kernel/module.c +++ b/arch/riscv/kernel/module.c @@ -905,17 +905,6 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab, return 0; } -#if defined(CONFIG_MMU) && defined(CONFIG_64BIT) -void *module_alloc(unsigned long size) -{ - return __vmalloc_node_range(size, 1, MODULES_VADDR, - MODULES_END, GFP_KERNEL, - PAGE_KERNEL, VM_FLUSH_RESET_PERMS, - NUMA_NO_NODE, - __builtin_return_address(0)); -} -#endif - int module_finalize(const Elf_Ehdr *hdr, const Elf_Shdr *sechdrs, struct module *me) diff --git a/arch/riscv/kernel/module_alloc.c b/arch/riscv/kernel/module_alloc.c new file mode 100644 index ..3d9aa8dbca8a --- /dev/null +++ b/arch/riscv/kernel/module_alloc.c @@ -0,0 +1,28 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2017 Zihao Yu + * Copyright (c) 2024 Jarkko Sakkinen + */ + +#include +#include +#include +#include + +#if defined(CONFIG_MMU) && defined(CONFIG_64BIT) +void *module_alloc(unsigned long size) +{ + return __vmalloc_node_range(size, 1, MODULES_VADDR, + MODULES_END, GFP_KERNEL, + PAGE_KERNEL, 0, NUMA_NO_NODE, + __builtin_return_address(0)); +} + +void module_memfree(void *module_region) +{ + if (in_interrupt()) + pr_warn("In interrupt context: vmalloc may not work.\n"); + + vfree(module_region); +} +
Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()
On Mon Feb 26, 2024 at 11:56 PM EET, Dave Hansen wrote: > On 2/26/24 13:48, Haitao Huang wrote: > > In case of overcomitting, i.e., sum of limits greater than the EPC > > capacity, if one group has a fault, and its usage is not above its own > > limit (try_charge() passes), yet total usage of the system has exceeded > > the capacity, whether we do global reclaim or just reclaim pages in the > > current faulting group. > > I don't see _any_ reason to limit reclaim to the current faulting cgroup. > > >> Last, what is the simplest (least amount of code) thing that the SGX > >> cgroup controller could implement here? > > > > I still think the current approach of doing global reclaim is reasonable > > and simple: try_charge() checks cgroup limit and reclaim within the > > group if needed, then do EPC page allocation, reclaim globally if > > allocation fails due to global usage reaches the capacity. > > > > I'm not sure how not doing global reclaiming in this case would bring > > any benefit. > I tend to agree. > > Kai, I think your examples sound a little bit contrived. Have actual > users expressed a strong intent for doing anything with this series > other than limiting bad actors from eating all the EPC? I'd consider this from the viewpoint is there anything in the user space visible portion of the patch set that would limit tuning the performance later on, if required let's say by a workload that acts sub-optimally. If not, then most of performance related issues can be only identified by actual use of the code. BR, Jarkko
Re: [RFC PATCH] x86/sgx: Remove 'reclaim' boolean parameters
On Mon Feb 19, 2024 at 10:25 PM UTC, Haitao Huang wrote: > On Mon, 19 Feb 2024 14:42:29 -0600, Jarkko Sakkinen > wrote: > > > On Mon Feb 19, 2024 at 3:56 PM UTC, Dave Hansen wrote: > >> On 2/19/24 07:39, Haitao Huang wrote: > >> > Remove all boolean parameters for 'reclaim' from the function > >> > sgx_alloc_epc_page() and its callers by making two versions of each > >> > function. > >> > > >> > Also opportunistically remove non-static declaration of > >> > __sgx_alloc_epc_page() and a typo > >> > > >> > Signed-off-by: Haitao Huang > >> > Suggested-by: Jarkko Sakkinen > >> > --- > >> > arch/x86/kernel/cpu/sgx/encl.c | 56 +-- > >> > arch/x86/kernel/cpu/sgx/encl.h | 6 ++- > >> > arch/x86/kernel/cpu/sgx/ioctl.c | 23 --- > >> > arch/x86/kernel/cpu/sgx/main.c | 68 > >> ++--- > >> > arch/x86/kernel/cpu/sgx/sgx.h | 4 +- > >> > arch/x86/kernel/cpu/sgx/virt.c | 2 +- > >> > 6 files changed, 115 insertions(+), 44 deletions(-) > >> > >> Jarkko, did this turn out how you expected? > >> > >> I think passing around a function pointer to *only* communicate 1 bit of > >> information is a _bit_ overkill here. > >> > >> Simply replacing the bool with: > >> > >> enum sgx_reclaim { > >>SGX_NO_RECLAIM, > >>SGX_DO_RECLAIM > >> }; > >> > >> would do the same thing. Right? > >> > >> Are you sure you want a function pointer for this? > > > > To look this in context I drafted quickly two branches representing > > imaginary next version of the patch set. > > > > I guess this would simpler and totally sufficient approach. > > > > With this approach I'd then change also: > > > > [PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality > > > > And add the enum-parameter already in that patch with just "no reclaim" > > enum. I.e. then 10/15 will add only "do reclaim" and the new > > functionality. > > > > BR, Jarkko > > > > Thanks. My understanding is: > > 1) For this patch, replace the boolean with the enum as Dave suggested. No > two versions of the same functions. And this is a prerequisite for the > cgroup series, positioned before [PATCH v9 04/15] > > 2) For [PATCH v9 04/15], pass a hard coded SGX_NO_RECLAIM to > sgx_epc_cg_try_charge() from sgx_alloc_epc_page(). Yup, this will make the whole patch set also a bit leaner as the API does not change in the middle. > > 3) For [PATCH v9 10/15], remove the hard coded value, pass the reclaim > enum parameter value from sgx_alloc_epc_page() to sgx_epc_cg_try_charge() > and add the reclaim logic. > > I'll send patches soon. But please let me know if I misunderstood. BR, Jarkko
Re: [RFC PATCH] x86/sgx: Remove 'reclaim' boolean parameters
On Mon Feb 19, 2024 at 3:56 PM UTC, Dave Hansen wrote: > On 2/19/24 07:39, Haitao Huang wrote: > > Remove all boolean parameters for 'reclaim' from the function > > sgx_alloc_epc_page() and its callers by making two versions of each > > function. > > > > Also opportunistically remove non-static declaration of > > __sgx_alloc_epc_page() and a typo > > > > Signed-off-by: Haitao Huang > > Suggested-by: Jarkko Sakkinen > > --- > > arch/x86/kernel/cpu/sgx/encl.c | 56 +-- > > arch/x86/kernel/cpu/sgx/encl.h | 6 ++- > > arch/x86/kernel/cpu/sgx/ioctl.c | 23 --- > > arch/x86/kernel/cpu/sgx/main.c | 68 ++--- > > arch/x86/kernel/cpu/sgx/sgx.h | 4 +- > > arch/x86/kernel/cpu/sgx/virt.c | 2 +- > > 6 files changed, 115 insertions(+), 44 deletions(-) > > Jarkko, did this turn out how you expected? > > I think passing around a function pointer to *only* communicate 1 bit of > information is a _bit_ overkill here. > > Simply replacing the bool with: > > enum sgx_reclaim { > SGX_NO_RECLAIM, > SGX_DO_RECLAIM > }; > > would do the same thing. Right? > > Are you sure you want a function pointer for this? To look this in context I drafted quickly two branches representing imaginary next version of the patch set. I guess this would simpler and totally sufficient approach. With this approach I'd then change also: [PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality And add the enum-parameter already in that patch with just "no reclaim" enum. I.e. then 10/15 will add only "do reclaim" and the new functionality. BR, Jarkko
Re: [RFC PATCH] x86/sgx: Remove 'reclaim' boolean parameters
On Mon Feb 19, 2024 at 3:39 PM UTC, Haitao Huang wrote: > Remove all boolean parameters for 'reclaim' from the function > sgx_alloc_epc_page() and its callers by making two versions of each > function. > > Also opportunistically remove non-static declaration of > __sgx_alloc_epc_page() and a typo > > Signed-off-by: Haitao Huang > Suggested-by: Jarkko Sakkinen I think this is for better. My view point for kernel patches overally is that: 1. A feature should leave the subsystem in cleaner state as far as the existing framework of doing things goes. 2. A bugfix can sometimes do the opposite if corner case requires some weird dance to perform. BR, Jarkko
Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()
On Mon Feb 19, 2024 at 3:12 PM UTC, Haitao Huang wrote: > On Tue, 13 Feb 2024 19:52:25 -0600, Jarkko Sakkinen > wrote: > > > On Tue Feb 13, 2024 at 1:15 AM EET, Haitao Huang wrote: > >> Hi Jarkko > >> > >> On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen > >> wrote: > >> > >> > On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote: > >> >> From: Kristen Carlson Accardi > >> >> > >> >> When the EPC usage of a cgroup is near its limit, the cgroup needs to > >> >> reclaim pages used in the same cgroup to make room for new > >> allocations. > >> >> This is analogous to the behavior that the global reclaimer is > >> triggered > >> >> when the global usage is close to total available EPC. > >> >> > >> >> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate > >> >> whether synchronous reclaim is allowed or not. And trigger the > >> >> synchronous/asynchronous reclamation flow accordingly. > >> >> > >> >> Note at this point, all reclaimable EPC pages are still tracked in > >> the > >> >> global LRU and per-cgroup LRUs are empty. So no per-cgroup > >> reclamation > >> >> is activated yet. > >> >> > >> >> Co-developed-by: Sean Christopherson > >> > >> >> Signed-off-by: Sean Christopherson > >> >> Signed-off-by: Kristen Carlson Accardi > >> >> Co-developed-by: Haitao Huang > >> >> Signed-off-by: Haitao Huang > >> >> --- > >> >> V7: > >> >> - Split this out from the big patch, #10 in V6. (Dave, Kai) > >> >> --- > >> >> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 -- > >> >> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++-- > >> >> arch/x86/kernel/cpu/sgx/main.c | 2 +- > >> >> 3 files changed, 27 insertions(+), 5 deletions(-) > >> >> > >> >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> >> index d399fda2b55e..abf74fdb12b4 100644 > >> >> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> >> @@ -184,13 +184,35 @@ static void > >> >> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work) > >> >> /** > >> >> * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single > >> EPC > >> >> page > >> >> * @epc_cg:The EPC cgroup to be charged for the page. > >> >> + * @reclaim: Whether or not synchronous reclaim is allowed > >> >> * Return: > >> >> * * %0 - If successfully charged. > >> >> * * -errno - for failures. > >> >> */ > >> >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) > >> >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool > >> >> reclaim) > >> >> { > >> >> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, > >> PAGE_SIZE); > >> >> + for (;;) { > >> >> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, > >> >> + PAGE_SIZE)) > >> >> + break; > >> >> + > >> >> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) > >> >> + return -ENOMEM; > >> >> + + if (signal_pending(current)) > >> >> + return -ERESTARTSYS; > >> >> + > >> >> + if (!reclaim) { > >> >> + queue_work(sgx_epc_cg_wq, > >> >> _cg->reclaim_work); > >> >> + return -EBUSY; > >> >> + } > >> >> + > >> >> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) > >> >> + /* All pages were too young to reclaim, try > >> >> again a little later > >> */ > >> >> + schedule(); > >> > > >> > This will be total pain to backtrack after a while when something > >> > needs to be changed so there definitely should be inlin
Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()
On Tue Feb 13, 2024 at 1:15 AM EET, Haitao Huang wrote: > Hi Jarkko > > On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen > wrote: > > > On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote: > >> From: Kristen Carlson Accardi > >> > >> When the EPC usage of a cgroup is near its limit, the cgroup needs to > >> reclaim pages used in the same cgroup to make room for new allocations. > >> This is analogous to the behavior that the global reclaimer is triggered > >> when the global usage is close to total available EPC. > >> > >> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate > >> whether synchronous reclaim is allowed or not. And trigger the > >> synchronous/asynchronous reclamation flow accordingly. > >> > >> Note at this point, all reclaimable EPC pages are still tracked in the > >> global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation > >> is activated yet. > >> > >> Co-developed-by: Sean Christopherson > >> Signed-off-by: Sean Christopherson > >> Signed-off-by: Kristen Carlson Accardi > >> Co-developed-by: Haitao Huang > >> Signed-off-by: Haitao Huang > >> --- > >> V7: > >> - Split this out from the big patch, #10 in V6. (Dave, Kai) > >> --- > >> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 -- > >> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++-- > >> arch/x86/kernel/cpu/sgx/main.c | 2 +- > >> 3 files changed, 27 insertions(+), 5 deletions(-) > >> > >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> index d399fda2b55e..abf74fdb12b4 100644 > >> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > >> @@ -184,13 +184,35 @@ static void > >> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work) > >> /** > >> * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC > >> page > >> * @epc_cg: The EPC cgroup to be charged for the page. > >> + * @reclaim: Whether or not synchronous reclaim is allowed > >> * Return: > >> * * %0 - If successfully charged. > >> * * -errno - for failures. > >> */ > >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) > >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool > >> reclaim) > >> { > >> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); > >> + for (;;) { > >> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, > >> + PAGE_SIZE)) > >> + break; > >> + > >> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) > >> + return -ENOMEM; > >> + +if (signal_pending(current)) > >> + return -ERESTARTSYS; > >> + > >> + if (!reclaim) { > >> + queue_work(sgx_epc_cg_wq, _cg->reclaim_work); > >> + return -EBUSY; > >> + } > >> + > >> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) > >> + /* All pages were too young to reclaim, try again a > >> little later */ > >> + schedule(); > > > > This will be total pain to backtrack after a while when something > > needs to be changed so there definitely should be inline comments > > addressing each branch condition. > > > > I'd rethink this as: > > > > 1. Create static __sgx_epc_cgroup_try_charge() for addressing single > >iteration with the new "reclaim" parameter. > > 2. Add a new sgx_epc_group_try_charge_reclaim() function. > > > > There's a bit of redundancy with sgx_epc_cgroup_try_charge() and > > sgx_epc_cgroup_try_charge_reclaim() because both have almost the > > same loop calling internal __sgx_epc_cgroup_try_charge() with > > different parameters. That is totally acceptable. > > > > Please also add my suggested-by. > > > > BR, Jarkko > > > > BR, Jarkko > > > For #2: > The only caller of this function, sgx_alloc_epc_page(), has the same > boolean which is passed into this this function. I know. This would be good opportunity to fix that up. Large patch sets should try to make the space for its feature best possible and thus also clean up the code base overally. > If we separate it into sgx_epc_cgroup_try_charge() and > sgx_epc_cgroup_try_charge_reclaim(), then the caller has to have the > if/else branches. So separation here seems not help? Of course it does. It makes the code in that location self-documenting and easier to remember what it does. BR, Jarkko
Re: [PATCH v2] x86/sgx: fix kernel-doc comment misuse
On Sun Feb 11, 2024 at 8:24 AM EET, Randy Dunlap wrote: > Don't use "/**" for a non-kernel-doc comment. This prevents a warning > from scripts/kernel-doc: > > main.c:740: warning: expecting prototype for A section metric is concatenated > in a way that @low bits 12(). Prototype was for sgx_calc_section_metric() > instead > > Cc: Jarkko Sakkinen > Cc: Dave Hansen > Cc: linux-...@vger.kernel.org > Cc: x...@kernel.org > Reviewed-by: Kai Huang > Signed-off-by: Randy Dunlap > --- > v2: add Rev-by: Kai Huang > > arch/x86/kernel/cpu/sgx/main.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff -- a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c > --- a/arch/x86/kernel/cpu/sgx/main.c > +++ b/arch/x86/kernel/cpu/sgx/main.c > @@ -731,7 +731,7 @@ out: > return 0; > } > > -/** > +/* > * A section metric is concatenated in a way that @low bits 12-31 define the > * bits 12-31 of the metric and @high bits 0-19 define the bits 32-51 of the > * metric. Reviewed-by: Jarkko Sakkinen BR, Jarkko