from:"Jarkko Sakkinen"

Re: [RFC PATCH v3 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup

2024-05-16 Thread Jarkko Sakkinen

On Thu May 16, 2024 at 1:29 AM EEST, Haitao Huang wrote:
> On Wed, 15 May 2024 16:55:59 -0500, Haitao Huang  
>  wrote:
>
> > On Wed, 15 May 2024 01:55:21 -0500, Bojun Zhu  
> >  wrote:
> >
> >> EDMM's ioctl()s support batch operations, which may be
> >> time-consuming. Try to explicitly give up the CPU as the prefix
> >> operation at the every begin of "for loop" in
> >> sgx_enclave_{ modify_types | restrict_permissions | remove_pages}
> >> to give other tasks a chance to run, and avoid softlockup warning.
> >>
> >> Additionally perform pending signals check as the prefix operation,
> >> and introduce sgx_check_signal_and_resched(),
> >> which wraps all the checks.
> >>
> >> The following has been observed on Linux v6.9-rc5 with kernel
> >> preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel
> >> is requested to restrict page permissions of a large number of EPC  
> >> pages.
> >>
> >> [ cut here ]
> >> watchdog: BUG: soft lockup - CPU#45 stuck for 22s!
> >> ...
> >> RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0
> >> ...
> >> Call Trace:
> >>  sgx_ioctl
> >>  __x64_sys_ioctl
> >>  x64_sys_call
> >>  do_syscall_64
> >>  entry_SYSCALL_64_after_hwframe
> >> [ end trace ]
> >>
> >> Signed-off-by: Bojun Zhu 
> >> ---
> >>  arch/x86/kernel/cpu/sgx/ioctl.c | 40 +++--
> >>  1 file changed, 28 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c  
> >> b/arch/x86/kernel/cpu/sgx/ioctl.c
> >> index b65ab214bdf5..6199f483143e 100644
> >> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> >> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> >> @@ -365,6 +365,20 @@ static int sgx_validate_offset_length(struct  
> >> sgx_encl *encl,
> >>return 0;
> >>  }
> >> +/*
> >> + * Check signals and invoke scheduler. Return true for a pending  
> >> signal.
> >> + */
> >> +static bool sgx_check_signal_and_resched(void)
> >> +{
> >> +  if (signal_pending(current))
> >> +  return true;
> >> +
> >> +  if (need_resched())
> >> +  cond_resched();
> >> +
> >> +  return false;
> >> +}
> >> +
> >>  /**
> >>   * sgx_ioc_enclave_add_pages() - The handler for  
> >> %SGX_IOC_ENCLAVE_ADD_PAGES
> >>   * @encl:   an enclave pointer
> >> @@ -409,7 +423,7 @@ static long sgx_ioc_enclave_add_pages(struct  
> >> sgx_encl *encl, void __user *arg)
> >>struct sgx_enclave_add_pages add_arg;
> >>struct sgx_secinfo secinfo;
> >>unsigned long c;
> >> -  int ret;
> >> +  int ret = -ERESTARTSYS;
> >>if (!test_bit(SGX_ENCL_CREATED, >flags) ||
> >>test_bit(SGX_ENCL_INITIALIZED, >flags))
> >> @@ -432,15 +446,8 @@ static long sgx_ioc_enclave_add_pages(struct  
> >> sgx_encl *encl, void __user *arg)
> >>return -EINVAL;
> >>for (c = 0 ; c < add_arg.length; c += PAGE_SIZE) {
> >> -  if (signal_pending(current)) {
> >> -  if (!c)
> >> -  ret = -ERESTARTSYS;
> >> -
> >> +  if (sgx_check_signal_and_resched())
> >>break;
> >> -  }
> >
> > ERESTARTSYS is only appropriate if we have not EADDed any pages yet.
> > If we got interrupted in the middle, we should return 0. User space  
> > would check the 'count' returned and decide to recall this ioctl() with  
> > 'offset'  reset to the next page, and adjust length.
>
> NVM, I misread it. ret will be changed to zero in subsequent iteration.
>
> Reviewed-by: Haitao Huang 

Duh, and I responded too quickly. OK, I revisited the original
patch and yes ret gets reseted. Ignore my previous response ;-)

My tags still hold, sorry.

BR, Jarkko

Re: [RFC PATCH v3 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup

2024-05-16 Thread Jarkko Sakkinen

On Thu May 16, 2024 at 12:55 AM EEST, Haitao Huang wrote:
> On Wed, 15 May 2024 01:55:21 -0500, Bojun Zhu   
> wrote:
>
> > EDMM's ioctl()s support batch operations, which may be
> > time-consuming. Try to explicitly give up the CPU as the prefix
> > operation at the every begin of "for loop" in
> > sgx_enclave_{ modify_types | restrict_permissions | remove_pages}
> > to give other tasks a chance to run, and avoid softlockup warning.
> >
> > Additionally perform pending signals check as the prefix operation,
> > and introduce sgx_check_signal_and_resched(),
> > which wraps all the checks.
> >
> > The following has been observed on Linux v6.9-rc5 with kernel
> > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel
> > is requested to restrict page permissions of a large number of EPC pages.
> >
> > [ cut here ]
> > watchdog: BUG: soft lockup - CPU#45 stuck for 22s!
> > ...
> > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0
> > ...
> > Call Trace:
> >  sgx_ioctl
> >  __x64_sys_ioctl
> >  x64_sys_call
> >  do_syscall_64
> >  entry_SYSCALL_64_after_hwframe
> > [ end trace ]
> >
> > Signed-off-by: Bojun Zhu 
> > ---
> >  arch/x86/kernel/cpu/sgx/ioctl.c | 40 +++--
> >  1 file changed, 28 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c  
> > b/arch/x86/kernel/cpu/sgx/ioctl.c
> > index b65ab214bdf5..6199f483143e 100644
> > --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> > @@ -365,6 +365,20 @@ static int sgx_validate_offset_length(struct  
> > sgx_encl *encl,
> > return 0;
> >  }
> > +/*
> > + * Check signals and invoke scheduler. Return true for a pending signal.
> > + */
> > +static bool sgx_check_signal_and_resched(void)
> > +{
> > +   if (signal_pending(current))
> > +   return true;
> > +
> > +   if (need_resched())
> > +   cond_resched();
> > +
> > +   return false;
> > +}
> > +
> >  /**
> >   * sgx_ioc_enclave_add_pages() - The handler for  
> > %SGX_IOC_ENCLAVE_ADD_PAGES
> >   * @encl:   an enclave pointer
> > @@ -409,7 +423,7 @@ static long sgx_ioc_enclave_add_pages(struct  
> > sgx_encl *encl, void __user *arg)
> > struct sgx_enclave_add_pages add_arg;
> > struct sgx_secinfo secinfo;
> > unsigned long c;
> > -   int ret;
> > +   int ret = -ERESTARTSYS;
> > if (!test_bit(SGX_ENCL_CREATED, >flags) ||
> > test_bit(SGX_ENCL_INITIALIZED, >flags))
> > @@ -432,15 +446,8 @@ static long sgx_ioc_enclave_add_pages(struct  
> > sgx_encl *encl, void __user *arg)
> > return -EINVAL;
> > for (c = 0 ; c < add_arg.length; c += PAGE_SIZE) {
> > -   if (signal_pending(current)) {
> > -   if (!c)
> > -   ret = -ERESTARTSYS;
> > -
> > +   if (sgx_check_signal_and_resched())
> > break;
> > -   }
>
> ERESTARTSYS is only appropriate if we have not EADDed any pages yet.
> If we got interrupted in the middle, we should return 0. User space would  
> check the 'count' returned and decide to recall this ioctl() with  
> 'offset'  reset to the next page, and adjust length.

Good catch! Thanks Haitao for the remark.

BR, Jarkko

Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-05-15 Thread Jarkko Sakkinen

> Thank you very much. I understand the changelog is still being discussed
> and those changes look good to me, to which you can add:
>
> Reviewed-by: Reinette Chatre 

also for this (with changelog tweak Dave suggested) so that we don't
need a new round:

Reviewed-by: Jarkko Sakkinen 

BR, Jarkko

Re: [PATCH v2 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

2024-05-15 Thread Jarkko Sakkinen

644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct 
> sgx_encl *encl,
>  
>   /* Entry successfully located. */
>   if (entry->epc_page) {
> - if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> + if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
> +SGX_ENCL_PAGE_BEING_REMOVED))
>   return ERR_PTR(-EBUSY);
>  
>   return entry;
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> index f94ff14c9486..fff5f2293ae7 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -25,6 +25,9 @@
>  /* 'desc' bit marking that the page is being reclaimed. */
>  #define SGX_ENCL_PAGE_BEING_RECLAIMEDBIT(3)
>  
> +/* 'desc' bit marking that the page is being removed. */
> +#define SGX_ENCL_PAGE_BEING_REMOVED  BIT(2)
> +
>  struct sgx_encl_page {
>   unsigned long desc;
>   unsigned long vm_max_prot_bits:8;
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index b65ab214bdf5..c542d4dd3e64 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
>* Do not keep encl->lock because of dependency on
>* mmap_lock acquired in sgx_zap_enclave_ptes().
>*/
> + entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
>   mutex_unlock(>lock);
>  
>   sgx_zap_enclave_ptes(encl, addr);

Makes perfect sense:

Reviewed-by: Jarkko Sakkinen 

BR, Jarkko

Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-05-15 Thread Jarkko Sakkinen

On Wed May 15, 2024 at 5:15 PM EEST, Dave Hansen wrote:
> On 5/15/24 06:54, Jarkko Sakkinen wrote:
> > I'd cut out 90% of the description out and just make the argument of
> > the wrong error code, and done. The sequence is great for showing
> > how this could happen. The prose makes my head hurt tbh.
>
> The changelog is too long, but not fatally so.  I'd much rather have a
> super verbose description than something super sparse.
>
> Would something like this make more sense to folks?
>
>   Imagine an mmap()'d file. Two threads touch the same address at
>   the same time and fault. Both allocate a physical page and race
>   to install a PTE for that page. Only one will win the race. The
>   loser frees its page, but still continues handling the fault as
>   a success and returns VM_FAULT_NOPAGE from the fault handler.
>
>   The same race can happen with SGX. But there's a bug: the loser
>   in the SGX steers into a failure path. The loser EREMOVE's the
>   winner's EPC page, then returns SIGBUS, likely killing the app.
>
>   Fix the SGX loser's behavior. Change the return code to
>   VM_FAULT_NOPAGE to avoid SIGBUS and call sgx_free_epc_page()
>   which avoids EREMOVE'ing the winner's page and only frees the
>   page that the loser allocated.

Yes!

I did read the whole thing. My comment was only related to the
chain of maintainers who also have to deal with this patch
eventually.

BR, Jarkko

Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-05-15 Thread Jarkko Sakkinen

On Wed May 15, 2024 at 4:54 PM EEST, Jarkko Sakkinen wrote:
> On Wed May 15, 2024 at 4:12 PM EEST, Dmitrii Kuvaiskii wrote:
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> > index 279148e72459..41f14b1a3025 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.c
> > +++ b/arch/x86/kernel/cpu/sgx/encl.c
> > @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct 
> > vm_area_struct *vma,
> >  * If ret == -EBUSY then page was created in another flow while
> >  * running without encl->lock
> >  */
> > -   if (ret)
> > +   if (ret) {
> > +   if (ret == -EBUSY)
> > +   vmret = VM_FAULT_NOPAGE;
> > goto err_out_shrink;
> > +   }
>
> I agree that there is a bug but it does not categorize as race
> condition.
>
> The bug is simply that for a valid page SIGBUS might be returned.
> The fix is correct but the claim is not.
>
> >  
> > pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
> > pginfo.addr = encl_page->desc & PAGE_MASK;
> > @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct 
> > vm_area_struct *vma,
> >  err_out_shrink:
> > sgx_encl_shrink(encl, va_page);
> >  err_out_epc:
> > -   sgx_encl_free_epc_page(epc_page);
> > +   sgx_free_epc_page(epc_page);
> >  err_out_unlock:
> > mutex_unlock(>lock);
> > kfree(encl_page);
>
> Agree with code change 100% but not with the description.
>
> I'd cut out 90% of the description out and just make the argument of
> the wrong error code, and done. The sequence is great for showing
> how this could happen. The prose makes my head hurt tbh.

Also please remember that stable maintainers need to read all of that
if this is a bug fix (it is a bug fix!) :-) So shorted possible legit
argument, no prose and the sequence was awesome :-)

BR, Jarkko

Re: [PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-05-15 Thread Jarkko Sakkinen

On Wed May 15, 2024 at 4:12 PM EEST, Dmitrii Kuvaiskii wrote:
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 279148e72459..41f14b1a3025 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct 
> vm_area_struct *vma,
>* If ret == -EBUSY then page was created in another flow while
>* running without encl->lock
>*/
> - if (ret)
> + if (ret) {
> + if (ret == -EBUSY)
> + vmret = VM_FAULT_NOPAGE;
>   goto err_out_shrink;
> + }

I agree that there is a bug but it does not categorize as race
condition.

The bug is simply that for a valid page SIGBUS might be returned.
The fix is correct but the claim is not.

>  
>   pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
>   pginfo.addr = encl_page->desc & PAGE_MASK;
> @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct 
> vm_area_struct *vma,
>  err_out_shrink:
>   sgx_encl_shrink(encl, va_page);
>  err_out_epc:
> - sgx_encl_free_epc_page(epc_page);
> + sgx_free_epc_page(epc_page);
>  err_out_unlock:
>   mutex_unlock(>lock);
>   kfree(encl_page);

Agree with code change 100% but not with the description.

I'd cut out 90% of the description out and just make the argument of
the wrong error code, and done. The sequence is great for showing
how this could happen. The prose makes my head hurt tbh.

BR, Jarkko

Re: [RFC PATCH v3 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup

2024-05-15 Thread Jarkko Sakkinen

On Wed May 15, 2024 at 9:55 AM EEST, Bojun Zhu wrote:
> EDMM's ioctl()s support batch operations, which may be
> time-consuming. Try to explicitly give up the CPU as the prefix
> operation at the every begin of "for loop" in
> sgx_enclave_{ modify_types | restrict_permissions | remove_pages}
> to give other tasks a chance to run, and avoid softlockup warning.
>
> Additionally perform pending signals check as the prefix operation,
> and introduce sgx_check_signal_and_resched(),
> which wraps all the checks.
>
> The following has been observed on Linux v6.9-rc5 with kernel
> preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel
> is requested to restrict page permissions of a large number of EPC pages.
>
> [ cut here ]
> watchdog: BUG: soft lockup - CPU#45 stuck for 22s!
> ...
> RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0
> ...
> Call Trace:
>  sgx_ioctl
>  __x64_sys_ioctl
>  x64_sys_call
>  do_syscall_64
>      entry_SYSCALL_64_after_hwframe
> [ end trace ]
>

Suggested-by: Jarkko Sakkinen 

> Signed-off-by: Bojun Zhu 
> ---
>  arch/x86/kernel/cpu/sgx/ioctl.c | 40 +++--
>  1 file changed, 28 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index b65ab214bdf5..6199f483143e 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -365,6 +365,20 @@ static int sgx_validate_offset_length(struct sgx_encl 
> *encl,
>   return 0;
>  }
>  
> +/*
> + * Check signals and invoke scheduler. Return true for a pending signal.
> + */
> +static bool sgx_check_signal_and_resched(void)
> +{
> + if (signal_pending(current))
> + return true;
> +
> + if (need_resched())
> + cond_resched();
> +
> + return false;
> +}
> +
>  /**
>   * sgx_ioc_enclave_add_pages() - The handler for %SGX_IOC_ENCLAVE_ADD_PAGES
>   * @encl:   an enclave pointer
> @@ -409,7 +423,7 @@ static long sgx_ioc_enclave_add_pages(struct sgx_encl 
> *encl, void __user *arg)
>   struct sgx_enclave_add_pages add_arg;
>   struct sgx_secinfo secinfo;
>   unsigned long c;
> - int ret;
> + int ret = -ERESTARTSYS;
>  
>   if (!test_bit(SGX_ENCL_CREATED, >flags) ||
>   test_bit(SGX_ENCL_INITIALIZED, >flags))
> @@ -432,15 +446,8 @@ static long sgx_ioc_enclave_add_pages(struct sgx_encl 
> *encl, void __user *arg)
>   return -EINVAL;
>  
>   for (c = 0 ; c < add_arg.length; c += PAGE_SIZE) {
> - if (signal_pending(current)) {
> - if (!c)
> - ret = -ERESTARTSYS;
> -
> + if (sgx_check_signal_and_resched())
>   break;
> - }
> -
> - if (need_resched())
> - cond_resched();
>  
>   ret = sgx_encl_add_page(encl, add_arg.src + c, add_arg.offset + 
> c,
>   , add_arg.flags);
> @@ -740,12 +747,15 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl,
>   unsigned long addr;
>   unsigned long c;
>   void *epc_virt;
> - int ret;
> + int ret = -ERESTARTSYS;
>  
>   memset(, 0, sizeof(secinfo));
>   secinfo.flags = modp->permissions & SGX_SECINFO_PERMISSION_MASK;
>  
>   for (c = 0 ; c < modp->length; c += PAGE_SIZE) {
> + if (sgx_check_signal_and_resched())
> + goto out;
> +
>   addr = encl->base + modp->offset + c;
>  
>   sgx_reclaim_direct();
> @@ -898,7 +908,7 @@ static long sgx_enclave_modify_types(struct sgx_encl 
> *encl,
>   unsigned long addr;
>   unsigned long c;
>   void *epc_virt;
> - int ret;
> + int ret = -ERESTARTSYS;
>  
>   page_type = modt->page_type & SGX_PAGE_TYPE_MASK;
>  
> @@ -913,6 +923,9 @@ static long sgx_enclave_modify_types(struct sgx_encl 
> *encl,
>   secinfo.flags = page_type << 8;
>  
>   for (c = 0 ; c < modt->length; c += PAGE_SIZE) {
> + if (sgx_check_signal_and_resched())
> + goto out;
> +
>   addr = encl->base + modt->offset + c;
>  
>   sgx_reclaim_direct();
> @@ -1095,12 +1108,15 @@ static long sgx_encl_remove_pages(struct sgx_encl 
> *encl,
>   unsigned long addr;
>   unsigned long c;
>   void *epc_virt;
> - int ret;
> + int ret = -ERESTARTSYS;
>  
>   memset(, 0, sizeof(secinfo));
>   secinfo.flags = SGX_SECINFO_R | SGX_SECINFO_W | SGX_SECINFO_X;
>  
>   for (c = 0 ; c < params->length; c += PAGE_SIZE) {
> + if (sgx_check_signal_and_resched())
> + goto out;
> +
>   addr = encl->base + params->offset + c;
>  
>   sgx_reclaim_direct();

Reviewed-by: Jarkko Sakkinen 

BR, Jarkko

Re: [PATCH] mm: Remove mm argument from mm_get_unmapped_area()

2024-05-07 Thread Jarkko Sakkinen

+++ b/mm/mmap.c
> @@ -1901,16 +1901,15 @@ __get_unmapped_area(struct file *file, unsigned long 
> addr, unsigned long len,
>   return error ? error : addr;
>  }
>  
> -unsigned long
> -mm_get_unmapped_area(struct mm_struct *mm, struct file *file,
> -  unsigned long addr, unsigned long len,
> -  unsigned long pgoff, unsigned long flags)
> +unsigned long current_get_unmapped_area(struct file *file, unsigned long 
> addr,
> + unsigned long len, unsigned long pgoff,
> + unsigned long flags)
>  {
> - if (test_bit(MMF_TOPDOWN, >flags))
> + if (test_bit(MMF_TOPDOWN, >mm->flags))
>   return arch_get_unmapped_area_topdown(file, addr, len, pgoff, 
> flags);
>   return arch_get_unmapped_area(file, addr, len, pgoff, flags);
>  }
> -EXPORT_SYMBOL(mm_get_unmapped_area);
> +EXPORT_SYMBOL(current_get_unmapped_area);
>  
>  /**
>   * find_vma_intersection() - Look up the first VMA which intersects the 
> interval
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f5d60436b604..c0acd7db93c8 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2276,8 +2276,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
>   if (len > TASK_SIZE)
>   return -ENOMEM;
>  
> - addr = mm_get_unmapped_area(current->mm, file, uaddr, len, pgoff,
> - flags);
> + addr = current_get_unmapped_area(file, uaddr, len, pgoff, flags);
>  
>   if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
>   return addr;
> @@ -2334,8 +2333,8 @@ unsigned long shmem_get_unmapped_area(struct file *file,
>   if (inflated_len < len)
>   return addr;
>  
> - inflated_addr = mm_get_unmapped_area(current->mm, NULL, uaddr,
> -  inflated_len, 0, flags);
> + inflated_addr = current_get_unmapped_area(NULL, uaddr,
> +   inflated_len, 0, flags);
>   if (IS_ERR_VALUE(inflated_addr))
>   return addr;
>   if (inflated_addr & ~PAGE_MASK)
> @@ -4799,7 +4798,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
> unsigned long addr, unsigned long len,
> unsigned long pgoff, unsigned long flags)
>  {
> - return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
> + return current_get_unmapped_area(file, addr, len, pgoff, flags);
>  }
>  #endif
>  
>
> base-commit: 9221b2819b8a4196eecf5476d66201be60fbcf29

Reviewed-by: Jarkko Sakkinen 

BR, Jarkko

Re: [PATCH v13 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-05-02 Thread Jarkko Sakkinen

On Tue Apr 30, 2024 at 10:51 PM EEST, Haitao Huang wrote:
> With different cgroups, the script starts one or multiple concurrent SGX
> selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
> test case, which loads an enclave of EPC size equal to the EPC capacity
> available on the platform. The script checks results against the
> expectation set for each cgroup and reports success or failure.
>
> The script creates 3 different cgroups at the beginning with following
> expectations:
>
> 1) small - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) large - large enough to run up to 4 concurrent tests but fail some if
> more than 4 concurrent tests are run. The script starts 4 expecting at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) larger - limit is the same as the capacity, large enough to run lots of
> concurrent tests. The script starts 8 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all processes exit.
>
> The script also includes a test with low mem_cg limit and large sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is charged
> to a proper mem_cg. For this test, it turns off swapping before start,
> and turns swapping back on afterwards.
>
> Add README to document how to run the tests.
>
> Signed-off-by: Haitao Huang 

Here's the transcript:

make: Entering directory '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx'
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c main.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c load.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c sigstruct.c -o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c call.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c sign_key.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_sgx 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o -z noexecstack 
-lcrypto
gcc -Wall -Werror -static-pie -nostdlib -ffreestanding -fPIE 
-fno-stack-protector -mrdrnd 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
test_encl.c test_encl_bootstrap.S -o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_encl.elf 
-Wl,-T,test_encl.lds,--build-id=none
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: 
warning: /tmp/ccToNCLw.o: missing .note.GNU-stack section implies executable 
stack
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: NOTE: 
This behaviour is deprecated and will be removed in a future version of the 
linker
TAP version 13
1..2
# timeout set to 120
# selftests: sgx: test_sgx
# TAP version 13
# 1..16
# # Starting 16 tests from 1 test cases.
# #  RUN   enclave.unclobbered_vdso ...
# #OK  enclave.unclobbered_vdso
# ok 1 enclave.unclobbered_vdso
# #  RUN   enclave.unclobbered_vdso_oversubscribed ...
# #OK  enclave.unclobbered_vdso_oversubscribed
# ok 2 enclave.unclobbered_vdso_oversubscribed
# #  RUN   enclave.unclobbered_vdso_oversubscribed_remove ...
# # main.c:402:unclobbered_vdso_oversubscribed_remove:Creating an enclave with 
98566144 bytes heap may take a while ...
# # main.c:457:unclobbered_vdso_oversubscribed_remove:Changing type of 98566144 
bytes to trimmed may take a while ...
# # main.c:473:unclobbered_vdso_oversubscribed_remove:Entering enclave to run 
EACCEPT for each page of 98566144 bytes may take a while ...
# # main.c:494:unclobbered_vdso_oversubscribed_remove:Removing 98566144 bytes 
from enclave may take a while ...
# #OK  enclave.unclobbered_vdso_oversubscribed_remove
# ok 3 enclave.unclobbered_vdso_oversubscribed_remove
# #  RUN   enclave.clobbered_vdso ...
# #OK  enclave.clobbered_vdso
# ok 4 enclave.clobbered_vdso
# #  RUN   enclave.clobbered_vdso_and_user_function ...
# #OK  enclave.clobbered_vdso_and_user_function
# ok 5

Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-29 Thread Jarkko Sakkinen

On Mon Apr 29, 2024 at 7:18 PM EEST, Haitao Huang wrote:
> Hi Jarkko
>
> On Sun, 28 Apr 2024 17:03:17 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Fri Apr 26, 2024 at 5:28 PM EEST, Dave Hansen wrote:
> >> On 4/16/24 07:15, Jarkko Sakkinen wrote:
> >> > On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote:
> >> > Yes, exactly. I'd take one week break and cycle the kselftest part
> >> > internally a bit as I said my previous response. I'm sure that there
> >> > is experise inside Intel how to implement it properly. I.e. take some
> >> > time to find the right person, and wait as long as that person has a
> >> > bit of bandwidth to go through the test and suggest modifications.
> >>
> >> Folks, I worry that this series is getting bogged down in the selftests.
> >>  Yes, selftests are important.  But getting _some_ tests in the kernel
> >> is substantially more important than getting perfect tests.
> >>
> >> I don't think Haitao needs to "cycle" this back inside Intel.
> >
> > The problem with the tests was that they are hard to run anything else
> > than Ubuntu (and perhaps Debian). It is hopefully now taken care of.
> > Selftests do not have to be perfect but at minimum they need to be
> > runnable.
> >
> > I need ret-test the latest series because it is possible that I did not
> > have right flags (I was travelling few days thus have not done it yet).
> >
> > BR, Jarkko
> >
>
> Let me know if you want me to send v13 before testing or you can just use  
> the sgx_cg_upstream_v12_plus branch in my repo.
>
> Also thanks for the "Reviewed-by" tags for other patches. But I've not got  
> "Reviewed-by" from you for patches #8-12 (not sure I missed). Could you go  
> through those alsoe when you get chance?

So, I compiled v12 branch. Was the only difference in selftests?

I can just copy them to the device.

BR, Jarkko

Re: [PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

2024-04-29 Thread Jarkko Sakkinen

On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> Two enclave threads may try to add and remove the same enclave page
> simultaneously (e.g., if the SGX runtime supports both lazy allocation
> and `MADV_DONTNEED` semantics). Consider this race:
>
> 1. T1 performs page removal in sgx_encl_remove_pages() and stops right
>after removing the page table entry and right before re-acquiring the
>enclave lock to EREMOVE and xa_erase(>page_array) the page.
> 2. T2 tries to access the page, and #PF[not_present] is raised. The
>condition to EAUG in sgx_vma_fault() is not satisfied because the
>page is still present in encl->page_array, thus the SGX driver
>assumes that the fault happened because the page was swapped out. The
>driver continues on a code path that installs a page table entry
>*without* performing EAUG.
> 3. The enclave page metadata is in inconsistent state: the PTE is
>installed but there was no EAUG. Thus, T2 in userspace infinitely
>receives SIGSEGV on this page (and EACCEPT always fails).
>
> Fix this by making sure that T1 (the page-removing thread) always wins
> this data race. In particular, the page-being-removed is marked as such,
> and T2 retries until the page is fully removed.
>
> Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Dmitrii Kuvaiskii 
> ---
>  arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
>  arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
>  arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
>  3 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 41f14b1a3025..7ccd8b2fce5f 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct 
> sgx_encl *encl,
>  
>   /* Entry successfully located. */
>   if (entry->epc_page) {
> - if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> + if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
> +SGX_ENCL_PAGE_BEING_REMOVED))
>   return ERR_PTR(-EBUSY);
>  
>   return entry;
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> index f94ff14c9486..fff5f2293ae7 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -25,6 +25,9 @@
>  /* 'desc' bit marking that the page is being reclaimed. */
>  #define SGX_ENCL_PAGE_BEING_RECLAIMEDBIT(3)
>  
> +/* 'desc' bit marking that the page is being removed. */
> +#define SGX_ENCL_PAGE_BEING_REMOVED  BIT(2)
> +
>  struct sgx_encl_page {
>   unsigned long desc;
>   unsigned long vm_max_prot_bits:8;
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index b65ab214bdf5..c542d4dd3e64 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
>* Do not keep encl->lock because of dependency on
>* mmap_lock acquired in sgx_zap_enclave_ptes().
>*/
> + entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
>   mutex_unlock(>lock);
>  
>   sgx_zap_enclave_ptes(encl, addr);

It is somewhat trivial to NAK this as the commit message does
not do any effort describing the new flag. By default at least
I have strong opposition against any new flags related to
reclaiming even if it needs a bit of extra synchronization
work in the user space.

One way to describe concurrency scenarios would be to take
example from https://www.kernel.org/doc/Documentation/memory-barriers.txt

I.e. see the examples with CPU 1 and CPU 2.

BR, Jarkko

Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-04-29 Thread Jarkko Sakkinen

On Mon Apr 29, 2024 at 4:22 PM EEST, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 4:04 PM EEST, Jarkko Sakkinen wrote:
> > > Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
> > > fault handler so that no signal is sent to userspace, and (2) by
> > > replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no
> > > EREMOVE is performed.
> >
> > What is the collateral damage caused by ENCLS[EREMOVE]?
>
> Have you measured cost of eremove on an empty page?
>
> I tried to lookup for a thread from lore because I have a faint memory
> that it was concluded that its cost irrelevant. Please correct if I'm
> wrong.

Also pseudocode for EREMOVE supports this as it just returns without
actually doing anything.

BR, Jarkko

Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-04-29 Thread Jarkko Sakkinen

On Mon Apr 29, 2024 at 4:04 PM EEST, Jarkko Sakkinen wrote:
> > Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
> > fault handler so that no signal is sent to userspace, and (2) by
> > replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no
> > EREMOVE is performed.
>
> What is the collateral damage caused by ENCLS[EREMOVE]?

Have you measured cost of eremove on an empty page?

I tried to lookup for a thread from lore because I have a faint memory
that it was concluded that its cost irrelevant. Please correct if I'm
wrong.

BR, Jarkko

Re: [PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows

2024-04-29 Thread Jarkko Sakkinen

On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
> enclave pages and may support MADV_DONTNEED semantics [1]. The former
> implies #PF-based page allocation, and the latter implies the usage of
> SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.
>
> A trivial program like below (run under Gramine and with EDMM enabled)
> stresses these two flows in the SGX driver and hangs:
>
> /* repeatedly touch different enclave pages at random and mix with
>  * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */
> static void* thread_func(void* arg) {
> size_t num_pages = 0xA000 / page_size;
> for (int i = 0; i < 5000; i++) {
> size_t page = get_random_ulong() % num_pages;
> char data = READ_ONCE(((char*)arg)[page * page_size]);
>
> page = get_random_ulong() % num_pages;
> madvise(arg + page * page_size, page_size, MADV_DONTNEED);
> }
> }
>
> addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
> pthread_t threads[16];
> for (int i = 0; i < 16; i++)
> pthread_create([i], NULL, thread_func, addr);

I'm not convinced that kernel is the problem here but it could be also
how Gramine is implemented.

So maybe you could make a better case of that. The example looks a bit
artificial to me.

>
> This program uncovers two data races in the SGX driver. The remaining
> patches describe and fix these races.
>
> I performed several stress tests to verify that there are no other data
> races (at least with the test program above):
>
> - On Icelake server with 128GB of PRMRR (EPC), without madvise(). This
>   stresses the first data race. A Gramine SGX test suite running in the
>   background for additional stressing. Result: 1,000 runs without hangs
>   (result without the first bug fix: hangs every time).
> - On Icelake server with 128GB of PRMRR (EPC), with madvise(). This
>   stresses the second data race. A Gramine SGX test suite running in the
>   background for additional stressing. Result: 1,000 runs without hangs
>   (result with the first bug fix but without the second bug fix: hangs
>   approx. once in 50 runs).
> - On Icelake server with 4GB of PRMRR (EPC), with madvise(). This
>   additionally stresses the enclave page swapping flows. Two Gramine SGX
>   test suites running in the background for additional stressing of
>   swapping (I observe 100% CPU utilization from ksgxd which confirms that
>   swapping happens). Result: 1,000 runs without hangs.
>
> (Sorry for the previous copy of this email, accidentally sent to
> sta...@vger.kernel.org. Failed to use `--suppress-cc` during a test send.)
>
> Dmitrii Kuvaiskii (2):
>   x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
>   x86/sgx: Resolve EREMOVE page vs EAUG page data race
>
>  arch/x86/kernel/cpu/sgx/encl.c  | 10 +++---
>  arch/x86/kernel/cpu/sgx/encl.h  |  3 +++
>  arch/x86/kernel/cpu/sgx/ioctl.c |  1 +
>  3 files changed, 11 insertions(+), 3 deletions(-)

BR, Jarkko

Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-04-29 Thread Jarkko Sakkinen

On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> Two enclave threads may try to access the same non-present enclave page
> simultaneously (e.g., if the SGX runtime supports lazy allocation). The
> threads will end up in sgx_encl_eaug_page(), racing to acquire the
> enclave lock. The winning thread will perform EAUG, set up the page
> table entry, and insert the page into encl->page_array. The losing
> thread will then get -EBUSY on xa_insert(>page_array) and proceed
> to error handling path.

And that path removes page. Not sure I got gist of this tbh.

> This error handling path contains two bugs: (1) SIGBUS is sent to
> userspace even though the enclave page is correctly installed by another
> thread, and (2) sgx_encl_free_epc_page() is called that performs EREMOVE
> even though the enclave page was never intended to be removed. The first
> bug is less severe because it impacts only the user space; the second
> bug is more severe because it also impacts the OS state by ripping the
> page (added by the winning thread) from the enclave.
>
> Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
> fault handler so that no signal is sent to userspace, and (2) by
> replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no
> EREMOVE is performed.

What is the collateral damage caused by ENCLS[EREMOVE]?

>
> Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized 
> enclave")
> Cc: sta...@vger.kernel.org
> Reported-by: Marcelina Kościelnicka 
> Suggested-by: Reinette Chatre 
> Signed-off-by: Dmitrii Kuvaiskii 
> ---
>  arch/x86/kernel/cpu/sgx/encl.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 279148e72459..41f14b1a3025 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct 
> vm_area_struct *vma,
>* If ret == -EBUSY then page was created in another flow while
>* running without encl->lock
>*/
> - if (ret)
> + if (ret) {
> + if (ret == -EBUSY)
> + vmret = VM_FAULT_NOPAGE;
>   goto err_out_shrink;
> + }
>  
>   pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
>   pginfo.addr = encl_page->desc & PAGE_MASK;
> @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct 
> vm_area_struct *vma,
>  err_out_shrink:
>   sgx_encl_shrink(encl, va_page);
>  err_out_epc:
> - sgx_encl_free_epc_page(epc_page);
> + sgx_free_epc_page(epc_page);

This ignores check for the page being reclaimer tracked, i.e. it does
changes that have been ignored in the commit message.

>  err_out_unlock:
>   mutex_unlock(>lock);
>   kfree(encl_page);


BR, Jarkko

Re: [RFC PATCH v2 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup

2024-04-28 Thread Jarkko Sakkinen

On Fri Apr 26, 2024 at 5:18 PM EEST, Bojun Zhu wrote:
> EDMM's ioctl()s support batch operations, which may be
> time-consuming. Try to explicitly give up the CPU as the prefix
> operation at the every begin of "for loop" in
> sgx_enclave_{ modify_types | restrict_permissions | remove_pages}
> to give other tasks a chance to run, and avoid softlockup warning.
>
> Additionally perform pending signals check as the prefix operation,
> and introduce sgx_check_signal_and_resched(),
> which wraps all the checks.
>
> The following has been observed on Linux v6.9-rc5 with kernel
> preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel
> is requested to restrict page permissions of a large number of EPC pages.
>
> [ cut here ]
> watchdog: BUG: soft lockup - CPU#45 stuck for 22s!
> ...
> RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0
> ...
> Call Trace:
>  sgx_ioctl
>  __x64_sys_ioctl
>  x64_sys_call
>  do_syscall_64
>  entry_SYSCALL_64_after_hwframe
> [ end trace ]
>
> Signed-off-by: Bojun Zhu 
> ---
>  arch/x86/kernel/cpu/sgx/ioctl.c | 40 +
>  1 file changed, 36 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index b65ab214bdf5..e0645920bcb5 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -365,6 +365,20 @@ static int sgx_validate_offset_length(struct sgx_encl 
> *encl,
>   return 0;
>  }
>  
> +/*
> + * Check signals and invoke scheduler. Return true for a pending signal.
> + */
> +static bool sgx_check_signal_and_resched(void)
> +{
> + if (signal_pending(current))
> + return true;
> +
> + if (need_resched())
> + cond_resched();
> +
> + return false;
> +}
> +
>  /**
>   * sgx_ioc_enclave_add_pages() - The handler for %SGX_IOC_ENCLAVE_ADD_PAGES
>   * @encl:   an enclave pointer
> @@ -432,16 +446,13 @@ static long sgx_ioc_enclave_add_pages(struct sgx_encl 
> *encl, void __user *arg)
>   return -EINVAL;
>  
>   for (c = 0 ; c < add_arg.length; c += PAGE_SIZE) {
> - if (signal_pending(current)) {
> + if (sgx_check_signal_and_resched()) {
>   if (!c)
>   ret = -ERESTARTSYS;
>  
>   break;
>   }
>  
> - if (need_resched())
> - cond_resched();
> -
>   ret = sgx_encl_add_page(encl, add_arg.src + c, add_arg.offset + 
> c,
>   , add_arg.flags);
>   if (ret)
> @@ -746,6 +757,13 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl,
>   secinfo.flags = modp->permissions & SGX_SECINFO_PERMISSION_MASK;
>  
>   for (c = 0 ; c < modp->length; c += PAGE_SIZE) {
> + if (sgx_check_signal_and_resched()) {
> + if (!c)
> + ret = -ERESTARTSYS;
> +
> + goto out;
> + }
> +
>   addr = encl->base + modp->offset + c;
>  
>   sgx_reclaim_direct();
> @@ -913,6 +931,13 @@ static long sgx_enclave_modify_types(struct sgx_encl 
> *encl,
>   secinfo.flags = page_type << 8;
>  
>   for (c = 0 ; c < modt->length; c += PAGE_SIZE) {
> + if (sgx_check_signal_and_resched()) {
> + if (!c)
> + ret = -ERESTARTSYS;
> +
> + goto out;
> + }
> +
>   addr = encl->base + modt->offset + c;
>  
>   sgx_reclaim_direct();
> @@ -1101,6 +1126,13 @@ static long sgx_encl_remove_pages(struct sgx_encl 
> *encl,
>   secinfo.flags = SGX_SECINFO_R | SGX_SECINFO_W | SGX_SECINFO_X;
>  
>   for (c = 0 ; c < params->length; c += PAGE_SIZE) {
> + if (sgx_check_signal_and_resched()) {
> + if (!c)
> + ret = -ERESTARTSYS;
> +
> + goto out;
> + }
> +
>   addr = encl->base + params->offset + c;
>  
>   sgx_reclaim_direct();

I think Dave's suggestions make sense, so unfortunately needs yet
another spin. 

BR, Jarkko

Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-28 Thread Jarkko Sakkinen

On Fri Apr 26, 2024 at 5:28 PM EEST, Dave Hansen wrote:
> On 4/16/24 07:15, Jarkko Sakkinen wrote:
> > On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote:
> > Yes, exactly. I'd take one week break and cycle the kselftest part
> > internally a bit as I said my previous response. I'm sure that there
> > is experise inside Intel how to implement it properly. I.e. take some
> > time to find the right person, and wait as long as that person has a
> > bit of bandwidth to go through the test and suggest modifications.
>
> Folks, I worry that this series is getting bogged down in the selftests.
>  Yes, selftests are important.  But getting _some_ tests in the kernel
> is substantially more important than getting perfect tests.
>
> I don't think Haitao needs to "cycle" this back inside Intel.

The problem with the tests was that they are hard to run anything else
than Ubuntu (and perhaps Debian). It is hopefully now taken care of.
Selftests do not have to be perfect but at minimum they need to be
runnable.

I need ret-test the latest series because it is possible that I did not
have right flags (I was travelling few days thus have not done it yet).

BR, Jarkko

Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-24 Thread Jarkko Sakkinen

On Wed Apr 24, 2024 at 10:42 PM EEST, Haitao Huang wrote:
> Hi Jarkko
> On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:
> >> I did declare the configs in the config file but I missed it in my patch
> >> as stated earlier. IIUC, that would not cause this error though.
> >>
> >> Maybe I should exit with the skip code if no CGROUP_MISC (no more
> >> CGROUP_SGX_EPC) is configured?
> > OK, so I wanted to do a distro kernel test here, and used the default
> > OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.
>
> I couldn't figure out why this failure you have encountered. I think  
> OpenSUSE kernel most likely config CGROUP_MISC.
>
> Also if CGROUP_MISC not set, then there should be error happen earlier on  
> echoing "+misc" to cgroup.subtree_control at line 20. But your log  
> indicates only error on echoing "sgx_epc ..." to  
> /sys/fs/cgroup/...//misc.max.
>
> I can only speculate that can could happen (if sgx epc cgroup was compiled  
> in) when the cgroup-fs subdirectories in question already have populated  
> config that is conflicting with the scripts.
>
> Could you double check or start from a clean environment?
> Thanks
> Haitao

I can re-check next week once I'm back from Estonia. I'm travelling now
to https://lu.ma/uncloud.

BR, Jarkko

Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup

2024-04-24 Thread Jarkko Sakkinen

On Wed Apr 24, 2024 at 8:44 PM EEST, Jarkko Sakkinen wrote:
> On Wed Apr 24, 2024 at 2:50 PM EEST, Bojun Zhu wrote:
> > I still have some questions:
> >
> > It seems that the variable "ret" is set to 0 if there is **some** EPC pages 
> > have been 
> > added when interrupted by signal(Supposed that sgx_encl_add_page() 
> > always returns successfully).
>
> Ah, ok.
>
> Returning zero is right thing to do because it also returns count of
> pages successfully added. I.e. the function does not guarantee that
> all pages are processsed but it does guarantee that the system is in
> predictable state.
>
> It could be that e.g. sgx_alloc_epc_page() calls fails.
>
> So, it is a bit like how read system call works.

Good that you asked that had rethink for a while.

What I would suggest that you just put out 2nd verson out and then we
see where it is going.

For sgx_ioc_encl_add_pages() it is important to maintain exact semantics
as we need to maintain backwards compatibility.

BR, Jarkko

Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup

2024-04-24 Thread Jarkko Sakkinen

On Wed Apr 24, 2024 at 2:50 PM EEST, Bojun Zhu wrote:
> I still have some questions:
>
> It seems that the variable "ret" is set to 0 if there is **some** EPC pages 
> have been 
> added when interrupted by signal(Supposed that sgx_encl_add_page() 
> always returns successfully).

Ah, ok.

Returning zero is right thing to do because it also returns count of
pages successfully added. I.e. the function does not guarantee that
all pages are processsed but it does guarantee that the system is in
predictable state.

It could be that e.g. sgx_alloc_epc_page() calls fails.

So, it is a bit like how read system call works.

BR, Jarkko

Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup

2024-04-24 Thread Jarkko Sakkinen

On Wed Apr 24, 2024 at 10:02 AM EEST, Jarkko Sakkinen wrote:
> On Wed Apr 24, 2024 at 9:46 AM EEST, Bojun Zhu wrote:
> > Based on the the discussion among you, Jarkko and Reinette,
> > I will keep the need_resched() and wrap the logic in using sgx_resched(),
> > as suggested by Jarkko.
>
> Sounds like a plan :-)

In sgx_ioc_enclave_add_pages() "if (!c)" check might cause possibly
some  confusion.

Reason for it is that in "transaction sense" the operation can
be only meaningfully restarted when no pages have not been added
as MRENCLAVE checksum cannot be reset.

BR, Jarkko

Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup

2024-04-24 Thread Jarkko Sakkinen

On Wed Apr 24, 2024 at 9:46 AM EEST, Bojun Zhu wrote:
> Based on the the discussion among you, Jarkko and Reinette,
> I will keep the need_resched() and wrap the logic in using sgx_resched(),
> as suggested by Jarkko.

Sounds like a plan :-)

> Regards,
>
> Bojun

BR, Jarkko

Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup

2024-04-23 Thread Jarkko Sakkinen

On Tue Apr 23, 2024 at 8:08 PM EEST, Reinette Chatre wrote:
> Hi Kai,
>
> On 4/23/2024 4:50 AM, Huang, Kai wrote:
> >> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c 
> >> b/arch/x86/kernel/cpu/sgx/ioctl.c
> >> index b65ab214bdf5..2340a82fa796 100644
> >> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> >> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> >> @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl,
> >>}
> >>  
> >>mutex_unlock(>lock);
> >> +
> >> +  if (need_resched())
> >> +  cond_resched();
> >>}
> >>  
> >>ret = 0;
> >> @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl 
> >> *encl,
> >>entry->type = page_type;
> >>  
> >>mutex_unlock(>lock);
> >> +
> >> +  if (need_resched())
> >> +  cond_resched();
> >>}
> >>  
> >>ret = 0;
> >> @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl 
> >> *encl,
> >>kfree(entry);
> >>  
> >>mutex_unlock(>lock);
> >> +
> >> +  if (need_resched())
> >> +  cond_resched();
> >>}
> >>
> > 
> > You can remove the need_reshced() in all 3 places above but just call
> > cond_resched() directly.
> > 
>
> This change will call cond_resched() after dealing with each page in a
> potentially large page range (cover mentions 30GB but we have also had to
> make optimizations for enclaves larger than this). Adding a cond_resched()
> here will surely placate the soft lockup detector, but we need to take care
> how changes like this impact the performance of the system and having actions
> on these page ranges take much longer than necessary.
> For reference, please see 7b72c823ddf8 ("x86/sgx: Reduce delay and 
> interference
> of enclave release") that turned frequent cond_resched() into batches
> to address performance issues.
>
> It looks to me like the need_resched() may be a quick check that can be used
> to improve performance? I am not familiar with all use cases that need to be
> considered to determine if a batching solution may be needed.

Ya, well no matter it is the reasoning will need to be documented
because this should have symmetry with sgx_ioc_enclave_add_pages()
(see my response to Kai).

I because this makes dealing with need_resched() a change in code
even if it is left out as a side-effect, I'd support of not removing
which means adding need_resched() as a side-effect.

>From this follows that *if* need_resched() needs to be removed then
that is not really part of the bug fix, so in all cases the bug fix
itself must include need_resched() :-)

phew, hope you got my logic here, i think it reasonable...

BR, Jarkko

Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup

2024-04-23 Thread Jarkko Sakkinen

On Tue Apr 23, 2024 at 2:50 PM EEST, Huang, Kai wrote:
> On Tue, 2024-04-23 at 17:25 +0800, 朱伯君(杰铭) wrote:
> > EDMM's ioctl()s support batch operations, which may be
> > time-consuming. Try to explicitly give up the CPU at
> > the every end of "for loop" in
> > sgx_enclave_{ modify_types | restrict_permissions | remove_pages}
> > to give other tasks a chance to run, and avoid softlockup warning.
> > 
> > The following has been observed on Linux v6.9-rc5 with kernel
> > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel
> > is requested to restrict page permissions of a large number of EPC pages.
> > 
> > [ cut here ]
> > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905]
> > ...
> > CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7
> > ...
> > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0
> > Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e 
> > f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf  00 00 00 
> > 40 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef
> > RSP: 0018:b55a6591fa80 EFLAGS: 0202
> > RAX:  RBX: b55a6591fac0 RCX: b581e7384000
> > RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000
> > RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0
> > R10: 006e R11: 0002 R12: 00072052d000
> > R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020
> > FS:  7fe10dbda740() GS:92163e48() 
> > knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe07f0 DR7: 0400
> > PKRU: 5554
> > Call Trace:
> >  
> >  ? show_regs+0x67/0x70
> >  ? watchdog_timer_fn+0x1f3/0x280
> >  ? __pfx_watchdog_timer_fn+0x10/0x10
> >  ? __hrtimer_run_queues+0xc8/0x220
> >  ? hrtimer_interrupt+0x10c/0x250
> >  ? __sysvec_apic_timer_interrupt+0x53/0x130
> >  ? sysvec_apic_timer_interrupt+0x7b/0x90
> >  
> >  
> >  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> >  ? sgx_enclave_restrict_permissions+0xba/0x1f0
> >  ? __pte_offset_map_lock+0x94/0x110
> >  ? sgx_encl_test_and_clear_young_cb+0x40/0x60
> >  sgx_ioctl+0x1ab/0x900
> >  ? do_syscall_64+0x79/0x110
> >  ? apply_to_page_range+0x14/0x20
> >  ? sgx_encl_test_and_clear_young+0x6c/0x80
> >  ? sgx_vma_fault+0x132/0x4f0
> >  __x64_sys_ioctl+0x95/0xd0
> >  x64_sys_call+0x1209/0x20c0
> >  do_syscall_64+0x6d/0x110
> >  ? do_syscall_64+0x79/0x110
> >  ? do_pte_missing+0x2e8/0xcc0
> >  ? __pte_offset_map+0x1c/0x190
> >  ? __handle_mm_fault+0x7b9/0xe60
> >  ? __count_memcg_events+0x70/0x100
> >  ? handle_mm_fault+0x256/0x360
> >  ? do_user_addr_fault+0x3c1/0x860
> >  ? irqentry_exit_to_user_mode+0x67/0x190
> >  ? irqentry_exit+0x3b/0x50
> >  ? exc_page_fault+0x89/0x180
> >  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > RIP: 0033:0x7fe10e2ee5cb
> > Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff 
> > ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 
> > ff ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48
> > RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010
> > RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb
> > RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005
> > RBP: 0005 R08:  R09: 7fffb2c75594
> > R10: 7fffb2c755c8 R11: 0246 R12: c028a405
> > R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980
> >  
> >  [ end trace ]
>
> Could you trim down the trace to only include the relevant part?
>
> E.g., please at least remove the two register dumps at the beginning and
> end of the trace.
>
> Please refer to "Backtraces in commit messages" section in
> Documentation/process/submitting-patches.rst.
>
> > 
> > Signed-off-by: Bojun Zhu 
> > ---
> >  arch/x86/kernel/cpu/sgx/ioctl.c | 9 +
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c 
> > b/arch/x86/kernel/cpu/sgx/ioctl.c
> > index b65ab214bdf5..2340a82fa796 100644
> > --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> > @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl,
> > }
> >  
> > mutex_unlock(>lock);
> > +
> > +   if (need_resched())
> > +   cond_resched();
> > }
> >  
> > ret = 0;
> > @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl 
> > *encl,
> >

Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup

2024-04-23 Thread Jarkko Sakkinen

On Wed Apr 24, 2024 at 12:10 AM EEST, Jarkko Sakkinen wrote:
> On Tue Apr 23, 2024 at 12:25 PM EEST, =?UTF-8?B?5pyx5Lyv5ZCbKOadsOmTrSk=?= 
> wrote:
> > EDMM's ioctl()s support batch operations, which may be
> > time-consuming. Try to explicitly give up the CPU at
> > the every end of "for loop" in
> > sgx_enclave_{ modify_types | restrict_permissions | remove_pages}
> > to give other tasks a chance to run, and avoid softlockup warning.
> >
> > The following has been observed on Linux v6.9-rc5 with kernel
> > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel
> > is requested to restrict page permissions of a large number of EPC pages.
> >
> > [ cut here ]
> > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905]
> > ...
> > CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7
> > ...
> > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0
> > Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e 
> > f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf  00 00 00 
> > 40 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef
> > RSP: 0018:b55a6591fa80 EFLAGS: 0202
> > RAX:  RBX: b55a6591fac0 RCX: b581e7384000
> > RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000
> > RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0
> > R10: 006e R11: 0002 R12: 00072052d000
> > R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020
> > FS:  7fe10dbda740() GS:92163e48() 
> > knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe07f0 DR7: 0400
> > PKRU: 5554
> > Call Trace:
> >  
> >  ? show_regs+0x67/0x70
> >  ? watchdog_timer_fn+0x1f3/0x280
> >  ? __pfx_watchdog_timer_fn+0x10/0x10
> >  ? __hrtimer_run_queues+0xc8/0x220
> >  ? hrtimer_interrupt+0x10c/0x250
> >  ? __sysvec_apic_timer_interrupt+0x53/0x130
> >  ? sysvec_apic_timer_interrupt+0x7b/0x90
> >  
> >  
> >  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> >  ? sgx_enclave_restrict_permissions+0xba/0x1f0
> >  ? __pte_offset_map_lock+0x94/0x110
> >  ? sgx_encl_test_and_clear_young_cb+0x40/0x60
> >  sgx_ioctl+0x1ab/0x900
> >  ? do_syscall_64+0x79/0x110
> >  ? apply_to_page_range+0x14/0x20
> >  ? sgx_encl_test_and_clear_young+0x6c/0x80
> >  ? sgx_vma_fault+0x132/0x4f0
> >  __x64_sys_ioctl+0x95/0xd0
> >  x64_sys_call+0x1209/0x20c0
> >  do_syscall_64+0x6d/0x110
> >  ? do_syscall_64+0x79/0x110
> >  ? do_pte_missing+0x2e8/0xcc0
> >  ? __pte_offset_map+0x1c/0x190
> >  ? __handle_mm_fault+0x7b9/0xe60
> >  ? __count_memcg_events+0x70/0x100
> >  ? handle_mm_fault+0x256/0x360
> >  ? do_user_addr_fault+0x3c1/0x860
> >  ? irqentry_exit_to_user_mode+0x67/0x190
> >  ? irqentry_exit+0x3b/0x50
> >  ? exc_page_fault+0x89/0x180
> >  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > RIP: 0033:0x7fe10e2ee5cb
> > Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff 
> > ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 
> > ff ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48
> > RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010
> > RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb
> > RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005
> > RBP: 0005 R08:  R09: 7fffb2c75594
> > R10: 7fffb2c755c8 R11: 0246 R12: c028a405
> > R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980
> >  
> >  [ end trace ]
> >
> > Signed-off-by: Bojun Zhu 
>
> Can you also fixup this as your "firstname lastname" in your emails
> from field? This matters so that author field in git log matches your
> sob.
>
> > ---
> >  arch/x86/kernel/cpu/sgx/ioctl.c | 9 +
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c 
> > b/arch/x86/kernel

Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup

2024-04-23 Thread Jarkko Sakkinen

On Tue Apr 23, 2024 at 12:25 PM EEST, =?UTF-8?B?5pyx5Lyv5ZCbKOadsOmTrSk=?= 
wrote:
> EDMM's ioctl()s support batch operations, which may be
> time-consuming. Try to explicitly give up the CPU at
> the every end of "for loop" in
> sgx_enclave_{ modify_types | restrict_permissions | remove_pages}
> to give other tasks a chance to run, and avoid softlockup warning.
>
> The following has been observed on Linux v6.9-rc5 with kernel
> preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel
> is requested to restrict page permissions of a large number of EPC pages.
>
> [ cut here ]
> watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905]
> ...
> CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7
> ...
> RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0
> Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e 
> f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf  00 00 00 40 
> 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef
> RSP: 0018:b55a6591fa80 EFLAGS: 0202
> RAX:  RBX: b55a6591fac0 RCX: b581e7384000
> RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000
> RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0
> R10: 006e R11: 0002 R12: 00072052d000
> R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020
> FS:  7fe10dbda740() GS:92163e48() 
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe07f0 DR7: 0400
> PKRU: 5554
> Call Trace:
>  
>  ? show_regs+0x67/0x70
>  ? watchdog_timer_fn+0x1f3/0x280
>  ? __pfx_watchdog_timer_fn+0x10/0x10
>  ? __hrtimer_run_queues+0xc8/0x220
>  ? hrtimer_interrupt+0x10c/0x250
>  ? __sysvec_apic_timer_interrupt+0x53/0x130
>  ? sysvec_apic_timer_interrupt+0x7b/0x90
>  
>  
>  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
>  ? sgx_enclave_restrict_permissions+0xba/0x1f0
>  ? __pte_offset_map_lock+0x94/0x110
>  ? sgx_encl_test_and_clear_young_cb+0x40/0x60
>  sgx_ioctl+0x1ab/0x900
>  ? do_syscall_64+0x79/0x110
>  ? apply_to_page_range+0x14/0x20
>  ? sgx_encl_test_and_clear_young+0x6c/0x80
>  ? sgx_vma_fault+0x132/0x4f0
>  __x64_sys_ioctl+0x95/0xd0
>  x64_sys_call+0x1209/0x20c0
>  do_syscall_64+0x6d/0x110
>  ? do_syscall_64+0x79/0x110
>  ? do_pte_missing+0x2e8/0xcc0
>  ? __pte_offset_map+0x1c/0x190
>  ? __handle_mm_fault+0x7b9/0xe60
>  ? __count_memcg_events+0x70/0x100
>  ? handle_mm_fault+0x256/0x360
>  ? do_user_addr_fault+0x3c1/0x860
>  ? irqentry_exit_to_user_mode+0x67/0x190
>  ? irqentry_exit+0x3b/0x50
>  ? exc_page_fault+0x89/0x180
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7fe10e2ee5cb
> Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff 
> ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff 
> ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48
> RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010
> RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb
> RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005
> RBP: 0005 R08:  R09: 7fffb2c75594
> R10: 7fffb2c755c8 R11: 0246 R12: c028a405
> R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980
>  
>  [ end trace ]
>
> Signed-off-by: Bojun Zhu 

Can you also fixup this as your "firstname lastname" in your emails
from field? This matters so that author field in git log matches your
sob.

> ---
>  arch/x86/kernel/cpu/sgx/ioctl.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index b65ab214bdf5..2340a82fa796 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl,
>   }
>  
>   mutex_unlock(>lock);
> +
> + if (need_resched())
> + cond_resched();
>   }
>  
>   ret = 0;
> @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl 
> *encl,
>   entry->type = page_type;
>  
>   mutex_unlock(>lock);
> +
> + if (need_resched())
> + cond_resched();
>   }
>  
>   ret = 0;
> @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
>   kfree(entry);
>  
>   mutex_unlock(>lock);
> +
> + if

timeout triggered probe

2024-04-19 Thread Jarkko Sakkinen

Hi

I wonder if there is a way to attach a probe to a timeout in process
context?

This spun off from https://social.kernel.org/notice/Ah18GN3aezWLdcxcAK

Also relevant: https://social.kernel.org/notice/Ah2O2qXO32lPRPQkk4

Maybe there is a way already but being able to do this would have
practical use for sure if it does not.

BR, Jarkko

Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-17 Thread Jarkko Sakkinen

On Wed Apr 17, 2024 at 1:04 AM EEST, Haitao Huang wrote:
> On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:
> >> I did declare the configs in the config file but I missed it in my patch
> >> as stated earlier. IIUC, that would not cause this error though.
> >>
> >> Maybe I should exit with the skip code if no CGROUP_MISC (no more
> >> CGROUP_SGX_EPC) is configured?
> >
> > OK, so I wanted to do a distro kernel test here, and used the default
> > OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.
> >
> >> tools/testing/selftests$ find . -name README
> >> ./futex/README
> >> ./tc-testing/README
> >> ./net/forwarding/README
> >> ./powerpc/nx-gzip/README
> >> ./ftrace/README
> >> ./arm64/signal/README
> >> ./arm64/fp/README
> >> ./arm64/README
> >> ./zram/README
> >> ./livepatch/README
> >> ./resctrl/README
> >
> > So is there a README because of override timeout parameter? Maybe it
> > should be just set to a high enough value?
> >
> > BR, Jarkko
> >
>
>
>  From the docs, I think we are supposed to use override.
> See: https://docs.kernel.org/dev-tools/kselftest.html#timeout-for-selftests

OK, fair enough :-) I did not know this.

BR, Jarkko

Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-16 Thread Jarkko Sakkinen

On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:
> I did declare the configs in the config file but I missed it in my patch  
> as stated earlier. IIUC, that would not cause this error though.
>
> Maybe I should exit with the skip code if no CGROUP_MISC (no more  
> CGROUP_SGX_EPC) is configured?

OK, so I wanted to do a distro kernel test here, and used the default
OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.

> tools/testing/selftests$ find . -name README
> ./futex/README
> ./tc-testing/README
> ./net/forwarding/README
> ./powerpc/nx-gzip/README
> ./ftrace/README
> ./arm64/signal/README
> ./arm64/fp/README
> ./arm64/README
> ./zram/README
> ./livepatch/README
> ./resctrl/README

So is there a README because of override timeout parameter? Maybe it
should be just set to a high enough value?

BR, Jarkko

Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-16 Thread Jarkko Sakkinen

On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote:
> > 
> > I'll send a fixup for this patch or another version of the series if more  
> > changes are needed.
>
> Hi Haitao,
>
> I don't like to say but in general I think you are sending too frequently.  
> The
> last version was sent April, 11th (my time), so considering the weekend it has
> only been 3 or at most 4 days.  
>
> Please slow down a little bit to give people more time.
>
> More information please also see:
>
> https://www.kernel.org/doc/html/next/process/submitting-patches.html#resend-reminders

+1

Yes, exactly. I'd take one week break and cycle the kselftest part
internally a bit as I said my previous response. I'm sure that there
is experise inside Intel how to implement it properly. I.e. take some
time to find the right person, and wait as long as that person has a
bit of bandwidth to go through the test and suggest modifications.

Cannot blame, as I've done the same mistake a few times in past but
yeah this would be the best possible corrective action to take.

BR, Jarkko

Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-16 Thread Jarkko Sakkinen

On Tue Apr 16, 2024 at 5:05 PM EEST, Jarkko Sakkinen wrote:
> On Tue Apr 16, 2024 at 6:20 AM EEST, Haitao Huang wrote:
> > With different cgroups, the script starts one or multiple concurrent SGX
> > selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
> > test case, which loads an enclave of EPC size equal to the EPC capacity
> > available on the platform. The script checks results against the
> > expectation set for each cgroup and reports success or failure.
> >
> > The script creates 3 different cgroups at the beginning with following
> > expectations:
> >
> > 1) SMALL - intentionally small enough to fail the test loading an
> > enclave of size equal to the capacity.
> > 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> > more than 4 concurrent tests are run. The script starts 4 expecting at
> > least one test to pass, and then starts 5 expecting at least one test
> > to fail.
> > 3) LARGER - limit is the same as the capacity, large enough to run lots of
> > concurrent tests. The script starts 8 of them and expects all pass.
> > Then it reruns the same test with one process randomly killed and
> > usage checked to be zero after all processes exit.
> >
> > The script also includes a test with low mem_cg limit and LARGE sgx_epc
> > limit to verify that the RAM used for per-cgroup reclamation is charged
> > to a proper mem_cg. For this test, it turns off swapping before start,
> > and turns swapping back on afterwards.
> >
> > Add README to document how to run the tests.
> >
> > Signed-off-by: Haitao Huang 
>
> jarkko@mustatorvisieni:~/linux-tpmdd> sudo make -C 
> tools/testing/selftests/sgx run_tests
> make: Entering directory 
> '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx'
> gcc -Wall -Werror -g 
> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
> -fPIC -c main.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o
> gcc -Wall -Werror -g 
> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
> -fPIC -c load.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o
> gcc -Wall -Werror -g 
> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
> -fPIC -c sigstruct.c -o 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o
> gcc -Wall -Werror -g 
> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
> -fPIC -c call.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o
> gcc -Wall -Werror -g 
> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
> -fPIC -c sign_key.S -o 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o
> gcc -Wall -Werror -g 
> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
> -fPIC -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_sgx 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o -z 
> noexecstack -lcrypto
> gcc -Wall -Werror -static-pie -nostdlib -ffreestanding -fPIE 
> -fno-stack-protector -mrdrnd 
> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
> test_encl.c test_encl_bootstrap.S -o 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_encl.elf 
> -Wl,-T,test_encl.lds,--build-id=none
> /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: 
> warning: /tmp/ccqvDJVg.o: missing .note.GNU-stack section implies executable 
> stack
> /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: 
> NOTE: This behaviour is deprecated and will be removed in a future version of 
> the linker
> TAP version 13
> 1..2
> # timeout set to 45
> # selftests: sgx: test_sgx
> # TAP version 13
> # 1..16
> # # Starting 16 tests from 1 test cases.
> # #  RUN   enclave.unclobbered_vdso ...
> # #OK  enclave.unclobbered_vdso
> # ok 1 enclave.unclobbered_vdso
> # #  RUN   enclave.unclobbered_vdso_oversubscribed ...
> # #OK  enclave.unclobbered_vdso_oversubscribed
> # ok 2 enclave.unclobbered_vdso_oversubscribed
> # #  RUN   enclave.unclobbered_vdso_oversubscribed_remove ...
> # # main.c:402:unclobbered_vdso_oversubscribed_remove:Creating an enclave 
> with 98566144 bytes heap may take a while ...
> # # main.c:457:unclobbered_vdso_oversubscribed_remove:Changing type of 
> 98566144 bytes to trimmed m

Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-16 Thread Jarkko Sakkinen

On Tue Apr 16, 2024 at 6:20 AM EEST, Haitao Huang wrote:
> With different cgroups, the script starts one or multiple concurrent SGX
> selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
> test case, which loads an enclave of EPC size equal to the EPC capacity
> available on the platform. The script checks results against the
> expectation set for each cgroup and reports success or failure.
>
> The script creates 3 different cgroups at the beginning with following
> expectations:
>
> 1) SMALL - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> more than 4 concurrent tests are run. The script starts 4 expecting at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) LARGER - limit is the same as the capacity, large enough to run lots of
> concurrent tests. The script starts 8 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all processes exit.
>
> The script also includes a test with low mem_cg limit and LARGE sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is charged
> to a proper mem_cg. For this test, it turns off swapping before start,
> and turns swapping back on afterwards.
>
> Add README to document how to run the tests.
>
> Signed-off-by: Haitao Huang 

jarkko@mustatorvisieni:~/linux-tpmdd> sudo make -C tools/testing/selftests/sgx 
run_tests
make: Entering directory '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx'
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c main.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c load.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c sigstruct.c -o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c call.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c sign_key.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_sgx 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o -z noexecstack 
-lcrypto
gcc -Wall -Werror -static-pie -nostdlib -ffreestanding -fPIE 
-fno-stack-protector -mrdrnd 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
test_encl.c test_encl_bootstrap.S -o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_encl.elf 
-Wl,-T,test_encl.lds,--build-id=none
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: 
warning: /tmp/ccqvDJVg.o: missing .note.GNU-stack section implies executable 
stack
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: NOTE: 
This behaviour is deprecated and will be removed in a future version of the 
linker
TAP version 13
1..2
# timeout set to 45
# selftests: sgx: test_sgx
# TAP version 13
# 1..16
# # Starting 16 tests from 1 test cases.
# #  RUN   enclave.unclobbered_vdso ...
# #OK  enclave.unclobbered_vdso
# ok 1 enclave.unclobbered_vdso
# #  RUN   enclave.unclobbered_vdso_oversubscribed ...
# #OK  enclave.unclobbered_vdso_oversubscribed
# ok 2 enclave.unclobbered_vdso_oversubscribed
# #  RUN   enclave.unclobbered_vdso_oversubscribed_remove ...
# # main.c:402:unclobbered_vdso_oversubscribed_remove:Creating an enclave with 
98566144 bytes heap may take a while ...
# # main.c:457:unclobbered_vdso_oversubscribed_remove:Changing type of 98566144 
bytes to trimmed may take a while ...
# # main.c:473:unclobbered_vdso_oversubscribed_remove:Entering enclave to run 
EACCEPT for each page of 98566144 bytes may take a while ...
# # main.c:494:unclobbered_vdso_oversubscribed_remove:Removing 98566144 bytes 
from enclave may take a while ...
# #OK  enclave.unclobbered_vdso_oversubscribed_remove
# ok 3 enclave.unclobbered_vdso_oversubscribed_remove
# #  RUN   enclave.clobbered_vdso ...
# #OK  enclave.clobbered_vdso
# ok 4 enclave.clobbered_vdso
# #  RUN   enclave.clobbered_vdso_and_user_function ...
# #OK

Re: [PATCH v11 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-15 Thread Jarkko Sakkinen

On Mon Apr 15, 2024 at 8:32 PM EEST, Haitao Huang wrote:
> On Sat, 13 Apr 2024 16:34:17 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Wed Apr 10, 2024 at 9:25 PM EEST, Haitao Huang wrote:
> >> To run selftests for EPC cgroup:
> >>
> >> sudo ./run_epc_cg_selftests.sh
> >>
> >> To watch misc cgroup 'current' changes during testing, run this in a
> >> separate terminal:
> >>
> >> ./watch_misc_for_tests.sh current
> >>
> >> With different cgroups, the script starts one or multiple concurrent SGX
> >> selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
> >> test case, which loads an enclave of EPC size equal to the EPC capacity
> >> available on the platform. The script checks results against the
> >> expectation set for each cgroup and reports success or failure.
> >>
> >> The script creates 3 different cgroups at the beginning with following
> >> expectations:
> >>
> >> 1) SMALL - intentionally small enough to fail the test loading an
> >> enclave of size equal to the capacity.
> >> 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> >> more than 4 concurrent tests are run. The script starts 4 expecting at
> >> least one test to pass, and then starts 5 expecting at least one test
> >> to fail.
> >> 3) LARGER - limit is the same as the capacity, large enough to run lots  
> >> of
> >> concurrent tests. The script starts 8 of them and expects all pass.
> >> Then it reruns the same test with one process randomly killed and
> >> usage checked to be zero after all processes exit.
> >>
> >> The script also includes a test with low mem_cg limit and LARGE sgx_epc
> >> limit to verify that the RAM used for per-cgroup reclamation is charged
> >> to a proper mem_cg. For this test, it turns off swapping before start,
> >> and turns swapping back on afterwards.
> >>
> >> Signed-off-by: Haitao Huang 
> >> ---
> >> V11:
> >> - Remove cgroups-tools dependency and make scripts ash compatible.  
> >> (Jarkko)
> >> - Drop support for cgroup v1 and simplify. (Michal, Jarkko)
> >> - Add documentation for functions. (Jarkko)
> >> - Turn off swapping before memcontrol tests and back on after
> >> - Format and style fixes, name for hard coded values
> >>
> >> V7:
> >> - Added memcontrol test.
> >>
> >> V5:
> >> - Added script with automatic results checking, remove the interactive
> >> script.
> >> - The script can run independent from the series below.
> >> ---
> >>  tools/testing/selftests/sgx/ash_cgexec.sh |  16 +
> >>  .../selftests/sgx/run_epc_cg_selftests.sh | 275 ++
> >>  .../selftests/sgx/watch_misc_for_tests.sh |  11 +
> >>  3 files changed, 302 insertions(+)
> >>  create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
> >>  create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >>  create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
> >>
> >> diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh  
> >> b/tools/testing/selftests/sgx/ash_cgexec.sh
> >> new file mode 100755
> >> index ..cfa5d2b0e795
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/sgx/ash_cgexec.sh
> >> @@ -0,0 +1,16 @@
> >> +#!/usr/bin/env sh
> >> +# SPDX-License-Identifier: GPL-2.0
> >> +# Copyright(c) 2024 Intel Corporation.
> >> +
> >> +# Start a program in a given cgroup.
> >> +# Supports V2 cgroup paths, relative to /sys/fs/cgroup
> >> +if [ "$#" -lt 2 ]; then
> >> +echo "Usage: $0   [args...]"
> >> +exit 1
> >> +fi
> >> +# Move this shell to the cgroup.
> >> +echo 0 >/sys/fs/cgroup/$1/cgroup.procs
> >> +shift
> >> +# Execute the command within the cgroup
> >> +exec "$@"
> >> +
> >> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh  
> >> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> new file mode 100755
> >> index ..dd56273056fc
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> @@ -0,0 +1,275 @@
> >> +#!/usr/bin/env sh
> >> +# SPDX-License-Identifier: GPL-2.0
> >> +# Copyright(c) 2023, 2024 Intel Corporation.
> >> +
> >> +TEST_ROOT_CG=selftest
>

Re: [PATCH v11 01/14] x86/sgx: Replace boolean parameters with enums

2024-04-15 Thread Jarkko Sakkinen

On Mon Apr 15, 2024 at 4:22 PM EEST, Huang, Kai wrote:
> On Wed, 2024-04-10 at 11:25 -0700, Haitao Huang wrote:
> > Replace boolean parameters for 'reclaim' in the function
> > sgx_alloc_epc_page() and its callers with an enum.
> > 
> > Also opportunistically remove non-static declaration of
> > __sgx_alloc_epc_page() and a typo
> > 
> > Signed-off-by: Haitao Huang 
> > Suggested-by: Jarkko Sakkinen 
> > Suggested-by: Dave Hansen 
> > 
>
> Reviewed-by: Kai Huang 

Hmm... missing my reviewed-by so:

Reviewed-by: Jarkko Sakkinen 

BR, Jarkko

Re: [PATCH v11 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-15 Thread Jarkko Sakkinen

On Mon Apr 15, 2024 at 6:13 AM EEST, Haitao Huang wrote:
> On Sun, 14 Apr 2024 10:01:03 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Wed Apr 10, 2024 at 9:25 PM EEST, Haitao Huang wrote:
> >> To run selftests for EPC cgroup:
> >>
> >> sudo ./run_epc_cg_selftests.sh
> >>
> >> To watch misc cgroup 'current' changes during testing, run this in a
> >> separate terminal:
> >>
> >> ./watch_misc_for_tests.sh current
> >>
> >> With different cgroups, the script starts one or multiple concurrent SGX
> >> selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
> >> test case, which loads an enclave of EPC size equal to the EPC capacity
> >> available on the platform. The script checks results against the
> >> expectation set for each cgroup and reports success or failure.
> >>
> >> The script creates 3 different cgroups at the beginning with following
> >> expectations:
> >>
> >> 1) SMALL - intentionally small enough to fail the test loading an
> >> enclave of size equal to the capacity.
> >> 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> >> more than 4 concurrent tests are run. The script starts 4 expecting at
> >> least one test to pass, and then starts 5 expecting at least one test
> >> to fail.
> >> 3) LARGER - limit is the same as the capacity, large enough to run lots  
> >> of
> >> concurrent tests. The script starts 8 of them and expects all pass.
> >> Then it reruns the same test with one process randomly killed and
> >> usage checked to be zero after all processes exit.
> >>
> >> The script also includes a test with low mem_cg limit and LARGE sgx_epc
> >> limit to verify that the RAM used for per-cgroup reclamation is charged
> >> to a proper mem_cg. For this test, it turns off swapping before start,
> >> and turns swapping back on afterwards.
> >>
> >> Signed-off-by: Haitao Huang 
> >> ---
> >> V11:
> >> - Remove cgroups-tools dependency and make scripts ash compatible.  
> >> (Jarkko)
> >> - Drop support for cgroup v1 and simplify. (Michal, Jarkko)
> >> - Add documentation for functions. (Jarkko)
> >> - Turn off swapping before memcontrol tests and back on after
> >> - Format and style fixes, name for hard coded values
> >>
> >> V7:
> >> - Added memcontrol test.
> >>
> >> V5:
> >> - Added script with automatic results checking, remove the interactive
> >> script.
> >> - The script can run independent from the series below.
> >> ---
> >>  tools/testing/selftests/sgx/ash_cgexec.sh |  16 +
> >>  .../selftests/sgx/run_epc_cg_selftests.sh | 275 ++
> >>  .../selftests/sgx/watch_misc_for_tests.sh |  11 +
> >>  3 files changed, 302 insertions(+)
> >>  create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
> >>  create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >>  create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
> >>
> >> diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh  
> >> b/tools/testing/selftests/sgx/ash_cgexec.sh
> >> new file mode 100755
> >> index ..cfa5d2b0e795
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/sgx/ash_cgexec.sh
> >> @@ -0,0 +1,16 @@
> >> +#!/usr/bin/env sh
> >> +# SPDX-License-Identifier: GPL-2.0
> >> +# Copyright(c) 2024 Intel Corporation.
> >> +
> >> +# Start a program in a given cgroup.
> >> +# Supports V2 cgroup paths, relative to /sys/fs/cgroup
> >> +if [ "$#" -lt 2 ]; then
> >> +echo "Usage: $0   [args...]"
> >> +exit 1
> >> +fi
> >> +# Move this shell to the cgroup.
> >> +echo 0 >/sys/fs/cgroup/$1/cgroup.procs
> >> +shift
> >> +# Execute the command within the cgroup
> >> +exec "$@"
> >> +
> >> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh  
> >> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> new file mode 100755
> >> index ..dd56273056fc
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> @@ -0,0 +1,275 @@
> >> +#!/usr/bin/env sh
> >> +# SPDX-License-Identifier: GPL-2.0
> >> +# Copyright(c) 2023, 2024 Intel Corporation.
> >> +
> >> +TEST_ROOT_CG=selftest
>

Re: [PATCH v11 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-14 Thread Jarkko Sakkinen

On Wed Apr 10, 2024 at 9:25 PM EEST, Haitao Huang wrote:
> To run selftests for EPC cgroup:
>
> sudo ./run_epc_cg_selftests.sh
>
> To watch misc cgroup 'current' changes during testing, run this in a
> separate terminal:
>
> ./watch_misc_for_tests.sh current
>
> With different cgroups, the script starts one or multiple concurrent SGX
> selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
> test case, which loads an enclave of EPC size equal to the EPC capacity
> available on the platform. The script checks results against the
> expectation set for each cgroup and reports success or failure.
>
> The script creates 3 different cgroups at the beginning with following
> expectations:
>
> 1) SMALL - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> more than 4 concurrent tests are run. The script starts 4 expecting at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) LARGER - limit is the same as the capacity, large enough to run lots of
> concurrent tests. The script starts 8 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all processes exit.
>
> The script also includes a test with low mem_cg limit and LARGE sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is charged
> to a proper mem_cg. For this test, it turns off swapping before start,
> and turns swapping back on afterwards.
>
> Signed-off-by: Haitao Huang 
> ---
> V11:
> - Remove cgroups-tools dependency and make scripts ash compatible. (Jarkko)
> - Drop support for cgroup v1 and simplify. (Michal, Jarkko)
> - Add documentation for functions. (Jarkko)
> - Turn off swapping before memcontrol tests and back on after
> - Format and style fixes, name for hard coded values
>
> V7:
> - Added memcontrol test.
>
> V5:
> - Added script with automatic results checking, remove the interactive
> script.
> - The script can run independent from the series below.
> ---
>  tools/testing/selftests/sgx/ash_cgexec.sh |  16 +
>  .../selftests/sgx/run_epc_cg_selftests.sh | 275 ++
>  .../selftests/sgx/watch_misc_for_tests.sh |  11 +
>  3 files changed, 302 insertions(+)
>  create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
>  create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>  create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
>
> diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh 
> b/tools/testing/selftests/sgx/ash_cgexec.sh
> new file mode 100755
> index ..cfa5d2b0e795
> --- /dev/null
> +++ b/tools/testing/selftests/sgx/ash_cgexec.sh
> @@ -0,0 +1,16 @@
> +#!/usr/bin/env sh
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright(c) 2024 Intel Corporation.
> +
> +# Start a program in a given cgroup.
> +# Supports V2 cgroup paths, relative to /sys/fs/cgroup
> +if [ "$#" -lt 2 ]; then
> +echo "Usage: $0   [args...]"
> +exit 1
> +fi
> +# Move this shell to the cgroup.
> +echo 0 >/sys/fs/cgroup/$1/cgroup.procs
> +shift
> +# Execute the command within the cgroup
> +exec "$@"
> +
> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh 
> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> new file mode 100755
> index ..dd56273056fc
> --- /dev/null
> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> @@ -0,0 +1,275 @@
> +#!/usr/bin/env sh
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright(c) 2023, 2024 Intel Corporation.
> +
> +TEST_ROOT_CG=selftest
> +TEST_CG_SUB1=$TEST_ROOT_CG/test1
> +TEST_CG_SUB2=$TEST_ROOT_CG/test2
> +# We will only set limit in test1 and run tests in test3
> +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
> +TEST_CG_SUB4=$TEST_ROOT_CG/test4
> +
> +# Cgroup v2 only
> +CG_ROOT=/sys/fs/cgroup
> +mkdir -p $CG_ROOT/$TEST_CG_SUB1
> +mkdir -p $CG_ROOT/$TEST_CG_SUB2
> +mkdir -p $CG_ROOT/$TEST_CG_SUB3
> +mkdir -p $CG_ROOT/$TEST_CG_SUB4
> +
> +# Turn on misc and memory controller in non-leaf nodes
> +echo "+misc" >  $CG_ROOT/cgroup.subtree_control && \
> +echo "+memory" > $CG_ROOT/cgroup.subtree_control && \
> +echo "+misc" >  $CG_ROOT/$TEST_ROOT_CG/cgroup.subtree_control && \
> +echo "+memory" > $CG_ROOT/$TEST_ROOT_CG/cgroup.subtree_control && \
> +echo "+misc" >  $CG_ROOT/$TEST_CG_SUB1/cgroup.subtree_control
> +if [ $? -ne 0 ]; then
> +echo "# Failed setting up cgroups, make sure misc and memory cgroups are 
> enabled."
> +exit 1
> +fi
> +
> +CAPACITY=$(grep "sgx_epc" "$CG_ROOT/misc.capacity" | awk '{print $2}')
> +# This is below number of VA pages needed for enclave of capacity size. So
> +# should fail oversubscribed cases
> +SMALL=$(( CAPACITY / 512 ))
> +
> +# At least load one enclave of capacity size successfully, maybe up to 4.
> +# But some may fail if we run more than 4 concurrent enclaves of capacity 
> size.
> +LARGE=$(( SMALL * 4 ))
> +
>

Re: [PATCH v11 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-13 Thread Jarkko Sakkinen

On Wed Apr 10, 2024 at 9:25 PM EEST, Haitao Huang wrote:
> To run selftests for EPC cgroup:
>
> sudo ./run_epc_cg_selftests.sh
>
> To watch misc cgroup 'current' changes during testing, run this in a
> separate terminal:
>
> ./watch_misc_for_tests.sh current
>
> With different cgroups, the script starts one or multiple concurrent SGX
> selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
> test case, which loads an enclave of EPC size equal to the EPC capacity
> available on the platform. The script checks results against the
> expectation set for each cgroup and reports success or failure.
>
> The script creates 3 different cgroups at the beginning with following
> expectations:
>
> 1) SMALL - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> more than 4 concurrent tests are run. The script starts 4 expecting at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) LARGER - limit is the same as the capacity, large enough to run lots of
> concurrent tests. The script starts 8 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all processes exit.
>
> The script also includes a test with low mem_cg limit and LARGE sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is charged
> to a proper mem_cg. For this test, it turns off swapping before start,
> and turns swapping back on afterwards.
>
> Signed-off-by: Haitao Huang 
> ---
> V11:
> - Remove cgroups-tools dependency and make scripts ash compatible. (Jarkko)
> - Drop support for cgroup v1 and simplify. (Michal, Jarkko)
> - Add documentation for functions. (Jarkko)
> - Turn off swapping before memcontrol tests and back on after
> - Format and style fixes, name for hard coded values
>
> V7:
> - Added memcontrol test.
>
> V5:
> - Added script with automatic results checking, remove the interactive
> script.
> - The script can run independent from the series below.
> ---
>  tools/testing/selftests/sgx/ash_cgexec.sh |  16 +
>  .../selftests/sgx/run_epc_cg_selftests.sh | 275 ++
>  .../selftests/sgx/watch_misc_for_tests.sh |  11 +
>  3 files changed, 302 insertions(+)
>  create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
>  create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>  create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
>
> diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh 
> b/tools/testing/selftests/sgx/ash_cgexec.sh
> new file mode 100755
> index ..cfa5d2b0e795
> --- /dev/null
> +++ b/tools/testing/selftests/sgx/ash_cgexec.sh
> @@ -0,0 +1,16 @@
> +#!/usr/bin/env sh
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright(c) 2024 Intel Corporation.
> +
> +# Start a program in a given cgroup.
> +# Supports V2 cgroup paths, relative to /sys/fs/cgroup
> +if [ "$#" -lt 2 ]; then
> +echo "Usage: $0   [args...]"
> +exit 1
> +fi
> +# Move this shell to the cgroup.
> +echo 0 >/sys/fs/cgroup/$1/cgroup.procs
> +shift
> +# Execute the command within the cgroup
> +exec "$@"
> +
> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh 
> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> new file mode 100755
> index ..dd56273056fc
> --- /dev/null
> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> @@ -0,0 +1,275 @@
> +#!/usr/bin/env sh
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright(c) 2023, 2024 Intel Corporation.
> +
> +TEST_ROOT_CG=selftest
> +TEST_CG_SUB1=$TEST_ROOT_CG/test1
> +TEST_CG_SUB2=$TEST_ROOT_CG/test2
> +# We will only set limit in test1 and run tests in test3
> +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
> +TEST_CG_SUB4=$TEST_ROOT_CG/test4
> +
> +# Cgroup v2 only
> +CG_ROOT=/sys/fs/cgroup
> +mkdir -p $CG_ROOT/$TEST_CG_SUB1
> +mkdir -p $CG_ROOT/$TEST_CG_SUB2
> +mkdir -p $CG_ROOT/$TEST_CG_SUB3
> +mkdir -p $CG_ROOT/$TEST_CG_SUB4
> +
> +# Turn on misc and memory controller in non-leaf nodes
> +echo "+misc" >  $CG_ROOT/cgroup.subtree_control && \
> +echo "+memory" > $CG_ROOT/cgroup.subtree_control && \
> +echo "+misc" >  $CG_ROOT/$TEST_ROOT_CG/cgroup.subtree_control && \
> +echo "+memory" > $CG_ROOT/$TEST_ROOT_CG/cgroup.subtree_control && \
> +echo "+misc" >  $CG_ROOT/$TEST_CG_SUB1/cgroup.subtree_control
> +if [ $? -ne 0 ]; then
> +echo "# Failed setting up cgroups, make sure misc and memory cgroups are 
> enabled."
> +exit 1
> +fi
> +
> +CAPACITY=$(grep "sgx_epc" "$CG_ROOT/misc.capacity" | awk '{print $2}')
> +# This is below number of VA pages needed for enclave of capacity size. So
> +# should fail oversubscribed cases
> +SMALL=$(( CAPACITY / 512 ))
> +
> +# At least load one enclave of capacity size successfully, maybe up to 4.
> +# But some may fail if we run more than 4 concurrent enclaves of capacity 
> size.
> +LARGE=$(( SMALL * 4 ))
> +
>

Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-04-13 Thread Jarkko Sakkinen

On Fri Apr 5, 2024 at 6:07 AM EEST, Huang, Kai wrote:
> On Thu, 2024-04-04 at 12:05 -0500, Haitao Huang wrote:
> > > > -static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
> > > > +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg,  
> > > > enum sgx_reclaim r)
> > > 
> > > Is the @r here intentional for shorter typing?
> > > 
> > 
> > yes :-)
> > Will speel out to make it consistent if that's the concern.
>
> I kinda prefer the full name to match the CONFIG_CGROUP_SGX_EPC on case.  You
> can put the 'enum sgx_reclaim reclaim' parameter into the new line if needed:

Why don't cgroups for SGX get always enabled when SGX and
cgroups support are enabled?

BR, Jarkko

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-03 Thread Jarkko Sakkinen

On Tue Apr 2, 2024 at 8:40 PM EEST, Michal Koutný wrote:
> On Tue, Apr 02, 2024 at 11:20:21AM -0500, Haitao Huang 
>  wrote:
> > Do we really want to have it implemented in c?
>
> I only pointed to the available C boilerplate.
>
> > There are much fewer lines of
> > code in shell scripts. Note we are not really testing basic cgroup stuff.
> > All we needed were creating/deleting cgroups and set limits which I think
> > have been demonstrated feasible in the ash scripts now.
>
> I assume you refer to
> Message-Id: <20240331174442.51019-1-haitao.hu...@linux.intel.com>
> right?
>
> Could it be even simpler if you didn't stick to cgtools APIs and v1
> compatibility?
>
> Reducing ash_cgexec.sh to something like
>   echo 0 >$R/$1/cgroup.procs
>   shift
>   exec "$@"
> (with some small builerplate for $R and previous mkdirs)

I already asked about necessity of v1 in some response, and fully
support this idea. Then cgexec can simply be a function wrapping
along the lines what you proposed.

BR, Jarkko

Re: [PATCH v2] selftests/sgx: Improve cgroup test scripts

2024-04-03 Thread Jarkko Sakkinen

On Tue Apr 2, 2024 at 8:31 PM EEST, Haitao Huang wrote:
> On Tue, 02 Apr 2024 02:43:25 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Tue Apr 2, 2024 at 4:42 AM EEST, Haitao Huang wrote:
> >> Make cgroup test scripts ash compatible.
> >> Remove cg-tools dependency.
> >> Add documentation for functions.
> >>
> >> Tested with busybox on Ubuntu.
> >>
> >> Signed-off-by: Haitao Huang 
> >> ---
> >> v2:
> >> - Fixes for v2 cgroup
> >> - Turn off swapping before memcontrol tests and back on after
> >> - Add comments and reformat
> >> ---
> >>  tools/testing/selftests/sgx/ash_cgexec.sh |  57 ++
> >>  .../selftests/sgx/run_epc_cg_selftests.sh | 187 +++---
> >>  .../selftests/sgx/watch_misc_for_tests.sh |  13 +-
> >>  3 files changed, 179 insertions(+), 78 deletions(-)
> >>  create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
> >>
> >> diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh  
> >> b/tools/testing/selftests/sgx/ash_cgexec.sh
> >> new file mode 100755
> >> index ..9607784378df
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/sgx/ash_cgexec.sh
> >> @@ -0,0 +1,57 @@
> >> +#!/usr/bin/env sh
> >> +# SPDX-License-Identifier: GPL-2.0
> >> +# Copyright(c) 2024 Intel Corporation.
> >> +
> >> +# Move the current shell process to the specified cgroup
> >> +# Arguments:
> >> +# $1 - The cgroup controller name, e.g., misc, memory.
> >> +# $2 - The path of the cgroup,
> >> +# relative to /sys/fs/cgroup for cgroup v2,
> >> +# relative to /sys/fs/cgroup/$1 for v1.
> >> +move_to_cgroup() {
> >> +controllers="$1"
> >> +path="$2"
> >> +
> >> +# Check if cgroup v2 is in use
> >> +if [ ! -d "/sys/fs/cgroup/misc" ]; then
> >> +# Cgroup v2 logic
> >> +cgroup_full_path="/sys/fs/cgroup/${path}"
> >> +echo $$ > "${cgroup_full_path}/cgroup.procs"
> >> +else
> >> +# Cgroup v1 logic
> >> +OLD_IFS="$IFS"
> >> +IFS=','
> >> +for controller in $controllers; do
> >> +cgroup_full_path="/sys/fs/cgroup/${controller}/${path}"
> >> +echo $$ > "${cgroup_full_path}/tasks"
> >> +done
> >> +IFS="$OLD_IFS"
> >> +fi
> >
> > I think that if you could point me to git v10 and all this I could
> > then quite easily create test image and see what I get from that.
> >
> > I will code review the whole thing but this is definitely good
> > enough to start testing this series properly! Thanks for the
> > effort with this. The payback from this comes after the feature
> > is mainline. We have now sort of reference of the usage patterns
> > and less layers when we need to debug any possible (likely) bugs
> > in the future.
> >
> > This is definitely to the right direction. I'm just wondering do
> > we want to support v1 cgroups or would it make sense support only
> > v2?
> > BR, Jarkko
> >
> I can drop v1. I think most distro now support v2.
> Created this branch to host these changes so far:  
> https://github.com/haitaohuang/linux/tree/sgx_cg_upstream_v10_plus

Thanks! 

I'll point my build to that, make a test image and report the
results.

BR, Jarkko

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-03 Thread Jarkko Sakkinen

On Tue Apr 2, 2024 at 7:20 PM EEST, Haitao Huang wrote:
> On Tue, 02 Apr 2024 06:58:40 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Tue Apr 2, 2024 at 2:23 PM EEST, Michal Koutný wrote:
> >> Hello.
> >>
> >> On Sat, Mar 30, 2024 at 01:26:08PM +0200, Jarkko Sakkinen  
> >>  wrote:
> >> > > > It'd be more complicated and less readable to do all the stuff  
> >> without the
> >> > > > cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only  
> >> depends
> >> > > > on libc so I hope this would not cause too much inconvenience.
> >> > >
> >> > > As per cgroup-tools, please prove this. It makes the job for more
> >> > > complicated *for you* and you are making the job more  complicated
> >> > > to every possible person in the planet running any kernel QA.
> >> > >
> >> > > I weight the latter more than the former. And it is exactly the
> >> > > reason why we did custom user space kselftest in the first place.
> >> > > Let's keep the tradition. All I can say is that kselftest is
> >> > > unfinished in its current form.
> >> > >
> >> > > What is "esp cgexec"?
> >> >
> >> > Also in kselftest we don't drive ultimate simplicity, we drive
> >> > efficient CI/QA. By open coding something like subset of
> >> > cgroup-tools needed to run the test you also help us later
> >> > on to backtrack the kernel changes. With cgroups-tools you
> >> > would have to use strace to get the same info.
> >>
> >> FWIW, see also functions in
> >> tools/testing/selftests/cgroup/cgroup_util.{h,c}.
> >> They likely cover what you need already -- if the tests are in C.
> >>
> >> (I admit that stuff in tools/testing/selftests/cgroup/ is best
> >> understood with strace.)
> >
> > Thanks!
> >
> > My conclusions are that:
> >
> > 1. We probably cannot move the test part of cgroup test itself
> >given the enclave payload dependency.
> > 2. I think it makes sense to still follow the same pattern as
> >other cgroups test and re-use cgroup_util.[ch] functionaltiy.
> >
> > So yeah I guess we need two test programs instead of one.
> >
> > Something along the lines:
> >
> > 1. main.[ch] -> test_sgx.[ch]
> > 2. introduce test_sgx_cgroup.c
> >
> > And test_sgx_cgroup.c would be implement similar test as the shell
> > script and would follow the structure of existing cgroups tests.
> >
> >>
> >> HTH,
> >> Michal
> >
> > BR, Jarkko
> >
> Do we really want to have it implemented in c? There are much fewer lines  
> of code in shell scripts. Note we are not really testing basic cgroup  
> stuff. All we needed were creating/deleting cgroups and set limits which I  
> think have been demonstrated feasible in the ash scripts now.
>
> Given Dave's comments, and test scripts being working and cover the cases  
> needed IMHO, I don't see much need to move to c code. I can add more cases  
> if needed and fall back a c implementation later  if any case can't be  
> implemented in scripts. How about that?

We can settle to: ash + no dependencies. I guess you have for that
all the work done already.

BR, Jarkko

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-03 Thread Jarkko Sakkinen

On Tue Apr 2, 2024 at 6:42 PM EEST, Dave Hansen wrote:
> On 3/30/24 04:23, Jarkko Sakkinen wrote:
> >>> I also wonder is cgroup-tools dependency absolutely required or could
> >>> you just have a function that would interact with sysfs?
> >> I should have checked email before hit the send button for v10 .
> >>
> >> It'd be more complicated and less readable to do all the stuff without the 
> >>  
> >> cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only depends  
> >> on libc so I hope this would not cause too much inconvenience.
> > As per cgroup-tools, please prove this. It makes the job for more
> > complicated *for you* and you are making the job more  complicated
> > to every possible person in the planet running any kernel QA.
>
> I don't see any other use of cgroup-tools in testing/selftests.
>
> I *DO* see a ton of /bin/bash use though.  I wouldn't go to much trouble
> to make the thing ash-compatible.
>
> That said, the most important thing is to get some selftests in place.
> If using cgroup-tools means we get actual, runnable tests in place,
> that's a heck of a lot more important than making them perfect.
> Remember, almost nobody uses SGX.  It's available on *VERY* few systems
> from one CPU vendor and only in very specific hardware configurations.

Ash-compatible is good enough for me, so let's draw the line there.

Ash-compatibility does not cause any major hurdle as can we seen from
Haitao's patch. Earlier version was not even POSIX-compatible, given
that it used hard-coded path.

Most of the added stuff come open coding the tools but in the test
code that is not the big deal, and helps with debugging in the future.
Even right now it helps reviewing kernel patches because it documents
exactly how the feature is seen from user space.

BR, Jarkko

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-02 Thread Jarkko Sakkinen

On Tue Apr 2, 2024 at 2:23 PM EEST, Michal Koutný wrote:
> Hello.
>
> On Sat, Mar 30, 2024 at 01:26:08PM +0200, Jarkko Sakkinen  
> wrote:
> > > > It'd be more complicated and less readable to do all the stuff without 
> > > > the  
> > > > cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only 
> > > > depends  
> > > > on libc so I hope this would not cause too much inconvenience.
> > >
> > > As per cgroup-tools, please prove this. It makes the job for more
> > > complicated *for you* and you are making the job more  complicated
> > > to every possible person in the planet running any kernel QA.
> > >
> > > I weight the latter more than the former. And it is exactly the
> > > reason why we did custom user space kselftest in the first place.
> > > Let's keep the tradition. All I can say is that kselftest is 
> > > unfinished in its current form.
> > >
> > > What is "esp cgexec"?
> > 
> > Also in kselftest we don't drive ultimate simplicity, we drive
> > efficient CI/QA. By open coding something like subset of
> > cgroup-tools needed to run the test you also help us later
> > on to backtrack the kernel changes. With cgroups-tools you
> > would have to use strace to get the same info.
>
> FWIW, see also functions in
> tools/testing/selftests/cgroup/cgroup_util.{h,c}.
> They likely cover what you need already -- if the tests are in C.
>
> (I admit that stuff in tools/testing/selftests/cgroup/ is best
> understood with strace.)

Thanks!

My conclusions are that:

1. We probably cannot move the test part of cgroup test itself
   given the enclave payload dependency.
2. I think it makes sense to still follow the same pattern as
   other cgroups test and re-use cgroup_util.[ch] functionaltiy.

So yeah I guess we need two test programs instead of one.

Something along the lines:

1. main.[ch] -> test_sgx.[ch]
2. introduce test_sgx_cgroup.c

And test_sgx_cgroup.c would be implement similar test as the shell
script and would follow the structure of existing cgroups tests.

>
> HTH,
> Michal

BR, Jarkko

Re: [PATCH v2] selftests/sgx: Improve cgroup test scripts

2024-04-02 Thread Jarkko Sakkinen

On Tue Apr 2, 2024 at 4:42 AM EEST, Haitao Huang wrote:
> Make cgroup test scripts ash compatible.
> Remove cg-tools dependency.
> Add documentation for functions.
>
> Tested with busybox on Ubuntu.
>
> Signed-off-by: Haitao Huang 
> ---
> v2:
> - Fixes for v2 cgroup
> - Turn off swapping before memcontrol tests and back on after
> - Add comments and reformat
> ---
>  tools/testing/selftests/sgx/ash_cgexec.sh |  57 ++
>  .../selftests/sgx/run_epc_cg_selftests.sh | 187 +++---
>  .../selftests/sgx/watch_misc_for_tests.sh |  13 +-
>  3 files changed, 179 insertions(+), 78 deletions(-)
>  create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
>
> diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh 
> b/tools/testing/selftests/sgx/ash_cgexec.sh
> new file mode 100755
> index ..9607784378df
> --- /dev/null
> +++ b/tools/testing/selftests/sgx/ash_cgexec.sh
> @@ -0,0 +1,57 @@
> +#!/usr/bin/env sh
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright(c) 2024 Intel Corporation.
> +
> +# Move the current shell process to the specified cgroup
> +# Arguments:
> +#$1 - The cgroup controller name, e.g., misc, memory.
> +#$2 - The path of the cgroup,
> +#relative to /sys/fs/cgroup for cgroup v2,
> +#relative to /sys/fs/cgroup/$1 for v1.
> +move_to_cgroup() {
> +controllers="$1"
> +path="$2"
> +
> +# Check if cgroup v2 is in use
> +if [ ! -d "/sys/fs/cgroup/misc" ]; then
> +# Cgroup v2 logic
> +cgroup_full_path="/sys/fs/cgroup/${path}"
> +echo $$ > "${cgroup_full_path}/cgroup.procs"
> +else
> +# Cgroup v1 logic
> +OLD_IFS="$IFS"
> +IFS=','
> +for controller in $controllers; do
> +cgroup_full_path="/sys/fs/cgroup/${controller}/${path}"
> +echo $$ > "${cgroup_full_path}/tasks"
> +done
> +IFS="$OLD_IFS"
> +fi

I think that if you could point me to git v10 and all this I could
then quite easily create test image and see what I get from that.

I will code review the whole thing but this is definitely good
enough to start testing this series properly! Thanks for the
effort with this. The payback from this comes after the feature
is mainline. We have now sort of reference of the usage patterns
and less layers when we need to debug any possible (likely) bugs
in the future.

This is definitely to the right direction. I'm just wondering do
we want to support v1 cgroups or would it make sense support only
v2?
 
BR, Jarkko

Re: [PATCH] selftests/sgx: Improve cgroup test scripts

2024-04-02 Thread Jarkko Sakkinen

On Tue Apr 2, 2024 at 1:55 AM EEST, Haitao Huang wrote:
> On Mon, 01 Apr 2024 09:22:21 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Sun Mar 31, 2024 at 8:44 PM EEST, Haitao Huang wrote:
> >> Make cgroup test scripts ash compatible.
> >> Remove cg-tools dependency.
> >> Add documentation for functions.
> >>
> >> Tested with busybox on Ubuntu.
> >>
> >> Signed-off-by: Haitao Huang 
> >
> > I'll run this next week on good old NUC7. Thank you.
> >
> > I really wish that either (hopefully both) Intel or AMD would bring up
> > for developers home use meant platform to develop on TDX and SNP. It is
> > a shame that the latest and greatest is from 2018.
> >
> > BR, Jarkko
> >
>
> Argh, missed a few changes for v2 cgroup:
>
> --- a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> @@ -15,6 +15,8 @@ CG_MEM_ROOT=/sys/fs/cgroup
>   CG_V1=0
>   if [ ! -d "/sys/fs/cgroup/misc" ]; then
>   echo "# cgroup V2 is in use."
> +echo "+misc" >  $CG_MISC_ROOT/cgroup.subtree_control
> +echo "+memory" > $CG_MEM_ROOT/cgroup.subtree_control
>   else
>   echo "# cgroup V1 is in use."
>   CG_MISC_ROOT=/sys/fs/cgroup/misc
> @@ -26,6 +28,11 @@ mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB2
>   mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB3
>   mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB4
>
> +if [ $CG_V1 -eq 0 ]; then
> +echo "+misc" >  $CG_MISC_ROOT/$TEST_ROOT_CG/cgroup.subtree_control
> +echo "+misc" >  $CG_MISC_ROOT/$TEST_CG_SUB1/cgroup.subtree_control
> +fi

Maybe it would be most convenient just to +1 the kselftest patch?

Alternatively you could point out to a Git branch with the series
and the updated patch.

BR, Jarkko

Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-01 Thread Jarkko Sakkinen

On Mon Apr 1, 2024 at 12:29 PM EEST, Huang, Kai wrote:
> On Sat, 2024-03-30 at 13:17 +0200, Jarkko Sakkinen wrote:
> > On Thu Mar 28, 2024 at 2:53 PM EET, Huang, Kai wrote:
> > > 
> > > > --- /dev/null
> > > > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> > > > @@ -0,0 +1,74 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +// Copyright(c) 2022 Intel Corporation.
> > > 
> > > It's 2024 now.
> > > 
> > > And looks you need to use C style comment for /* Copyright ... */, after 
> > > looking
> > > at some other C files.
> > 
> > To be fair, this happens *all the time* to everyone :-)
> > 
> > I've proposed this few times in SGX context and going to say it now.
> > Given the nature of Git copyrights would anyway need to be sorted by
> > the Git log not possibly incorrect copyright platters in the header
> > and source files.
> > 
>
> Sure fine to me either way.  Thanks for pointing out.
>
> I have some vague memory that we should update the year but I guess I was 
> wrong.

I think updating year makes sense!

I'd be fine not having copyright platter at all since the commit is from
Intel domain anyway but if it is kept then the year needs to be
corrected.

I mean Git commit stores all the data, including exact date.

BR, Jarkko

Re: [PATCH] selftests/sgx: Improve cgroup test scripts

2024-04-01 Thread Jarkko Sakkinen

On Sun Mar 31, 2024 at 8:44 PM EEST, Haitao Huang wrote:
> Make cgroup test scripts ash compatible.
> Remove cg-tools dependency.
> Add documentation for functions.
>
> Tested with busybox on Ubuntu.
>
> Signed-off-by: Haitao Huang 

I'll run this next week on good old NUC7. Thank you.

I really wish that either (hopefully both) Intel or AMD would bring up
for developers home use meant platform to develop on TDX and SNP. It is
a shame that the latest and greatest is from 2018.

BR, Jarkko

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-01 Thread Jarkko Sakkinen

On Sun Mar 31, 2024 at 8:35 PM EEST, Haitao Huang wrote:
> On Sun, 31 Mar 2024 11:19:04 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Sat Mar 30, 2024 at 5:32 PM EET, Haitao Huang wrote:
> >> On Sat, 30 Mar 2024 06:15:14 -0500, Jarkko Sakkinen 
> >> wrote:
> >>
> >> > On Thu Mar 28, 2024 at 5:54 AM EET, Haitao Huang wrote:
> >> >> On Wed, 27 Mar 2024 07:55:34 -0500, Jarkko Sakkinen  
> >> 
> >> >> wrote:
> >> >>
> >> >> > On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> >> >> >> The scripts rely on cgroup-tools package from libcgroup [1].
> >> >> >>
> >> >> >> To run selftests for epc cgroup:
> >> >> >>
> >> >> >> sudo ./run_epc_cg_selftests.sh
> >> >> >>
> >> >> >> To watch misc cgroup 'current' changes during testing, run this  
> >> in a
> >> >> >> separate terminal:
> >> >> >>
> >> >> >> ./watch_misc_for_tests.sh current
> >> >> >>
> >> >> >> With different cgroups, the script starts one or multiple  
> >> concurrent
> >> >> >> SGX
> >> >> >> selftests, each to run one unclobbered_vdso_oversubscribed  
> >> test.Each
> >> >> >> of such test tries to load an enclave of EPC size equal to the EPC
> >> >> >> capacity available on the platform. The script checks results  
> >> against
> >> >> >> the expectation set for each cgroup and reports success or  
> >> failure.
> >> >> >>
> >> >> >> The script creates 3 different cgroups at the beginning with
> >> >> >> following
> >> >> >> expectations:
> >> >> >>
> >> >> >> 1) SMALL - intentionally small enough to fail the test loading an
> >> >> >> enclave of size equal to the capacity.
> >> >> >> 2) LARGE - large enough to run up to 4 concurrent tests but fail  
> >> some
> >> >> >> if
> >> >> >> more than 4 concurrent tests are run. The script starts 4  
> >> expecting
> >> >> >> at
> >> >> >> least one test to pass, and then starts 5 expecting at least one  
> >> test
> >> >> >> to fail.
> >> >> >> 3) LARGER - limit is the same as the capacity, large enough to run
> >> >> >> lots of
> >> >> >> concurrent tests. The script starts 8 of them and expects all  
> >> pass.
> >> >> >> Then it reruns the same test with one process randomly killed and
> >> >> >> usage checked to be zero after all process exit.
> >> >> >>
> >> >> >> The script also includes a test with low mem_cg limit and LARGE
> >> >> >> sgx_epc
> >> >> >> limit to verify that the RAM used for per-cgroup reclamation is
> >> >> >> charged
> >> >> >> to a proper mem_cg.
> >> >> >>
> >> >> >> [1] https://github.com/libcgroup/libcgroup/blob/main/README
> >> >> >>
> >> >> >> Signed-off-by: Haitao Huang 
> >> >> >> ---
> >> >> >> V7:
> >> >> >> - Added memcontrol test.
> >> >> >>
> >> >> >> V5:
> >> >> >> - Added script with automatic results checking, remove the
> >> >> >> interactive
> >> >> >> script.
> >> >> >> - The script can run independent from the series below.
> >> >> >> ---
> >> >> >>  .../selftests/sgx/run_epc_cg_selftests.sh | 246
> >> >> >> ++
> >> >> >>  .../selftests/sgx/watch_misc_for_tests.sh |  13 +
> >> >> >>  2 files changed, 259 insertions(+)
> >> >> >>  create mode 100755
> >> >> >> tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> >> >>  create mode 100755
> >> >> >> tools/testing/selftests/sgx/watch_misc_for_tests.sh
> >> >> >>
> >> >> >> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> >> >> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-03-31 Thread Jarkko Sakkinen

On Sat Mar 30, 2024 at 5:32 PM EET, Haitao Huang wrote:
> On Sat, 30 Mar 2024 06:15:14 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Thu Mar 28, 2024 at 5:54 AM EET, Haitao Huang wrote:
> >> On Wed, 27 Mar 2024 07:55:34 -0500, Jarkko Sakkinen 
> >> wrote:
> >>
> >> > On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> >> >> The scripts rely on cgroup-tools package from libcgroup [1].
> >> >>
> >> >> To run selftests for epc cgroup:
> >> >>
> >> >> sudo ./run_epc_cg_selftests.sh
> >> >>
> >> >> To watch misc cgroup 'current' changes during testing, run this in a
> >> >> separate terminal:
> >> >>
> >> >> ./watch_misc_for_tests.sh current
> >> >>
> >> >> With different cgroups, the script starts one or multiple concurrent
> >> >> SGX
> >> >> selftests, each to run one unclobbered_vdso_oversubscribed test.Each
> >> >> of such test tries to load an enclave of EPC size equal to the EPC
> >> >> capacity available on the platform. The script checks results against
> >> >> the expectation set for each cgroup and reports success or failure.
> >> >>
> >> >> The script creates 3 different cgroups at the beginning with
> >> >> following
> >> >> expectations:
> >> >>
> >> >> 1) SMALL - intentionally small enough to fail the test loading an
> >> >> enclave of size equal to the capacity.
> >> >> 2) LARGE - large enough to run up to 4 concurrent tests but fail some
> >> >> if
> >> >> more than 4 concurrent tests are run. The script starts 4 expecting
> >> >> at
> >> >> least one test to pass, and then starts 5 expecting at least one test
> >> >> to fail.
> >> >> 3) LARGER - limit is the same as the capacity, large enough to run
> >> >> lots of
> >> >> concurrent tests. The script starts 8 of them and expects all pass.
> >> >> Then it reruns the same test with one process randomly killed and
> >> >> usage checked to be zero after all process exit.
> >> >>
> >> >> The script also includes a test with low mem_cg limit and LARGE
> >> >> sgx_epc
> >> >> limit to verify that the RAM used for per-cgroup reclamation is
> >> >> charged
> >> >> to a proper mem_cg.
> >> >>
> >> >> [1] https://github.com/libcgroup/libcgroup/blob/main/README
> >> >>
> >> >> Signed-off-by: Haitao Huang 
> >> >> ---
> >> >> V7:
> >> >> - Added memcontrol test.
> >> >>
> >> >> V5:
> >> >> - Added script with automatic results checking, remove the
> >> >> interactive
> >> >> script.
> >> >> - The script can run independent from the series below.
> >> >> ---
> >> >>  .../selftests/sgx/run_epc_cg_selftests.sh | 246
> >> >> ++
> >> >>  .../selftests/sgx/watch_misc_for_tests.sh |  13 +
> >> >>  2 files changed, 259 insertions(+)
> >> >>  create mode 100755
> >> >> tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> >>  create mode 100755
> >> >> tools/testing/selftests/sgx/watch_misc_for_tests.sh
> >> >>
> >> >> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> >> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> >> new file mode 100755
> >> >> index ..e027bf39f005
> >> >> --- /dev/null
> >> >> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> >> @@ -0,0 +1,246 @@
> >> >> +#!/bin/bash
> >> >
> >> > This is not portable and neither does hold in the wild.
> >> >
> >> > It does not even often hold as it is not uncommon to place bash
> >> > to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
> >> > a path that is neither of those two.
> >> >
> >> > Should be #!/usr/bin/env bash
> >> >
> >> > That is POSIX compatible form.
> >> >
> >>
> >> Sure
> >>
> >> > Just got around trying to test this in NUC7 so looking into this in
> >> > more detail.
> >>
> >

Re: [PATCH v10 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-03-30 Thread Jarkko Sakkinen

On Thu Mar 28, 2024 at 2:22 AM EET, Haitao Huang wrote:
> The scripts rely on cgroup-tools package from libcgroup [1].
>
> To run selftests for epc cgroup:
>
> sudo ./run_epc_cg_selftests.sh
>
> To watch misc cgroup 'current' changes during testing, run this in a
> separate terminal:
>
> ./watch_misc_for_tests.sh current
>
> With different cgroups, the script starts one or multiple concurrent SGX
> selftests, each to run one unclobbered_vdso_oversubscribed test.  Each
> of such test tries to load an enclave of EPC size equal to the EPC
> capacity available on the platform. The script checks results against
> the expectation set for each cgroup and reports success or failure.
>
> The script creates 3 different cgroups at the beginning with following
> expectations:
>
> 1) SMALL - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> more than 4 concurrent tests are run. The script starts 4 expecting at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) LARGER - limit is the same as the capacity, large enough to run lots of
> concurrent tests. The script starts 8 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all process exit.
>
> The script also includes a test with low mem_cg limit and LARGE sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is charged
> to a proper mem_cg.
>
> [1] https://github.com/libcgroup/libcgroup/blob/main/README
>
> Signed-off-by: Haitao Huang 

My previous comments and you have two undocumented dependencies for your
selftest (I searched for cgexec and cgroups-tools as keywords).

BR, Jarkko

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-03-30 Thread Jarkko Sakkinen

On Sat Mar 30, 2024 at 1:23 PM EET, Jarkko Sakkinen wrote:
> On Thu Mar 28, 2024 at 2:57 AM EET, Haitao Huang wrote:
> > On Wed, 27 Mar 2024 11:56:35 -0500, Jarkko Sakkinen   
> > wrote:
> >
> > > On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote:
> > >> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> > >> > The scripts rely on cgroup-tools package from libcgroup [1].
> > >> >
> > >> > To run selftests for epc cgroup:
> > >> >
> > >> > sudo ./run_epc_cg_selftests.sh
> > >> >
> > >> > To watch misc cgroup 'current' changes during testing, run this in a
> > >> > separate terminal:
> > >> >
> > >> > ./watch_misc_for_tests.sh current
> > >> >
> > >> > With different cgroups, the script starts one or multiple concurrent
> > >> > SGX
> > >> > selftests, each to run one unclobbered_vdso_oversubscribed test.> Each
> > >> > of such test tries to load an enclave of EPC size equal to the EPC
> > >> > capacity available on the platform. The script checks results against
> > >> > the expectation set for each cgroup and reports success or failure.
> > >> >
> > >> > The script creates 3 different cgroups at the beginning with
> > >> > following
> > >> > expectations:
> > >> >
> > >> > 1) SMALL - intentionally small enough to fail the test loading an
> > >> > enclave of size equal to the capacity.
> > >> > 2) LARGE - large enough to run up to 4 concurrent tests but fail some
> > >> > if
> > >> > more than 4 concurrent tests are run. The script starts 4 expecting
> > >> > at
> > >> > least one test to pass, and then starts 5 expecting at least one test
> > >> > to fail.
> > >> > 3) LARGER - limit is the same as the capacity, large enough to run
> > >> > lots of
> > >> > concurrent tests. The script starts 8 of them and expects all pass.
> > >> > Then it reruns the same test with one process randomly killed and
> > >> > usage checked to be zero after all process exit.
> > >> >
> > >> > The script also includes a test with low mem_cg limit and LARGE
> > >> > sgx_epc
> > >> > limit to verify that the RAM used for per-cgroup reclamation is
> > >> > charged
> > >> > to a proper mem_cg.
> > >> >
> > >> > [1] https://github.com/libcgroup/libcgroup/blob/main/README
> > >> >
> > >> > Signed-off-by: Haitao Huang 
> > >> > ---
> > >> > V7:
> > >> > - Added memcontrol test.
> > >> >
> > >> > V5:
> > >> > - Added script with automatic results checking, remove the
> > >> > interactive
> > >> > script.
> > >> > - The script can run independent from the series below.
> > >> > ---
> > >> >  .../selftests/sgx/run_epc_cg_selftests.sh | 246
> > >> > ++
> > >> >  .../selftests/sgx/watch_misc_for_tests.sh |  13 +
> > >> >  2 files changed, 259 insertions(+)
> > >> >  create mode 100755
> > >> > tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> > >> >  create mode 100755
> > >> > tools/testing/selftests/sgx/watch_misc_for_tests.sh
> > >> >
> > >> > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> > >> > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> > >> > new file mode 100755
> > >> > index ..e027bf39f005
> > >> > --- /dev/null
> > >> > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> > >> > @@ -0,0 +1,246 @@
> > >> > +#!/bin/bash
> > >>
> > >> This is not portable and neither does hold in the wild.
> > >>
> > >> It does not even often hold as it is not uncommon to place bash
> > >> to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
> > >> a path that is neither of those two.
> > >>
> > >> Should be #!/usr/bin/env bash
> > >>
> > >> That is POSIX compatible form.
> > >>
> > >> Just got around trying to test this in NUC7 so looking into this in
> > >> more detail.
&

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-03-30 Thread Jarkko Sakkinen

On Thu Mar 28, 2024 at 2:57 AM EET, Haitao Huang wrote:
> On Wed, 27 Mar 2024 11:56:35 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote:
> >> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> >> > The scripts rely on cgroup-tools package from libcgroup [1].
> >> >
> >> > To run selftests for epc cgroup:
> >> >
> >> > sudo ./run_epc_cg_selftests.sh
> >> >
> >> > To watch misc cgroup 'current' changes during testing, run this in a
> >> > separate terminal:
> >> >
> >> > ./watch_misc_for_tests.sh current
> >> >
> >> > With different cgroups, the script starts one or multiple concurrent
> >> > SGX
> >> > selftests, each to run one unclobbered_vdso_oversubscribed test.> Each
> >> > of such test tries to load an enclave of EPC size equal to the EPC
> >> > capacity available on the platform. The script checks results against
> >> > the expectation set for each cgroup and reports success or failure.
> >> >
> >> > The script creates 3 different cgroups at the beginning with
> >> > following
> >> > expectations:
> >> >
> >> > 1) SMALL - intentionally small enough to fail the test loading an
> >> > enclave of size equal to the capacity.
> >> > 2) LARGE - large enough to run up to 4 concurrent tests but fail some
> >> > if
> >> > more than 4 concurrent tests are run. The script starts 4 expecting
> >> > at
> >> > least one test to pass, and then starts 5 expecting at least one test
> >> > to fail.
> >> > 3) LARGER - limit is the same as the capacity, large enough to run
> >> > lots of
> >> > concurrent tests. The script starts 8 of them and expects all pass.
> >> > Then it reruns the same test with one process randomly killed and
> >> > usage checked to be zero after all process exit.
> >> >
> >> > The script also includes a test with low mem_cg limit and LARGE
> >> > sgx_epc
> >> > limit to verify that the RAM used for per-cgroup reclamation is
> >> > charged
> >> > to a proper mem_cg.
> >> >
> >> > [1] https://github.com/libcgroup/libcgroup/blob/main/README
> >> >
> >> > Signed-off-by: Haitao Huang 
> >> > ---
> >> > V7:
> >> > - Added memcontrol test.
> >> >
> >> > V5:
> >> > - Added script with automatic results checking, remove the
> >> > interactive
> >> > script.
> >> > - The script can run independent from the series below.
> >> > ---
> >> >  .../selftests/sgx/run_epc_cg_selftests.sh | 246
> >> > ++
> >> >  .../selftests/sgx/watch_misc_for_tests.sh |  13 +
> >> >  2 files changed, 259 insertions(+)
> >> >  create mode 100755
> >> > tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> >  create mode 100755
> >> > tools/testing/selftests/sgx/watch_misc_for_tests.sh
> >> >
> >> > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> > new file mode 100755
> >> > index ..e027bf39f005
> >> > --- /dev/null
> >> > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> > @@ -0,0 +1,246 @@
> >> > +#!/bin/bash
> >>
> >> This is not portable and neither does hold in the wild.
> >>
> >> It does not even often hold as it is not uncommon to place bash
> >> to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
> >> a path that is neither of those two.
> >>
> >> Should be #!/usr/bin/env bash
> >>
> >> That is POSIX compatible form.
> >>
> >> Just got around trying to test this in NUC7 so looking into this in
> >> more detail.
> >>
> >> That said can you make the script work with just "#!/usr/bin/env sh"
> >> and make sure that it is busybox ash compatible?
> >>
> >> I don't see any necessity to make this bash only and it adds to the
> >> compilation time of the image. Otherwise lot of this could be tested
> >> just with qemu+bzImage+busybox(inside initramfs).
> >>
> >> Now you are adding fully glibc shenanigans for the sake of syntax
&g

Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-03-30 Thread Jarkko Sakkinen

On Thu Mar 28, 2024 at 2:53 PM EET, Huang, Kai wrote:
>
> > --- /dev/null
> > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> > @@ -0,0 +1,74 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +// Copyright(c) 2022 Intel Corporation.
>
> It's 2024 now.
>
> And looks you need to use C style comment for /* Copyright ... */, after 
> looking
> at some other C files.

To be fair, this happens *all the time* to everyone :-)

I've proposed this few times in SGX context and going to say it now.
Given the nature of Git copyrights would anyway need to be sorted by
the Git log not possibly incorrect copyright platters in the header
and source files.

BR, Jarkko

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-03-30 Thread Jarkko Sakkinen

On Thu Mar 28, 2024 at 5:54 AM EET, Haitao Huang wrote:
> On Wed, 27 Mar 2024 07:55:34 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> >> The scripts rely on cgroup-tools package from libcgroup [1].
> >>
> >> To run selftests for epc cgroup:
> >>
> >> sudo ./run_epc_cg_selftests.sh
> >>
> >> To watch misc cgroup 'current' changes during testing, run this in a
> >> separate terminal:
> >>
> >> ./watch_misc_for_tests.sh current
> >>
> >> With different cgroups, the script starts one or multiple concurrent
> >> SGX
> >> selftests, each to run one unclobbered_vdso_oversubscribed test.Each
> >> of such test tries to load an enclave of EPC size equal to the EPC
> >> capacity available on the platform. The script checks results against
> >> the expectation set for each cgroup and reports success or failure.
> >>
> >> The script creates 3 different cgroups at the beginning with
> >> following
> >> expectations:
> >>
> >> 1) SMALL - intentionally small enough to fail the test loading an
> >> enclave of size equal to the capacity.
> >> 2) LARGE - large enough to run up to 4 concurrent tests but fail some
> >> if
> >> more than 4 concurrent tests are run. The script starts 4 expecting
> >> at
> >> least one test to pass, and then starts 5 expecting at least one test
> >> to fail.
> >> 3) LARGER - limit is the same as the capacity, large enough to run
> >> lots of
> >> concurrent tests. The script starts 8 of them and expects all pass.
> >> Then it reruns the same test with one process randomly killed and
> >> usage checked to be zero after all process exit.
> >>
> >> The script also includes a test with low mem_cg limit and LARGE
> >> sgx_epc
> >> limit to verify that the RAM used for per-cgroup reclamation is
> >> charged
> >> to a proper mem_cg.
> >>
> >> [1] https://github.com/libcgroup/libcgroup/blob/main/README
> >>
> >> Signed-off-by: Haitao Huang 
> >> ---
> >> V7:
> >> - Added memcontrol test.
> >>
> >> V5:
> >> - Added script with automatic results checking, remove the
> >> interactive
> >> script.
> >> - The script can run independent from the series below.
> >> ---
> >>  .../selftests/sgx/run_epc_cg_selftests.sh | 246
> >> ++
> >>  .../selftests/sgx/watch_misc_for_tests.sh |  13 +
> >>  2 files changed, 259 insertions(+)
> >>  create mode 100755
> >> tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >>  create mode 100755
> >> tools/testing/selftests/sgx/watch_misc_for_tests.sh
> >>
> >> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> new file mode 100755
> >> index ..e027bf39f005
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> @@ -0,0 +1,246 @@
> >> +#!/bin/bash
> >
> > This is not portable and neither does hold in the wild.
> >
> > It does not even often hold as it is not uncommon to place bash
> > to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
> > a path that is neither of those two.
> >
> > Should be #!/usr/bin/env bash
> >
> > That is POSIX compatible form.
> >
>
> Sure
>
> > Just got around trying to test this in NUC7 so looking into this in
> > more detail.
>
> Thanks. Could you please check if this version works for you?
>
> https://github.com/haitaohuang/linux/commit/3c424b841cf3cf66b085a424f4b537fbc3bbff6f
>
> >
> > That said can you make the script work with just "#!/usr/bin/env sh"
> > and make sure that it is busybox ash compatible?
>
> Yes.
>
> >
> > I don't see any necessity to make this bash only and it adds to the
> > compilation time of the image. Otherwise lot of this could be tested
> > just with qemu+bzImage+busybox(inside initramfs).
> >
>
> will still need cgroup-tools as you pointed out later. Compiling from its  
> upstream code OK?

Can you explain why you need it?

What is the thing you cannot do without it?

BR, Jarkko

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-03-27 Thread Jarkko Sakkinen

On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote:
> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> > The scripts rely on cgroup-tools package from libcgroup [1].
> > 
> > To run selftests for epc cgroup:
> > 
> > sudo ./run_epc_cg_selftests.sh
> > 
> > To watch misc cgroup 'current' changes during testing, run this in a
> > separate terminal:
> > 
> > ./watch_misc_for_tests.sh current
> > 
> > With different cgroups, the script starts one or multiple concurrent
> > SGX
> > selftests, each to run one unclobbered_vdso_oversubscribed test. 
> > Each
> > of such test tries to load an enclave of EPC size equal to the EPC
> > capacity available on the platform. The script checks results against
> > the expectation set for each cgroup and reports success or failure.
> > 
> > The script creates 3 different cgroups at the beginning with
> > following
> > expectations:
> > 
> > 1) SMALL - intentionally small enough to fail the test loading an
> > enclave of size equal to the capacity.
> > 2) LARGE - large enough to run up to 4 concurrent tests but fail some
> > if
> > more than 4 concurrent tests are run. The script starts 4 expecting
> > at
> > least one test to pass, and then starts 5 expecting at least one test
> > to fail.
> > 3) LARGER - limit is the same as the capacity, large enough to run
> > lots of
> > concurrent tests. The script starts 8 of them and expects all pass.
> > Then it reruns the same test with one process randomly killed and
> > usage checked to be zero after all process exit.
> > 
> > The script also includes a test with low mem_cg limit and LARGE
> > sgx_epc
> > limit to verify that the RAM used for per-cgroup reclamation is
> > charged
> > to a proper mem_cg.
> > 
> > [1] https://github.com/libcgroup/libcgroup/blob/main/README
> > 
> > Signed-off-by: Haitao Huang 
> > ---
> > V7:
> > - Added memcontrol test.
> > 
> > V5:
> > - Added script with automatic results checking, remove the
> > interactive
> > script.
> > - The script can run independent from the series below.
> > ---
> >  .../selftests/sgx/run_epc_cg_selftests.sh | 246
> > ++
> >  .../selftests/sgx/watch_misc_for_tests.sh |  13 +
> >  2 files changed, 259 insertions(+)
> >  create mode 100755
> > tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >  create mode 100755
> > tools/testing/selftests/sgx/watch_misc_for_tests.sh
> > 
> > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> > new file mode 100755
> > index ..e027bf39f005
> > --- /dev/null
> > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> > @@ -0,0 +1,246 @@
> > +#!/bin/bash
>
> This is not portable and neither does hold in the wild.
>
> It does not even often hold as it is not uncommon to place bash
> to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
> a path that is neither of those two.
>
> Should be #!/usr/bin/env bash
>
> That is POSIX compatible form.
>
> Just got around trying to test this in NUC7 so looking into this in
> more detail.
>
> That said can you make the script work with just "#!/usr/bin/env sh"
> and make sure that it is busybox ash compatible?
>
> I don't see any necessity to make this bash only and it adds to the
> compilation time of the image. Otherwise lot of this could be tested
> just with qemu+bzImage+busybox(inside initramfs).
>
> Now you are adding fully glibc shenanigans for the sake of syntax
> sugar.
>
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright(c) 2023 Intel Corporation.
> > +
> > +TEST_ROOT_CG=selftest
> > +cgcreate -g misc:$TEST_ROOT_CG
>
> How do you know that cgcreate exists? It is used a lot in the script
> with no check for the existence. Please fix e.g. with "command -v
> cgreate".
>
> > +if [ $? -ne 0 ]; then
> > +    echo "# Please make sure cgroup-tools is installed, and misc
> > cgroup is mounted."
> > +    exit 1
> > +fi
>
> And please do not do it this way. Also, please remove the advice for
> "cgroups-tool". This is not meant to be debian only. Better would be
> to e.g. point out the URL of the upstream project.
>
> And yeah the whole message should be based on "command -v", not like
> this.
>
> > +TEST_CG_SUB1=$TEST_ROOT_CG/test1
> > +TEST_CG_SUB2=$TEST_ROOT_CG/test2
> > +# We will only

Re: [PATCH v7 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-27 Thread Jarkko Sakkinen

On Tue Mar 26, 2024 at 8:42 PM EET, Conor Dooley wrote:
> On Tue, Mar 26, 2024 at 03:46:16PM +0200, Jarkko Sakkinen wrote:
> > Tacing with kprobes while running a monolithic kernel is currently
> > impossible due the kernel module allocator dependency.
> > 
> > Address the issue by implementing textmem API for RISC-V.
>
> This doesn't compile for nommu:
>   /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:10:46: error: 
> 'MODULES_VADDR' undeclared (first use in this function)
>   /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:11:37: error: 
> 'MODULES_END' undeclared (first use in this function)
>   /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:14:1: error: control 
> reaches end of non-void function [-Werror=return-type]
> Clang builds also report:
> ../arch/riscv/kernel/execmem.c:8:56: warning: omitting the parameter name in 
> a function definition is a C2x extension [-Wc2x-extensions]
>
> > 
> > Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
> > stack
> > Link: 
> > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # 
> > continuation
> > Signed-off-by: Jarkko Sakkinen 
> > ---
> > v5-v7:
> > - No changes.
> > v4:
> > - Include linux/execmem.h.
> > v3:
> > - Architecture independent parts have been split to separate patches.
> > - Do not change arch/riscv/kernel/module.c as it is out of scope for
> >   this patch set now.
>
> Meta comment. I dunno when v1 was sent, but versions can you please
> relax with submitting new versions of your patches? There's conversations
> ongoing on v5 at the moment, while this is a more recent version. v2
> seems to have been sent on the 23rd and there's been 5 versions in the
> last day:
> https://patchwork.kernel.org/project/linux-riscv/list/?submitter=195059=*
>
> Could you please also try and use a cover letter for patchsets, ideally
> with a consistent subject? Otherwise I have to manually mark stuff as
> superseded.

Point taken but the work has been taken over by Mark and relevant
changes are now sucked into that patch set.

> Thanks,
> Conor.

BR, Jarkko

Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag

2024-03-27 Thread Jarkko Sakkinen

On Wed, 2024-03-27 at 02:42 +, Edgecombe, Rick P wrote:
> On Tue, 2024-03-26 at 13:57 +0200, Jarkko Sakkinen wrote:
> > In which conditions which path is used during the initialization of
> > mm
> > and why is this the case? It is an open claim in the current form.
> 
> There is an arch_pick_mmap_layout() that arch's can have their own
> rules for. There is also a
> generic one. It gets called during exec.
> 
> > 
> > That would be nice to have documented for the sake of being
> > complete
> > description. I have zero doubts of the claim being untrue.
> 
> ...being untrue?
> 

I mean I believe the change itself makes sense, it is just not
fully documented in the commit message.

BR, Jarkko

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-03-27 Thread Jarkko Sakkinen

On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> The scripts rely on cgroup-tools package from libcgroup [1].
> 
> To run selftests for epc cgroup:
> 
> sudo ./run_epc_cg_selftests.sh
> 
> To watch misc cgroup 'current' changes during testing, run this in a
> separate terminal:
> 
> ./watch_misc_for_tests.sh current
> 
> With different cgroups, the script starts one or multiple concurrent
> SGX
> selftests, each to run one unclobbered_vdso_oversubscribed test. 
> Each
> of such test tries to load an enclave of EPC size equal to the EPC
> capacity available on the platform. The script checks results against
> the expectation set for each cgroup and reports success or failure.
> 
> The script creates 3 different cgroups at the beginning with
> following
> expectations:
> 
> 1) SMALL - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) LARGE - large enough to run up to 4 concurrent tests but fail some
> if
> more than 4 concurrent tests are run. The script starts 4 expecting
> at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) LARGER - limit is the same as the capacity, large enough to run
> lots of
> concurrent tests. The script starts 8 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all process exit.
> 
> The script also includes a test with low mem_cg limit and LARGE
> sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is
> charged
> to a proper mem_cg.
> 
> [1] https://github.com/libcgroup/libcgroup/blob/main/README
> 
> Signed-off-by: Haitao Huang 
> ---
> V7:
> - Added memcontrol test.
> 
> V5:
> - Added script with automatic results checking, remove the
> interactive
> script.
> - The script can run independent from the series below.
> ---
>  .../selftests/sgx/run_epc_cg_selftests.sh | 246
> ++
>  .../selftests/sgx/watch_misc_for_tests.sh |  13 +
>  2 files changed, 259 insertions(+)
>  create mode 100755
> tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>  create mode 100755
> tools/testing/selftests/sgx/watch_misc_for_tests.sh
> 
> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> new file mode 100755
> index ..e027bf39f005
> --- /dev/null
> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> @@ -0,0 +1,246 @@
> +#!/bin/bash

This is not portable and neither does hold in the wild.

It does not even often hold as it is not uncommon to place bash
to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
a path that is neither of those two.

Should be #!/usr/bin/env bash

That is POSIX compatible form.

Just got around trying to test this in NUC7 so looking into this in
more detail.

That said can you make the script work with just "#!/usr/bin/env sh"
and make sure that it is busybox ash compatible?

I don't see any necessity to make this bash only and it adds to the
compilation time of the image. Otherwise lot of this could be tested
just with qemu+bzImage+busybox(inside initramfs).

Now you are adding fully glibc shenanigans for the sake of syntax
sugar.

> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright(c) 2023 Intel Corporation.
> +
> +TEST_ROOT_CG=selftest
> +cgcreate -g misc:$TEST_ROOT_CG

How do you know that cgcreate exists? It is used a lot in the script
with no check for the existence. Please fix e.g. with "command -v
cgreate".

> +if [ $? -ne 0 ]; then
> +    echo "# Please make sure cgroup-tools is installed, and misc
> cgroup is mounted."
> +    exit 1
> +fi

And please do not do it this way. Also, please remove the advice for
"cgroups-tool". This is not meant to be debian only. Better would be
to e.g. point out the URL of the upstream project.

And yeah the whole message should be based on "command -v", not like
this.

> +TEST_CG_SUB1=$TEST_ROOT_CG/test1
> +TEST_CG_SUB2=$TEST_ROOT_CG/test2
> +# We will only set limit in test1 and run tests in test3
> +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
> +TEST_CG_SUB4=$TEST_ROOT_CG/test4
> +
> +cgcreate -g misc:$TEST_CG_SUB1



> +cgcreate -g misc:$TEST_CG_SUB2
> +cgcreate -g misc:$TEST_CG_SUB3
> +cgcreate -g misc:$TEST_CG_SUB4
> +
> +# Default to V2
> +CG_MISC_ROOT=/sys/fs/cgroup
> +CG_MEM_ROOT=/sys/fs/cgroup
> +CG_V1=0
> +if [ ! -d "/sys/fs/cgroup/misc" ]; then
> +    echo "# cgroup V2 is in use."
> +else
> +    echo "# cgroup V1 is in use."

Is "#" prefix a standard for kselftest? I don't know this, thus asking.

> +    CG_MISC_ROOT=/sys/fs/cgroup/misc
> +    CG_MEM_ROOT=/sys/fs/cgroup/memory
> +    CG_V1=1

Have you checked what is the indentation policy for bash scripts inside
kernel tree. I don't know what it is. That's why I'm asking.

> +fi
> +
> +CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk
> '{print $2}')
> +# This is below number of VA pages needed for enclave of capacity
> size. So
> +# should

Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Jarkko Sakkinen

On Tue Mar 26, 2024 at 6:38 PM EET, Mark Rutland wrote:
> On Wed, Mar 27, 2024 at 12:24:03AM +0900, Masami Hiramatsu wrote:
> > On Tue, 26 Mar 2024 14:46:10 +
> > Mark Rutland  wrote:
> > > 
> > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> > > > I think, we'd better to introduce `alloc_execmem()`,
> > > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first
> > > > 
> > > >   config HAVE_ALLOC_EXECMEM
> > > > bool
> > > > 
> > > >   config ALLOC_EXECMEM
> > > > bool "Executable trampline memory allocation"
> > > > depends on MODULES || HAVE_ALLOC_EXECMEM
> > > > 
> > > > And define fallback macro to module_alloc() like this.
> > > > 
> > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> > > > #define alloc_execmem(size, gfp)module_alloc(size)
> > > > #endif
> > > 
> > > Please can we *not* do this? I think this is abstracting at the wrong 
> > > level (as
> > > I mentioned on the prior execmem proposals).
> > > 
> > > Different exectuable allocations can have different requirements. For 
> > > example,
> > > on arm64 modules need to be within 2G of the kernel image, but the 
> > > kprobes XOL
> > > areas can be anywhere in the kernel VA space.
> > > 
> > > Forcing those behind the same interface makes things *harder* for 
> > > architectures
> > > and/or makes the common code more complicated (if that ends up having to 
> > > track
> > > all those different requirements). From my PoV it'd be much better to have
> > > separate kprobes_alloc_*() functions for kprobes which an architecture 
> > > can then
> > > choose to implement using a common library if it wants to.
> > > 
> > > I took a look at doing that using the core ifdeffery fixups from Jarkko's 
> > > v6,
> > > and it looks pretty clean to me (and works in testing on arm64):
> > > 
> > >   
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
> > > 
> > > Could we please start with that approach, with kprobe-specific alloc/free 
> > > code
> > > provided by the architecture?
> > 
> > OK, as far as I can read the code, this method also works and neat! 
> > (and minimum intrusion). I actually found that exposing CONFIG_ALLOC_EXECMEM
> > to user does not help, it should be an internal change. So hiding this 
> > change
> > from user is better choice. Then there is no reason to introduce the new
> > alloc_execmem, but just expand kprobe_alloc_insn_page() is reasonable.
> > 
> > Mark, can you send this series here, so that others can review/test it?
>
> I've written up a cover letter and sent that out:
>   
>   https://lore.kernel.org/lkml/20240326163624.3253157-1-mark.rutl...@arm.com/
>
> Mark.

Ya, saw it thanks!

BR, Jarkko

Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Jarkko Sakkinen

On Tue Mar 26, 2024 at 6:15 PM EET, Calvin Owens wrote:
> On Wednesday 03/27 at 00:24 +0900, Masami Hiramatsu wrote:
> > On Tue, 26 Mar 2024 14:46:10 +
> > Mark Rutland  wrote:
> > 
> > > Hi Masami,
> > > 
> > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> > > > Hi Jarkko,
> > > > 
> > > > On Sun, 24 Mar 2024 01:29:08 +0200
> > > > Jarkko Sakkinen  wrote:
> > > > 
> > > > > Tracing with kprobes while running a monolithic kernel is currently
> > > > > impossible due the kernel module allocator dependency.
> > > > > 
> > > > > Address the issue by allowing architectures to implement 
> > > > > module_alloc()
> > > > > and module_memfree() independent of the module subsystem. An arch tree
> > > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > > > > 
> > > > > Realize the feature on RISC-V by separating allocator to 
> > > > > module_alloc.c
> > > > > and implementing module_memfree().
> > > > 
> > > > Even though, this involves changes in arch-independent part. So it 
> > > > should
> > > > be solved by generic way. Did you checked Calvin's thread?
> > > > 
> > > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/
> > > > 
> > > > I think, we'd better to introduce `alloc_execmem()`,
> > > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first
> > > > 
> > > >   config HAVE_ALLOC_EXECMEM
> > > > bool
> > > > 
> > > >   config ALLOC_EXECMEM
> > > > bool "Executable trampline memory allocation"
> > > > depends on MODULES || HAVE_ALLOC_EXECMEM
> > > > 
> > > > And define fallback macro to module_alloc() like this.
> > > > 
> > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> > > > #define alloc_execmem(size, gfp)module_alloc(size)
> > > > #endif
> > > 
> > > Please can we *not* do this? I think this is abstracting at the wrong 
> > > level (as
> > > I mentioned on the prior execmem proposals).
> > > 
> > > Different exectuable allocations can have different requirements. For 
> > > example,
> > > on arm64 modules need to be within 2G of the kernel image, but the 
> > > kprobes XOL
> > > areas can be anywhere in the kernel VA space.
> > > 
> > > Forcing those behind the same interface makes things *harder* for 
> > > architectures
> > > and/or makes the common code more complicated (if that ends up having to 
> > > track
> > > all those different requirements). From my PoV it'd be much better to have
> > > separate kprobes_alloc_*() functions for kprobes which an architecture 
> > > can then
> > > choose to implement using a common library if it wants to.
> > > 
> > > I took a look at doing that using the core ifdeffery fixups from Jarkko's 
> > > v6,
> > > and it looks pretty clean to me (and works in testing on arm64):
> > > 
> > >   
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
> > > 
> > > Could we please start with that approach, with kprobe-specific alloc/free 
> > > code
> > > provided by the architecture?
>
> Heh, I also noticed that dead !RWX branch in arm64 patch_map(), I was
> about to send a patch to remove it.
>
> > OK, as far as I can read the code, this method also works and neat! 
> > (and minimum intrusion). I actually found that exposing CONFIG_ALLOC_EXECMEM
> > to user does not help, it should be an internal change. So hiding this 
> > change
> > from user is better choice. Then there is no reason to introduce the new
> > alloc_execmem, but just expand kprobe_alloc_insn_page() is reasonable.
>
> I'm happy with this, it solves the first half of my problem. But I want
> eBPF to work in the !MODULES case too.
>
> I think Mark's approach can work for bpf as well, without needing to
> touch module_alloc() at all? So I might be able to drop that first patch
> entirely.

Yeah, I think we're aligned. Later on, if/when you send the bpf series
please also cc me and I might possibly test those patches too.

BR, Jarkko

Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Jarkko Sakkinen

On Tue Mar 26, 2024 at 5:24 PM EET, Masami Hiramatsu (Google) wrote:
> On Tue, 26 Mar 2024 14:46:10 +
> Mark Rutland  wrote:
>
> > Hi Masami,
> > 
> > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> > > Hi Jarkko,
> > > 
> > > On Sun, 24 Mar 2024 01:29:08 +0200
> > > Jarkko Sakkinen  wrote:
> > > 
> > > > Tracing with kprobes while running a monolithic kernel is currently
> > > > impossible due the kernel module allocator dependency.
> > > > 
> > > > Address the issue by allowing architectures to implement module_alloc()
> > > > and module_memfree() independent of the module subsystem. An arch tree
> > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > > > 
> > > > Realize the feature on RISC-V by separating allocator to module_alloc.c
> > > > and implementing module_memfree().
> > > 
> > > Even though, this involves changes in arch-independent part. So it should
> > > be solved by generic way. Did you checked Calvin's thread?
> > > 
> > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/
> > > 
> > > I think, we'd better to introduce `alloc_execmem()`,
> > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first
> > > 
> > >   config HAVE_ALLOC_EXECMEM
> > >   bool
> > > 
> > >   config ALLOC_EXECMEM
> > >   bool "Executable trampline memory allocation"
> > >   depends on MODULES || HAVE_ALLOC_EXECMEM
> > > 
> > > And define fallback macro to module_alloc() like this.
> > > 
> > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> > > #define alloc_execmem(size, gfp)  module_alloc(size)
> > > #endif
> > 
> > Please can we *not* do this? I think this is abstracting at the wrong level 
> > (as
> > I mentioned on the prior execmem proposals).
> > 
> > Different exectuable allocations can have different requirements. For 
> > example,
> > on arm64 modules need to be within 2G of the kernel image, but the kprobes 
> > XOL
> > areas can be anywhere in the kernel VA space.
> > 
> > Forcing those behind the same interface makes things *harder* for 
> > architectures
> > and/or makes the common code more complicated (if that ends up having to 
> > track
> > all those different requirements). From my PoV it'd be much better to have
> > separate kprobes_alloc_*() functions for kprobes which an architecture can 
> > then
> > choose to implement using a common library if it wants to.
> > 
> > I took a look at doing that using the core ifdeffery fixups from Jarkko's 
> > v6,
> > and it looks pretty clean to me (and works in testing on arm64):
> > 
> >   
> > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
> > 
> > Could we please start with that approach, with kprobe-specific alloc/free 
> > code
> > provided by the architecture?
>
> OK, as far as I can read the code, this method also works and neat! 
> (and minimum intrusion). I actually found that exposing CONFIG_ALLOC_EXECMEM
> to user does not help, it should be an internal change. So hiding this change
> from user is better choice. Then there is no reason to introduce the new
> alloc_execmem, but just expand kprobe_alloc_insn_page() is reasonable.
>
> Mark, can you send this series here, so that others can review/test it?

I'm totally fine with this but yeah best would be if it could carry
the riscv part. Mark, even if you have only possibility to compile
test that part I can check that it works.

BR, Jarkko

Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-26 Thread Jarkko Sakkinen

On Tue Mar 26, 2024 at 5:05 PM EET, Masami Hiramatsu (Google) wrote:
> > According to kconfig-language.txt:
> > 
> > "select should be used with care. select will force a symbol to a value
> > without visiting the dependencies."
> > 
> > So the problem here lies in KPROBES config entry using select statement
> > to pick ALLOC_EXECMEM. It will not take the depends on statement into
> > account and thus will allow to select kprobes without any allocator in
> > place.
>
> OK, in that case "depend on" is good.

Yeah, did not remember this at all. Only recalled when I started to
get linking errors when compiling just the first patch... It's a bit
uninituitive twist in kconfig :-)

BR, Jarkko

Re: [PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Jarkko Sakkinen

On Tue Mar 26, 2024 at 6:49 PM EET, Jarkko Sakkinen wrote:
> On Tue Mar 26, 2024 at 3:57 PM EET, Alexandre Ghiti wrote:
> > Hi Jarkko,
> >
> > On 25/03/2024 22:55, Jarkko Sakkinen wrote:
> > > Tacing with kprobes while running a monolithic kernel is currently
> > > impossible due the kernel module allocator dependency.
> > >
> > > Address the issue by implementing textmem API for RISC-V.
> > >
> > > Link: https://www.sochub.fi # for power on testing new SoC's with a 
> > > minimal stack
> > > Link: 
> > > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
> > > # continuation
> > > Signed-off-by: Jarkko Sakkinen 
> > > ---
> > > v5:
> > > - No changes, expect removing alloc_execmem() call which should have
> > >been part of the previous patch.
> > > v4:
> > > - Include linux/execmem.h.
> > > v3:
> > > - Architecture independent parts have been split to separate patches.
> > > - Do not change arch/riscv/kernel/module.c as it is out of scope for
> > >this patch set now.
> > > v2:
> > > - Better late than never right? :-)
> > > - Focus only to RISC-V for now to make the patch more digestable. This
> > >is the arch where I use the patch on a daily basis to help with QA.
> > > - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
> > > ---
> > >   arch/riscv/Kconfig  |  1 +
> > >   arch/riscv/kernel/Makefile  |  3 +++
> > >   arch/riscv/kernel/execmem.c | 22 ++
> > >   3 files changed, 26 insertions(+)
> > >   create mode 100644 arch/riscv/kernel/execmem.c
> > >
> > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > index e3142ce531a0..499512fb17ff 100644
> > > --- a/arch/riscv/Kconfig
> > > +++ b/arch/riscv/Kconfig
> > > @@ -132,6 +132,7 @@ config RISCV
> > >   select HAVE_KPROBES if !XIP_KERNEL
> > >   select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
> > >   select HAVE_KRETPROBES if !XIP_KERNEL
> > > + select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
> > >   # https://github.com/ClangBuiltLinux/linux/issues/1881
> > >   select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
> > >   select HAVE_MOVE_PMD
> > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > > index 604d6bf7e476..337797f10d3e 100644
> > > --- a/arch/riscv/kernel/Makefile
> > > +++ b/arch/riscv/kernel/Makefile
> > > @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP)   += cpu_ops.o
> > >   
> > >   obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
> > >   obj-$(CONFIG_MODULES)   += module.o
> > > +ifeq ($(CONFIG_ALLOC_EXECMEM),y)
> > > +obj-y+= execmem.o
> > > +endif
> > >   obj-$(CONFIG_MODULE_SECTIONS)   += module-sections.o
> > >   
> > >   obj-$(CONFIG_CPU_PM)+= suspend_entry.o suspend.o
> > > diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
> > > new file mode 100644
> > > index ..3e52522ead32
> > > --- /dev/null
> > > +++ b/arch/riscv/kernel/execmem.c
> > > @@ -0,0 +1,22 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > +
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +
> > > +void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
>
> Need to have the parameter name here. I guess this could just as well
> pass through gfp to vmalloc from the caller as kprobes does call
> module_alloc() with GFP_KERNEL set in RISC-V.
>
> > > +{
> > > + return __vmalloc_node_range(size, 1, MODULES_VADDR,
> > > + MODULES_END, GFP_KERNEL,
> > > + PAGE_KERNEL, 0, NUMA_NO_NODE,
> > > + __builtin_return_address(0));
> > > +}
> >
> >
> > The __vmalloc_node_range() line ^^ must be from an old kernel since we 
> > added VM_FLUSH_RESET_PERMS in 6.8, see 749b94b08005 ("riscv: Fix 
> > module_alloc() that did not reset the linear mapping permissions").
> >
> > In addition, I guess module_alloc() should now use alloc_execmem() right?
>
> Ack for the first comment. For the 2nd it is up to arch/ to choose
> whether to have shared or separate allocators.
>
> So if you want I can change it that way but did not want

Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Jarkko Sakkinen

On Tue Mar 26, 2024 at 4:46 PM EET, Mark Rutland wrote:
> Hi Masami,
>
> On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> > Hi Jarkko,
> > 
> > On Sun, 24 Mar 2024 01:29:08 +0200
> > Jarkko Sakkinen  wrote:
> > 
> > > Tracing with kprobes while running a monolithic kernel is currently
> > > impossible due the kernel module allocator dependency.
> > > 
> > > Address the issue by allowing architectures to implement module_alloc()
> > > and module_memfree() independent of the module subsystem. An arch tree
> > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > > 
> > > Realize the feature on RISC-V by separating allocator to module_alloc.c
> > > and implementing module_memfree().
> > 
> > Even though, this involves changes in arch-independent part. So it should
> > be solved by generic way. Did you checked Calvin's thread?
> > 
> > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/
> > 
> > I think, we'd better to introduce `alloc_execmem()`,
> > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first
> > 
> >   config HAVE_ALLOC_EXECMEM
> > bool
> > 
> >   config ALLOC_EXECMEM
> > bool "Executable trampline memory allocation"
> > depends on MODULES || HAVE_ALLOC_EXECMEM
> > 
> > And define fallback macro to module_alloc() like this.
> > 
> > #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> > #define alloc_execmem(size, gfp)module_alloc(size)
> > #endif
>
> Please can we *not* do this? I think this is abstracting at the wrong level 
> (as
> I mentioned on the prior execmem proposals).
>
> Different exectuable allocations can have different requirements. For example,
> on arm64 modules need to be within 2G of the kernel image, but the kprobes XOL
> areas can be anywhere in the kernel VA space.
>
> Forcing those behind the same interface makes things *harder* for 
> architectures
> and/or makes the common code more complicated (if that ends up having to track
> all those different requirements). From my PoV it'd be much better to have
> separate kprobes_alloc_*() functions for kprobes which an architecture can 
> then
> choose to implement using a common library if it wants to.
>
> I took a look at doing that using the core ifdeffery fixups from Jarkko's v6,
> and it looks pretty clean to me (and works in testing on arm64):
>
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
>
> Could we please start with that approach, with kprobe-specific alloc/free code
> provided by the architecture?

How should we move forward?

I'm fine with someone picking the pieces of my work as long as also the
riscv side is included. Can also continue rotating this, whatever works.

>
> Thanks,
> Mark.

BR, Jarkko

Re: [PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Jarkko Sakkinen

On Tue Mar 26, 2024 at 3:57 PM EET, Alexandre Ghiti wrote:
> Hi Jarkko,
>
> On 25/03/2024 22:55, Jarkko Sakkinen wrote:
> > Tacing with kprobes while running a monolithic kernel is currently
> > impossible due the kernel module allocator dependency.
> >
> > Address the issue by implementing textmem API for RISC-V.
> >
> > Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
> > stack
> > Link: 
> > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # 
> > continuation
> > Signed-off-by: Jarkko Sakkinen 
> > ---
> > v5:
> > - No changes, expect removing alloc_execmem() call which should have
> >been part of the previous patch.
> > v4:
> > - Include linux/execmem.h.
> > v3:
> > - Architecture independent parts have been split to separate patches.
> > - Do not change arch/riscv/kernel/module.c as it is out of scope for
> >this patch set now.
> > v2:
> > - Better late than never right? :-)
> > - Focus only to RISC-V for now to make the patch more digestable. This
> >is the arch where I use the patch on a daily basis to help with QA.
> > - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
> > ---
> >   arch/riscv/Kconfig  |  1 +
> >   arch/riscv/kernel/Makefile  |  3 +++
> >   arch/riscv/kernel/execmem.c | 22 ++
> >   3 files changed, 26 insertions(+)
> >   create mode 100644 arch/riscv/kernel/execmem.c
> >
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index e3142ce531a0..499512fb17ff 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -132,6 +132,7 @@ config RISCV
> > select HAVE_KPROBES if !XIP_KERNEL
> > select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
> > select HAVE_KRETPROBES if !XIP_KERNEL
> > +   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
> > # https://github.com/ClangBuiltLinux/linux/issues/1881
> > select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
> > select HAVE_MOVE_PMD
> > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > index 604d6bf7e476..337797f10d3e 100644
> > --- a/arch/riscv/kernel/Makefile
> > +++ b/arch/riscv/kernel/Makefile
> > @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
> >   
> >   obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
> >   obj-$(CONFIG_MODULES) += module.o
> > +ifeq ($(CONFIG_ALLOC_EXECMEM),y)
> > +obj-y  += execmem.o
> > +endif
> >   obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o
> >   
> >   obj-$(CONFIG_CPU_PM)  += suspend_entry.o suspend.o
> > diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
> > new file mode 100644
> > index ..3e52522ead32
> > --- /dev/null
> > +++ b/arch/riscv/kernel/execmem.c
> > @@ -0,0 +1,22 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +void *alloc_execmem(unsigned long size, gfp_t /* gfp */)

Need to have the parameter name here. I guess this could just as well
pass through gfp to vmalloc from the caller as kprobes does call
module_alloc() with GFP_KERNEL set in RISC-V.

> > +{
> > +   return __vmalloc_node_range(size, 1, MODULES_VADDR,
> > +   MODULES_END, GFP_KERNEL,
> > +   PAGE_KERNEL, 0, NUMA_NO_NODE,
> > +   __builtin_return_address(0));
> > +}
>
>
> The __vmalloc_node_range() line ^^ must be from an old kernel since we 
> added VM_FLUSH_RESET_PERMS in 6.8, see 749b94b08005 ("riscv: Fix 
> module_alloc() that did not reset the linear mapping permissions").
>
> In addition, I guess module_alloc() should now use alloc_execmem() right?

Ack for the first comment. For the 2nd it is up to arch/ to choose
whether to have shared or separate allocators.

So if you want I can change it that way but did not want to make the
call myself.

>
>
> > +
> > +void free_execmem(void *region)
> > +{
> > +   if (in_interrupt())
> > +   pr_warn("In interrupt context: vmalloc may not work.\n");
> > +
> > +   vfree(region);
> > +}
>
>
> I remember Mike Rapoport sent a patchset to introduce an API for 
> executable memory allocation 
> (https://lore.kernel.org/linux-mm/20230918072955.2507221-1-r...@kernel.org/), 
> how does this intersect with your work? I don't know the status of his 
> patchset though

[PATCH v7 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Jarkko Sakkinen

Tacing with kprobes while running a monolithic kernel is currently
impossible due the kernel module allocator dependency.

Address the issue by implementing textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v5-v7:
- No changes.
v4:
- Include linux/execmem.h.
v3:
- Architecture independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for
  this patch set now.
v2:
- Better late than never right? :-)
- Focus only to RISC-V for now to make the patch more digestable. This
  is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
 arch/riscv/Kconfig  |  1 +
 arch/riscv/kernel/Makefile  |  3 +++
 arch/riscv/kernel/execmem.c | 22 ++
 3 files changed, 26 insertions(+)
 create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
 
 obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
 obj-$(CONFIG_MODULES)  += module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y  += execmem.o
+endif
 obj-$(CONFIG_MODULE_SECTIONS)  += module-sections.o
 
 obj-$(CONFIG_CPU_PM)   += suspend_entry.o suspend.o
diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..3e52522ead32
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}
+
+void free_execmem(void *region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(region);
+}
-- 
2.44.0

[PATCH v7 1/2] kprobes: Implement trampoline memory allocator for tracing

2024-03-26 Thread Jarkko Sakkinen

Tracing with kprobes while running a monolithic kernel is currently
impossible because CONFIG_KPROBES depends on CONFIG_MODULES.

Introduce alloc_execmem() and free_execmem() for allocating executable
memory. If an arch implements these functions, it can mark this up with
the HAVE_ALLOC_EXECMEM kconfig flag.

The second new kconfig flag is ALLOC_EXECMEM, which can be selected if
either MODULES is selected or HAVE_ALLOC_EXECMEM is support by the arch. If
HAVE_ALLOC_EXECMEM is not supported by an arch, module_alloc() and
module_memfree() are used as a fallback, thus retaining backwards
compatibility to earlier kernel versions.

This will allow architecture to enable kprobes traces without requiring
to enable module.

The support can be implemented with four easy steps:

1. Implement alloc_execmem().
2. Implement free_execmem().
3. Edit arch//Makefile.
4. Set HAVE_ALLOC_EXECMEM in arch//Kconfig.

Link: 
https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
Suggested-by: Masami Hiramatsu 
Signed-off-by: Jarkko Sakkinen 
---
v7:
- Use "depends on" for ALLOC_EXECMEM instead of "select"
- Reduced and narrowed CONFIG_MODULES checks further in kprobes.c.
v6:
- Use null pointer for notifiers and register the module notifier only if
  IS_ENABLED(CONFIG_MODULES) is set.
- Fixed typo in the commit message and wrote more verbose description
  of the feature.
v5:
- alloc_execmem() was missing GFP_KERNEL parameter. The patch set did
  compile because 2/2 had the fixup (leaked there when rebasing the
  patch set).
v4:
- Squashed a couple of unrequired CONFIG_MODULES checks.
- See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/
v3:
- A new patch added.
- For IS_DEFINED() I need advice as I could not really find that many
  locations where it would be applicable.
---
 arch/Kconfig| 17 +++-
 include/linux/execmem.h | 13 +
 kernel/kprobes.c| 53 ++---
 kernel/trace/trace_kprobe.c | 15 +--
 4 files changed, 73 insertions(+), 25 deletions(-)
 create mode 100644 include/linux/execmem.h

diff --git a/arch/Kconfig b/arch/Kconfig
index a5af0edd3eb8..5e9735f60f3c 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,8 +52,8 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
+   depends on ALLOC_EXECMEM
select KALLSYMS
select TASKS_RCU if PREEMPTION
help
@@ -215,6 +215,21 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_ALLOC_EXECMEM
+   bool
+   help
+ Architectures that select this option are capable of allocating 
trampoline
+ executable memory for tracing subsystems, indepedently of the kernel 
module
+ subsystem.
+
+config ALLOC_EXECMEM
+   bool "Executable (trampoline) memory allocation"
+   default y
+   depends on MODULES || HAVE_ALLOC_EXECMEM
+   help
+ Select this for executable (trampoline) memory. Can be enabled when 
either
+ module allocator or arch-specific allocator is available.
+
 config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
bool
help
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
new file mode 100644
index ..ae2ff151523a
--- /dev/null
+++ b/include/linux/execmem.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_EXECMEM_H
+#define _LINUX_EXECMEM_H
+
+#ifdef CONFIG_HAVE_ALLOC_EXECMEM
+void *alloc_execmem(unsigned long size, gfp_t gfp);
+void free_execmem(void *region);
+#else
+#define alloc_execmem(size, gfp)   module_alloc(size)
+#define free_execmem(region)   module_memfree(region)
+#endif
+
+#endif /* _LINUX_EXECMEM_H */
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..13bef5de315c 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define KPROBE_HASH_BITS 6
 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
@@ -113,17 +114,17 @@ enum kprobe_slot_state {
 void __weak *alloc_insn_page(void)
 {
/*
-* Use module_alloc() so this page is within +/- 2GB of where the
+* Use alloc_execmem() so this page is within +/- 2GB of where the
 * kernel image and loaded module images reside. This is required
 * for most of the architectures.
 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 */
-   return module_alloc(PAGE_SIZE);
+   return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
 }
 
 static void free_insn_page(void *page)
 {
-   module_memfree(page);
+   free_execmem(page);
 }
 
 struct kprobe_insn_cache kprobe_insn_slots = {
@@ -1592,6 +1593,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
+#ifdef CONFIG_MODULES
/*
 *

Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-26 Thread Jarkko Sakkinen

On Tue Mar 26, 2024 at 4:01 AM EET, Jarkko Sakkinen wrote:
> On Tue Mar 26, 2024 at 3:31 AM EET, Jarkko Sakkinen wrote:
> > > > +#endif /* _LINUX_EXECMEM_H */
> > > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > > > index 9d9095e81792..87fd8c14a938 100644
> > > > --- a/kernel/kprobes.c
> > > > +++ b/kernel/kprobes.c
> > > > @@ -44,6 +44,7 @@
> > > >  #include 
> > > >  #include 
> > > >  #include 
> > > > +#include 
> > > >  
> > > >  #define KPROBE_HASH_BITS 6
> > > >  #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
> > > > @@ -113,17 +114,17 @@ enum kprobe_slot_state {
> > > >  void __weak *alloc_insn_page(void)
> > > >  {
> > > > /*
> > > > -* Use module_alloc() so this page is within +/- 2GB of where 
> > > > the
> > > > +* Use alloc_execmem() so this page is within +/- 2GB of where 
> > > > the
> > > >  * kernel image and loaded module images reside. This is 
> > > > required
> > > >  * for most of the architectures.
> > > >  * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
> > > >  */
> > > > -   return module_alloc(PAGE_SIZE);
> > > > +   return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
> > > >  }
> > > >  
> > > >  static void free_insn_page(void *page)
> > > >  {
> > > > -   module_memfree(page);
> > > > +   free_execmem(page);
> > > >  }
> > > >  
> > > >  struct kprobe_insn_cache kprobe_insn_slots = {
> > > > @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct 
> > > > kprobe *p,
> > > > goto out;
> > > > }
> > > >  
> > > > +#ifdef CONFIG_MODULES
> > >
> > > You don't need this block, because these APIs have dummy functions.
> >
> > Hmm... I'll verify this tomorrow.
>
> It depends on having struct module available given "(*probed_mod)->state".
>
> It is non-existent unless CONFIG_MODULES is set given how things are
> flagged in include/linux/module.h.

Hey, noticed kconfig issue.

According to kconfig-language.txt:

"select should be used with care. select will force a symbol to a value
without visiting the dependencies."

So the problem here lies in KPROBES config entry using select statement
to pick ALLOC_EXECMEM. It will not take the depends on statement into
account and thus will allow to select kprobes without any allocator in
place.

So to address this I'd suggest to use depends on statement also for
describing relation between KPROBES and ALLOC_EXECMEM. It does not make
life worse than before for anyone because even with the current kernel
you have to select MODULES before you can move forward with kprobes.

BR, Jarkko

Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag

2024-03-26 Thread Jarkko Sakkinen

On Tue Mar 26, 2024 at 4:16 AM EET, Rick Edgecombe wrote:
> The mm_struct contains a function pointer *get_unmapped_area(), which
> is set to either arch_get_unmapped_area() or
> arch_get_unmapped_area_topdown() during the initialization of the mm.

In which conditions which path is used during the initialization of mm
and why is this the case? It is an open claim in the current form.

That would be nice to have documented for the sake of being complete
description. I have zero doubts of the claim being untrue.

BR, Jarkko

Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen

On Tue Mar 26, 2024 at 3:31 AM EET, Jarkko Sakkinen wrote:
> > > +#endif /* _LINUX_EXECMEM_H */
> > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > > index 9d9095e81792..87fd8c14a938 100644
> > > --- a/kernel/kprobes.c
> > > +++ b/kernel/kprobes.c
> > > @@ -44,6 +44,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  
> > >  #define KPROBE_HASH_BITS 6
> > >  #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
> > > @@ -113,17 +114,17 @@ enum kprobe_slot_state {
> > >  void __weak *alloc_insn_page(void)
> > >  {
> > >   /*
> > > -  * Use module_alloc() so this page is within +/- 2GB of where the
> > > +  * Use alloc_execmem() so this page is within +/- 2GB of where the
> > >* kernel image and loaded module images reside. This is required
> > >* for most of the architectures.
> > >* (e.g. x86-64 needs this to handle the %rip-relative fixups.)
> > >*/
> > > - return module_alloc(PAGE_SIZE);
> > > + return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
> > >  }
> > >  
> > >  static void free_insn_page(void *page)
> > >  {
> > > - module_memfree(page);
> > > + free_execmem(page);
> > >  }
> > >  
> > >  struct kprobe_insn_cache kprobe_insn_slots = {
> > > @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe 
> > > *p,
> > >   goto out;
> > >   }
> > >  
> > > +#ifdef CONFIG_MODULES
> >
> > You don't need this block, because these APIs have dummy functions.
>
> Hmm... I'll verify this tomorrow.

It depends on having struct module available given "(*probed_mod)->state".

It is non-existent unless CONFIG_MODULES is set given how things are
flagged in include/linux/module.h.

BR, Jarkko

Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen

On Tue Mar 26, 2024 at 2:58 AM EET, Masami Hiramatsu (Google) wrote:
> On Mon, 25 Mar 2024 23:55:01 +0200
> Jarkko Sakkinen  wrote:
>
> > Tracing with kprobes while running a monolithic kernel is currently
> > impossible because CONFIG_KPROBES depends on CONFIG_MODULES because it uses
> > the kernel module allocator.
> > 
> > Introduce alloc_textmem() and free_textmem() for allocating executable
> > memory. If an arch implements these functions, it can mark this up with
> > the HAVE_ALLOC_EXECMEM kconfig flag.
> > 
> > At first this feature will be used for enabling kprobes without
> > modules support for arch/riscv.
> > 
> > Link: 
> > https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
> > Suggested-by: Masami Hiramatsu 
> > Signed-off-by: Jarkko Sakkinen 
> > ---
> > v5:
> > - alloc_execmem() was missing GFP_KERNEL parameter. The patch set did
> >   compile because 2/2 had the fixup (leaked there when rebasing the
> >   patch set).
> > v4:
> > - Squashed a couple of unrequired CONFIG_MODULES checks.
> > - See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/
> > v3:
> > - A new patch added.
> > - For IS_DEFINED() I need advice as I could not really find that many
> >   locations where it would be applicable.
> > ---
> >  arch/Kconfig| 16 +++-
> >  include/linux/execmem.h | 13 +
> >  kernel/kprobes.c| 17 ++---
> >  kernel/trace/trace_kprobe.c |  8 
> >  4 files changed, 50 insertions(+), 4 deletions(-)
> >  create mode 100644 include/linux/execmem.h
> > 
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index a5af0edd3eb8..33ba68b7168f 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -52,8 +52,8 @@ config GENERIC_ENTRY
> >  
> >  config KPROBES
> > bool "Kprobes"
> > -   depends on MODULES
> > depends on HAVE_KPROBES
> > +   select ALLOC_EXECMEM
> > select KALLSYMS
> > select TASKS_RCU if PREEMPTION
> > help
> > @@ -215,6 +215,20 @@ config HAVE_OPTPROBES
> >  config HAVE_KPROBES_ON_FTRACE
> > bool
> >  
> > +config HAVE_ALLOC_EXECMEM
> > +   bool
> > +   help
> > + Architectures that select this option are capable of allocating 
> > executable
> > + memory, which can be used by subsystems but is not dependent of any 
> > of its
> > + clients.
> > +
> > +config ALLOC_EXECMEM
> > +   bool "Executable (trampoline) memory allocation"
> > +   depends on MODULES || HAVE_ALLOC_EXECMEM
> > +   help
> > + Select this for executable (trampoline) memory. Can be enabled when 
> > either
> > + module allocator or arch-specific allocator is available.
> > +
> >  config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
> > bool
> > help
> > diff --git a/include/linux/execmem.h b/include/linux/execmem.h
> > new file mode 100644
> > index ..ae2ff151523a
> > --- /dev/null
> > +++ b/include/linux/execmem.h
> > @@ -0,0 +1,13 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_EXECMEM_H
> > +#define _LINUX_EXECMEM_H
> > +
> > +#ifdef CONFIG_HAVE_ALLOC_EXECMEM
> > +void *alloc_execmem(unsigned long size, gfp_t gfp);
> > +void free_execmem(void *region);
> > +#else
> > +#define alloc_execmem(size, gfp)   module_alloc(size)
> > +#define free_execmem(region)   module_memfree(region)
> > +#endif
> > +
> > +#endif /* _LINUX_EXECMEM_H */
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index 9d9095e81792..87fd8c14a938 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -44,6 +44,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #define KPROBE_HASH_BITS 6
> >  #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
> > @@ -113,17 +114,17 @@ enum kprobe_slot_state {
> >  void __weak *alloc_insn_page(void)
> >  {
> > /*
> > -* Use module_alloc() so this page is within +/- 2GB of where the
> > +* Use alloc_execmem() so this page is within +/- 2GB of where the
> >  * kernel image and loaded module images reside. This is required
> >  * for most of the architectures.
> >  * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
> >  */
> > -   return module_alloc(PAGE_SIZE);
> > +   return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
> >

[PATCH v6 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen

Tacing with kprobes while running a monolithic kernel is currently
impossible due the kernel module allocator dependency.

Address the issue by implementing textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v4:
- Include linux/execmem.h.
v3:
- Architecture independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for
  this patch set now.
v2:
- Better late than never right? :-)
- Focus only to RISC-V for now to make the patch more digestable. This
  is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
 arch/riscv/Kconfig  |  1 +
 arch/riscv/kernel/Makefile  |  3 +++
 arch/riscv/kernel/execmem.c | 22 ++
 3 files changed, 26 insertions(+)
 create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
 
 obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
 obj-$(CONFIG_MODULES)  += module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y  += execmem.o
+endif
 obj-$(CONFIG_MODULE_SECTIONS)  += module-sections.o
 
 obj-$(CONFIG_CPU_PM)   += suspend_entry.o suspend.o
diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..3e52522ead32
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}
+
+void free_execmem(void *region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(region);
+}
-- 
2.44.0

[PATCH v6 1/2] kprobes: implemente trampoline memory allocator

2024-03-25 Thread Jarkko Sakkinen

Tracing with kprobes while running a monolithic kernel is currently
impossible because CONFIG_KPROBES depends on CONFIG_MODULES.

Introduce alloc_execmem() and free_execmem() for allocating executable
memory. If an arch implements these functions, it can mark this up with
the HAVE_ALLOC_EXECMEM kconfig flag.

The second new kconfig flag is ALLOC_EXECMEM, which can be selected if
either MODULES is selected or HAVE_ALLOC_EXECMEM is support by the arch. If
HAVE_ALLOC_EXECMEM is not supported by an arch, module_alloc() and
module_memfree() are used as a fallback, thus retaining backwards
compatibility to earlier kernel versions.

This will allow architecture to enable kprobes traces without requiring
to enable module.

The support can be implemented with four easy steps:

1. Implement alloc_execmem().
2. Implement free_execmem().
3. Edit arch//Makefile.
4. Set HAVE_ALLOC_EXECMEM in arch//Kconfig.

Link: 
https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
Suggested-by: Masami Hiramatsu 
Signed-off-by: Jarkko Sakkinen 
---
v6:
- Use null pointer for notifiers and register the module notifier only if
  IS_ENABLED(CONFIG_MODULES) is set.
- Fixed typo in the commit message and wrote more verbose description
  of the feature.
v5:
- alloc_execmem() was missing GFP_KERNEL parameter. The patch set did
  compile because 2/2 had the fixup (leaked there when rebasing the
  patch set).
v4:
- Squashed a couple of unrequired CONFIG_MODULES checks.
- See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/
v3:
- A new patch added.
- For IS_DEFINED() I need advice as I could not really find that many
  locations where it would be applicable.
---
 arch/Kconfig| 16 +++-
 include/linux/execmem.h | 13 +
 kernel/kprobes.c| 19 +++
 kernel/trace/trace_kprobe.c | 15 +--
 4 files changed, 56 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/execmem.h

diff --git a/arch/Kconfig b/arch/Kconfig
index a5af0edd3eb8..33ba68b7168f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,8 +52,8 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
+   select ALLOC_EXECMEM
select KALLSYMS
select TASKS_RCU if PREEMPTION
help
@@ -215,6 +215,20 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_ALLOC_EXECMEM
+   bool
+   help
+ Architectures that select this option are capable of allocating 
executable
+ memory, which can be used by subsystems but is not dependent of any 
of its
+ clients.
+
+config ALLOC_EXECMEM
+   bool "Executable (trampoline) memory allocation"
+   depends on MODULES || HAVE_ALLOC_EXECMEM
+   help
+ Select this for executable (trampoline) memory. Can be enabled when 
either
+ module allocator or arch-specific allocator is available.
+
 config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
bool
help
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
new file mode 100644
index ..ae2ff151523a
--- /dev/null
+++ b/include/linux/execmem.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_EXECMEM_H
+#define _LINUX_EXECMEM_H
+
+#ifdef CONFIG_HAVE_ALLOC_EXECMEM
+void *alloc_execmem(unsigned long size, gfp_t gfp);
+void free_execmem(void *region);
+#else
+#define alloc_execmem(size, gfp)   module_alloc(size)
+#define free_execmem(region)   module_memfree(region)
+#endif
+
+#endif /* _LINUX_EXECMEM_H */
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..9ed07a4bf9e3 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define KPROBE_HASH_BITS 6
 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
@@ -113,17 +114,17 @@ enum kprobe_slot_state {
 void __weak *alloc_insn_page(void)
 {
/*
-* Use module_alloc() so this page is within +/- 2GB of where the
+* Use alloc_execmem() so this page is within +/- 2GB of where the
 * kernel image and loaded module images reside. This is required
 * for most of the architectures.
 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 */
-   return module_alloc(PAGE_SIZE);
+   return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
 }
 
 static void free_insn_page(void *page)
 {
-   module_memfree(page);
+   free_execmem(page);
 }
 
 struct kprobe_insn_cache kprobe_insn_slots = {
@@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
+#ifdef CONFIG_MODULES
/* Check if 'p' is probing a module. */
*probed_mod = __module_text_address((unsigned long) p->addr);
if (*probed_mod) {
@@ -1603,6 +1605,8 @@ static int check_kprobe_address_safe(struct kprobe *p,

Re: [PATCH v3 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen

On Mon Mar 25, 2024 at 10:37 PM EET, Jarkko Sakkinen wrote:
> - if (ret == -ENOENT && !trace_kprobe_module_exist(tk)) {
> +#ifdef CONFIG_MODULES
> + if (ret == -ENOENT && trace_kprobe_module_exist(tk))
> + ret = 0;
> +#endif /* CONFIG_MODULES */

For this we could have

#ifndef CONFIG_MODULES
#define trace_kprobe_module_exist(tk) false
#endif

That would clean up at least two locations requiring no changes. Should
I go forward this or not?

BR, Jarkko

Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen

On Tue Mar 26, 2024 at 1:50 AM EET, Masami Hiramatsu (Google) wrote:
> On Tue, 26 Mar 2024 00:09:42 +0200
> "Jarkko Sakkinen"  wrote:
>
> > On Mon Mar 25, 2024 at 11:55 PM EET, Jarkko Sakkinen wrote:
> > > +#ifdef CONFIG_MODULES
> > >   if (register_module_notifier(_kprobe_module_nb))
> > >   return -EINVAL;
> > > +#endif /* CONFIG_MODULES */
> > 
> > register_module_notifier() does have "dummy" version but what
> > would I pass to it. It makes more mess than it cleans to declare
> > also a "dummy" version of trace_kprobe_module_nb.
>
> That is better than having #ifdef in the function.
>
> > 
> > The callback itself has too tight module subsystem bindings so
> > that they could be simply flagged with IS_DEFINED() (or correct
> > if I'm mistaken, this the conclusion I've ended up with).
>
> Please try this.
>
> -
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 70dc6179086e..bc98db14927f 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -2625,6 +2625,7 @@ static void remove_module_kprobe_blacklist(struct 
> module *mod)
>   }
>  }
>  
> +#ifdef CONFIG_MODULES
>  /* Module notifier call back, checking kprobes on the module */
>  static int kprobes_module_callback(struct notifier_block *nb,
>  unsigned long val, void *data)
> @@ -2675,6 +2676,9 @@ static int kprobes_module_callback(struct 
> notifier_block *nb,
>   mutex_unlock(_mutex);
>   return NOTIFY_DONE;
>  }
> +#else
> +#define kprobes_module_callback  (NULL)
> +#endif
>  
>  static struct notifier_block kprobe_module_nb = {
>   .notifier_call = kprobes_module_callback,
> @@ -2739,7 +2743,7 @@ static int __init init_kprobes(void)
>   err = arch_init_kprobes();
>   if (!err)
>   err = register_die_notifier(_exceptions_nb);
> - if (!err)
> + if (!err && IS_ENABLED(CONFIG_MODULES))
>   err = register_module_notifier(_module_nb);
>  
>   kprobes_initialized = (err == 0);

OK, thanks for the suggestion WFM.

I'll give this also a spin with VisionFive2 RISC-V SBC before sending
v6.

BR, Jarkko

Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen

On Mon Mar 25, 2024 at 4:56 AM EET, Masami Hiramatsu (Google) wrote:
> Hi Jarkko,
>
> On Sun, 24 Mar 2024 01:29:08 +0200
> Jarkko Sakkinen  wrote:
>
> > Tracing with kprobes while running a monolithic kernel is currently
> > impossible due the kernel module allocator dependency.
> > 
> > Address the issue by allowing architectures to implement module_alloc()
> > and module_memfree() independent of the module subsystem. An arch tree
> > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > 
> > Realize the feature on RISC-V by separating allocator to module_alloc.c
> > and implementing module_memfree().
>
> Even though, this involves changes in arch-independent part. So it should
> be solved by generic way. Did you checked Calvin's thread?
>
> https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/

Nope, has not been in my radar but for sure will check it.

I don't mind making this more generic. The  point of this version was to
put focus on single architecture and do as little as possible how the
code works right now so that it is easier to give feedback on direction.

> I think, we'd better to introduce `alloc_execmem()`,
> CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first
>
>   config HAVE_ALLOC_EXECMEM
>   bool
>
>   config ALLOC_EXECMEM
>   bool "Executable trampline memory allocation"
>   depends on MODULES || HAVE_ALLOC_EXECMEM

Right so this is logically the same as I have just with ALLOC_EXECMEM
added to cover both MODULES and HAVE_ALLOC_EXECMEM (which is essentially
the same as HAVE_ALLOC_KPROBES just with a different name).

Not at all against this. I think this factor more understandable
structuring, just "peer checking" that I understand what I'm reading :-)

> And define fallback macro to module_alloc() like this.
>
> #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> #define alloc_execmem(size, gfp)  module_alloc(size)
> #endif
>
> Then, introduce a new dependency to kprobes
>
>   config KPROBES
>   bool "Kprobes"
>   select ALLOC_EXECMEM

OK I think this is good but now I see actually two logical chunks of
work becuse this select changes how KPROBES kconfig option works. It
has previously required to manually select MODULES first.

So I'll "select MODULES" as separate patch to keep all logical changes
transparent...

>
> and update kprobes to use alloc_execmem and remove module related
> code from it.
>
> You also should consider using IS_ENABLED(CONFIG_MODULE) in the code to
> avoid using #ifdefs.
>
> Finally, you can add RISCV implementation patch of HAVE_ALLOC_EXECMEM in the
> next patch.

OK, I think the suggestions are sane and not that much drift what I have
now so works for me.

> Thank you,

BR, Jarkko

Re: [RFC][PATCH 0/4] Make bpf_jit and kprobes work with CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen

On Wed Mar 6, 2024 at 10:05 PM EET, Calvin Owens wrote:
> Hello all,
>
> This patchset makes it possible to use bpftrace with kprobes on kernels
> built without loadable module support.
>
> On a Raspberry Pi 4b, this saves about 700KB of memory where BPF is
> needed but loadable module support is not. These two kernels had
> identical configurations, except CONFIG_MODULE was off in the second:
>
>- Linux version 6.8.0-rc7
>- Memory: 3330672K/4050944K available (16576K kernel code, 2390K rwdata,
>- 12364K rodata, 5632K init, 675K bss, 195984K reserved, 524288K 
> cma-reserved)
>+ Linux version 6.8.0-rc7-3-g2af01251ca21
>+ Memory: 3331400K/4050944K available (16512K kernel code, 2384K rwdata,
>+ 11728K rodata, 5632K init, 673K bss, 195256K reserved, 524288K 
> cma-reserved)
>
> I don't intend to present an exhaustive list of !MODULES usecases, since
> I'm sure there are many I'm not aware of. Performance is a common one,
> the primary justification being that static text is mapped on hugepages
> and module text is not. Security is another, since rootkits are much
> harder to implement without modules.
>
> The first patch is the interesting one: it moves module_alloc() into its
> own file with its own Kconfig option, so it can be utilized even when
> loadable module support is disabled. I got the idea from an unmerged
> patch from a few years ago I found on lkml (see [1/4] for details). I
> think this also has value in its own right, since I suspect there are
> potential users beyond bpf, hopefully we will hear from some.
>
> Patches 2-3 are proofs of concept to demonstrate the first patch is
> sufficient to achieve my goal (full ebpf functionality without modules).
>
> Patch 4 adds a new "-n" argument to vmtest.sh to run the BPF selftests
> without modules, so the prior three patches can be rigorously tested.
>
> If something like the first patch were to eventually be merged, the rest
> could go through the normal bpf-next process as I clean them up: I've
> only based them on Linus' tree and combined them into a series here to
> introduce the idea.
>
> If you prefer to fetch the patches via git:
>
>   [1/4] https://github.com/jcalvinowens/linux.git work/module-alloc
>  +[2/4]+[3/4] https://github.com/jcalvinowens/linux.git work/nomodule-bpf
>  +[4/4] https://github.com/jcalvinowens/linux.git testing/nomodule-bpf-ci
>
> In addition to the automated BPF selftests, I've lightly tested this on
> my laptop (x86_64), a Raspberry Pi 4b (arm64), and a Raspberry Pi Zero W
> (arm). The other architectures have only been compile tested.
>
> I didn't want to spam all the arch maintainers with what I expect will
> be a discussion mostly about modules and bpf, so I've left them off this
> first submission. I will be sure to add them on future submissions of
> the first patch. Of course, feedback on the arch bits is welcome here.
>
> In addition to feedback on the patches themselves, I'm interested in
> hearing from anybody else who might find this functionality useful.
>
> Thanks,
> Calvin
>
>
> Calvin Owens (4):
>   module: mm: Make module_alloc() generally available
>   bpf: Allow BPF_JIT with CONFIG_MODULES=n
>   kprobes: Allow kprobes with CONFIG_MODULES=n
>   selftests/bpf: Support testing the !MODULES case
>
>  arch/Kconfig  |   4 +-
>  arch/arm/kernel/module.c  |  35 -
>  arch/arm/mm/Makefile  |   2 +
>  arch/arm/mm/module_alloc.c|  40 ++
>  arch/arm64/kernel/module.c| 127 -
>  arch/arm64/mm/Makefile|   1 +
>  arch/arm64/mm/module_alloc.c  | 130 ++
>  arch/loongarch/kernel/module.c|   6 -
>  arch/loongarch/mm/Makefile|   2 +
>  arch/loongarch/mm/module_alloc.c  |  10 ++
>  arch/mips/kernel/module.c |  10 --
>  arch/mips/mm/Makefile |   2 +
>  arch/mips/mm/module_alloc.c   |  13 ++
>  arch/nios2/kernel/module.c|  20 ---
>  arch/nios2/mm/Makefile|   2 +
>  arch/nios2/mm/module_alloc.c  |  22 +++
>  arch/parisc/kernel/module.c   |  12 --
>  arch/parisc/mm/Makefile   |   1 +
>  arch/parisc/mm/module_alloc.c |  15 ++
>  arch/powerpc/kernel/module.c  |  36 -
>  arch/powerpc/mm/Makefile  |   1 +
>  arch/powerpc/mm/module_alloc.c|  41 ++
>  arch/riscv/kernel/module.c|  11 --
>  arch/riscv/mm/Makefile|   1 +
>  arch/riscv/mm/module_alloc.c  |  17 +++
>  arch/s390/kernel/module.c |  37 -
>  arch/s390/mm/Makefile |   1 +
>  arch/s390/mm/module_alloc.c   |  42 ++
>  arch/sparc/kernel/module.c

Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen

On Tue Mar 26, 2024 at 12:09 AM EET, Jarkko Sakkinen wrote:
> On Mon Mar 25, 2024 at 11:55 PM EET, Jarkko Sakkinen wrote:
> > +#ifdef CONFIG_MODULES
> > if (register_module_notifier(_kprobe_module_nb))
> > return -EINVAL;
> > +#endif /* CONFIG_MODULES */
>
> register_module_notifier() does have "dummy" version but what
> would I pass to it. It makes more mess than it cleans to declare
> also a "dummy" version of trace_kprobe_module_nb.
>
> The callback itself has too tight module subsystem bindings so
> that they could be simply flagged with IS_DEFINED() (or correct
> if I'm mistaken, this the conclusion I've ended up with).

One way to clean that up would be to create trace_kprobe_module.c and
move kernel module specific code over there and then change
kernel/trace/Makefile as follows:

ifeq ($(CONFIG_PERF_EVENTS),y)
obj-y += trace_kprobe.o
obj-$(CONFIG_MODULES) += trace_kprobe_module.o
endif

and define trace_kprobe_module_init() or similar to do all the dance
with notifiers etc.

This crossed my mind but did not want to do it without feedback.

BR, Jarkko

Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen

On Mon Mar 25, 2024 at 11:55 PM EET, Jarkko Sakkinen wrote:
> +#ifdef CONFIG_MODULES
>   if (register_module_notifier(_kprobe_module_nb))
>   return -EINVAL;
> +#endif /* CONFIG_MODULES */

register_module_notifier() does have "dummy" version but what
would I pass to it. It makes more mess than it cleans to declare
also a "dummy" version of trace_kprobe_module_nb.

The callback itself has too tight module subsystem bindings so
that they could be simply flagged with IS_DEFINED() (or correct
if I'm mistaken, this the conclusion I've ended up with).

BR, Jarkko

Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen

s/textmem/execmem/ (also in long description)

will hold sending a new version as not a functional issue, will fix
after review.

BR, Jarkko

[PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen

Tacing with kprobes while running a monolithic kernel is currently
impossible due the kernel module allocator dependency.

Address the issue by implementing textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v5:
- No changes, expect removing alloc_execmem() call which should have
  been part of the previous patch.
v4:
- Include linux/execmem.h.
v3:
- Architecture independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for
  this patch set now.
v2:
- Better late than never right? :-)
- Focus only to RISC-V for now to make the patch more digestable. This
  is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
 arch/riscv/Kconfig  |  1 +
 arch/riscv/kernel/Makefile  |  3 +++
 arch/riscv/kernel/execmem.c | 22 ++
 3 files changed, 26 insertions(+)
 create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
 
 obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
 obj-$(CONFIG_MODULES)  += module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y  += execmem.o
+endif
 obj-$(CONFIG_MODULE_SECTIONS)  += module-sections.o
 
 obj-$(CONFIG_CPU_PM)   += suspend_entry.o suspend.o
diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..3e52522ead32
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}
+
+void free_execmem(void *region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(region);
+}
-- 
2.44.0

[PATCH v5 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen

Tracing with kprobes while running a monolithic kernel is currently
impossible because CONFIG_KPROBES depends on CONFIG_MODULES because it uses
the kernel module allocator.

Introduce alloc_textmem() and free_textmem() for allocating executable
memory. If an arch implements these functions, it can mark this up with
the HAVE_ALLOC_EXECMEM kconfig flag.

At first this feature will be used for enabling kprobes without
modules support for arch/riscv.

Link: 
https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
Suggested-by: Masami Hiramatsu 
Signed-off-by: Jarkko Sakkinen 
---
v5:
- alloc_execmem() was missing GFP_KERNEL parameter. The patch set did
  compile because 2/2 had the fixup (leaked there when rebasing the
  patch set).
v4:
- Squashed a couple of unrequired CONFIG_MODULES checks.
- See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/
v3:
- A new patch added.
- For IS_DEFINED() I need advice as I could not really find that many
  locations where it would be applicable.
---
 arch/Kconfig| 16 +++-
 include/linux/execmem.h | 13 +
 kernel/kprobes.c| 17 ++---
 kernel/trace/trace_kprobe.c |  8 
 4 files changed, 50 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/execmem.h

diff --git a/arch/Kconfig b/arch/Kconfig
index a5af0edd3eb8..33ba68b7168f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,8 +52,8 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
+   select ALLOC_EXECMEM
select KALLSYMS
select TASKS_RCU if PREEMPTION
help
@@ -215,6 +215,20 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_ALLOC_EXECMEM
+   bool
+   help
+ Architectures that select this option are capable of allocating 
executable
+ memory, which can be used by subsystems but is not dependent of any 
of its
+ clients.
+
+config ALLOC_EXECMEM
+   bool "Executable (trampoline) memory allocation"
+   depends on MODULES || HAVE_ALLOC_EXECMEM
+   help
+ Select this for executable (trampoline) memory. Can be enabled when 
either
+ module allocator or arch-specific allocator is available.
+
 config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
bool
help
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
new file mode 100644
index ..ae2ff151523a
--- /dev/null
+++ b/include/linux/execmem.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_EXECMEM_H
+#define _LINUX_EXECMEM_H
+
+#ifdef CONFIG_HAVE_ALLOC_EXECMEM
+void *alloc_execmem(unsigned long size, gfp_t gfp);
+void free_execmem(void *region);
+#else
+#define alloc_execmem(size, gfp)   module_alloc(size)
+#define free_execmem(region)   module_memfree(region)
+#endif
+
+#endif /* _LINUX_EXECMEM_H */
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..87fd8c14a938 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define KPROBE_HASH_BITS 6
 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
@@ -113,17 +114,17 @@ enum kprobe_slot_state {
 void __weak *alloc_insn_page(void)
 {
/*
-* Use module_alloc() so this page is within +/- 2GB of where the
+* Use alloc_execmem() so this page is within +/- 2GB of where the
 * kernel image and loaded module images reside. This is required
 * for most of the architectures.
 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 */
-   return module_alloc(PAGE_SIZE);
+   return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
 }
 
 static void free_insn_page(void *page)
 {
-   module_memfree(page);
+   free_execmem(page);
 }
 
 struct kprobe_insn_cache kprobe_insn_slots = {
@@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
+#ifdef CONFIG_MODULES
/* Check if 'p' is probing a module. */
*probed_mod = __module_text_address((unsigned long) p->addr);
if (*probed_mod) {
@@ -1603,6 +1605,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
ret = -ENOENT;
}
}
+#endif
+
 out:
preempt_enable();
jump_label_unlock();
@@ -2482,6 +2486,7 @@ int kprobe_add_area_blacklist(unsigned long start, 
unsigned long end)
return 0;
 }
 
+#ifdef CONFIG_MODULES
 /* Remove all symbols in given area from kprobe blacklist */
 static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
end)
 {
@@ -2499,6 +2504,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long 
entry)
 {
kprobe_remove_area_blacklist(entry, entry + 1);
 }
+#endif /* CONFIG_MODULES */
 
 int __weak arch_kprobe_get_kallsym(unsigned int

[PATCH v3 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen

Tracing with kprobes while running a monolithic kernel is currently
impossible because CONFIG_KPROBES depends on CONFIG_MODULES because it uses
the kernel module allocator.

Introduce alloc_textmem() and free_textmem() for allocating executable
memory. If an arch implements these functions, it can mark this up with
the HAVE_ALLOC_EXECMEM kconfig flag.

At first this feature will be used for enabling kprobes without
modules support for arch/riscv.

Link: 
https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
Suggested-by: Masami Hiramatsu 
Signed-off-by: Jarkko Sakkinen 
---
v3:
- A new patch added.
- For IS_DEFINED() I need advice as I could not really find that many
  locations where it would be applicable.
---
 arch/Kconfig| 16 +++-
 include/linux/execmem.h | 13 +
 kernel/kprobes.c| 17 ++---
 kernel/trace/trace_kprobe.c | 18 --
 4 files changed, 58 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/execmem.h

diff --git a/arch/Kconfig b/arch/Kconfig
index a5af0edd3eb8..33ba68b7168f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,8 +52,8 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
+   select ALLOC_EXECMEM
select KALLSYMS
select TASKS_RCU if PREEMPTION
help
@@ -215,6 +215,20 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_ALLOC_EXECMEM
+   bool
+   help
+ Architectures that select this option are capable of allocating 
executable
+ memory, which can be used by subsystems but is not dependent of any 
of its
+ clients.
+
+config ALLOC_EXECMEM
+   bool "Executable (trampoline) memory allocation"
+   depends on MODULES || HAVE_ALLOC_EXECMEM
+   help
+ Select this for executable (trampoline) memory. Can be enabled when 
either
+ module allocator or arch-specific allocator is available.
+
 config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
bool
help
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
new file mode 100644
index ..ae2ff151523a
--- /dev/null
+++ b/include/linux/execmem.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_EXECMEM_H
+#define _LINUX_EXECMEM_H
+
+#ifdef CONFIG_HAVE_ALLOC_EXECMEM
+void *alloc_execmem(unsigned long size, gfp_t gfp);
+void free_execmem(void *region);
+#else
+#define alloc_execmem(size, gfp)   module_alloc(size)
+#define free_execmem(region)   module_memfree(region)
+#endif
+
+#endif /* _LINUX_EXECMEM_H */
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..a1a547723c3c 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define KPROBE_HASH_BITS 6
 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
@@ -113,17 +114,17 @@ enum kprobe_slot_state {
 void __weak *alloc_insn_page(void)
 {
/*
-* Use module_alloc() so this page is within +/- 2GB of where the
+* Use alloc_execmem() so this page is within +/- 2GB of where the
 * kernel image and loaded module images reside. This is required
 * for most of the architectures.
 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 */
-   return module_alloc(PAGE_SIZE);
+   return alloc_execmem(PAGE_SIZE);
 }
 
 static void free_insn_page(void *page)
 {
-   module_memfree(page);
+   free_execmem(page);
 }
 
 struct kprobe_insn_cache kprobe_insn_slots = {
@@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
+#ifdef CONFIG_MODULES
/* Check if 'p' is probing a module. */
*probed_mod = __module_text_address((unsigned long) p->addr);
if (*probed_mod) {
@@ -1603,6 +1605,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
ret = -ENOENT;
}
}
+#endif
+
 out:
preempt_enable();
jump_label_unlock();
@@ -2482,6 +2486,7 @@ int kprobe_add_area_blacklist(unsigned long start, 
unsigned long end)
return 0;
 }
 
+#ifdef CONFIG_MODULES
 /* Remove all symbols in given area from kprobe blacklist */
 static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
end)
 {
@@ -2499,6 +2504,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long 
entry)
 {
kprobe_remove_area_blacklist(entry, entry + 1);
 }
+#endif /* CONFIG_MODULES */
 
 int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
   char *type, char *sym)
@@ -2564,6 +2570,7 @@ static int __init populate_kprobe_blacklist(unsigned long 
*start,
return ret ? : arch_populate_kprobe_blacklist();
 }
 
+#ifdef CONFIG_MODULES
 static void add_module_kprobe_blacklist(struct mod

[PATCH v4 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen

Tacing with kprobes while running a monolithic kernel is currently
impossible due the kernel module allocator dependency.

Address the issue by implementing textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v4:
- Include linux/execmem.h.
v3:
- Architecture independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for
  this patch set now.
v2:
- Better late than never right? :-)
- Focus only to RISC-V for now to make the patch more digestable. This
  is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
 arch/riscv/Kconfig  |  1 +
 arch/riscv/kernel/Makefile  |  3 +++
 arch/riscv/kernel/execmem.c | 22 ++
 kernel/kprobes.c|  2 +-
 4 files changed, 27 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
 
 obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
 obj-$(CONFIG_MODULES)  += module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y  += execmem.o
+endif
 obj-$(CONFIG_MODULE_SECTIONS)  += module-sections.o
 
 obj-$(CONFIG_CPU_PM)   += suspend_entry.o suspend.o
diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..3e52522ead32
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}
+
+void free_execmem(void *region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(region);
+}
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index a1a547723c3c..87fd8c14a938 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -119,7 +119,7 @@ void __weak *alloc_insn_page(void)
 * for most of the architectures.
 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 */
-   return alloc_execmem(PAGE_SIZE);
+   return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
 }
 
 static void free_insn_page(void *page)
-- 
2.44.0

[PATCH v4 1/2] kprobes: textmem API

2024-03-25 Thread Jarkko Sakkinen

Tracing with kprobes while running a monolithic kernel is currently
impossible because CONFIG_KPROBES depends on CONFIG_MODULES because it uses
the kernel module allocator.

Introduce alloc_textmem() and free_textmem() for allocating executable
memory. If an arch implements these functions, it can mark this up with
the HAVE_ALLOC_EXECMEM kconfig flag.

At first this feature will be used for enabling kprobes without
modules support for arch/riscv.

Link: 
https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
Suggested-by: Masami Hiramatsu 
Signed-off-by: Jarkko Sakkinen 
---
v4:
- Squashed a couple of unrequired CONFIG_MODULES checks.
- See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/
v3:
- A new patch added.
- For IS_DEFINED() I need advice as I could not really find that many
  locations where it would be applicable.
---
 arch/Kconfig| 16 +++-
 include/linux/execmem.h | 13 +
 kernel/kprobes.c| 17 ++---
 kernel/trace/trace_kprobe.c |  8 
 4 files changed, 50 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/execmem.h

diff --git a/arch/Kconfig b/arch/Kconfig
index a5af0edd3eb8..33ba68b7168f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,8 +52,8 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
+   select ALLOC_EXECMEM
select KALLSYMS
select TASKS_RCU if PREEMPTION
help
@@ -215,6 +215,20 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_ALLOC_EXECMEM
+   bool
+   help
+ Architectures that select this option are capable of allocating 
executable
+ memory, which can be used by subsystems but is not dependent of any 
of its
+ clients.
+
+config ALLOC_EXECMEM
+   bool "Executable (trampoline) memory allocation"
+   depends on MODULES || HAVE_ALLOC_EXECMEM
+   help
+ Select this for executable (trampoline) memory. Can be enabled when 
either
+ module allocator or arch-specific allocator is available.
+
 config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
bool
help
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
new file mode 100644
index ..ae2ff151523a
--- /dev/null
+++ b/include/linux/execmem.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_EXECMEM_H
+#define _LINUX_EXECMEM_H
+
+#ifdef CONFIG_HAVE_ALLOC_EXECMEM
+void *alloc_execmem(unsigned long size, gfp_t gfp);
+void free_execmem(void *region);
+#else
+#define alloc_execmem(size, gfp)   module_alloc(size)
+#define free_execmem(region)   module_memfree(region)
+#endif
+
+#endif /* _LINUX_EXECMEM_H */
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..a1a547723c3c 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define KPROBE_HASH_BITS 6
 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
@@ -113,17 +114,17 @@ enum kprobe_slot_state {
 void __weak *alloc_insn_page(void)
 {
/*
-* Use module_alloc() so this page is within +/- 2GB of where the
+* Use alloc_execmem() so this page is within +/- 2GB of where the
 * kernel image and loaded module images reside. This is required
 * for most of the architectures.
 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 */
-   return module_alloc(PAGE_SIZE);
+   return alloc_execmem(PAGE_SIZE);
 }
 
 static void free_insn_page(void *page)
 {
-   module_memfree(page);
+   free_execmem(page);
 }
 
 struct kprobe_insn_cache kprobe_insn_slots = {
@@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
+#ifdef CONFIG_MODULES
/* Check if 'p' is probing a module. */
*probed_mod = __module_text_address((unsigned long) p->addr);
if (*probed_mod) {
@@ -1603,6 +1605,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
ret = -ENOENT;
}
}
+#endif
+
 out:
preempt_enable();
jump_label_unlock();
@@ -2482,6 +2486,7 @@ int kprobe_add_area_blacklist(unsigned long start, 
unsigned long end)
return 0;
 }
 
+#ifdef CONFIG_MODULES
 /* Remove all symbols in given area from kprobe blacklist */
 static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
end)
 {
@@ -2499,6 +2504,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long 
entry)
 {
kprobe_remove_area_blacklist(entry, entry + 1);
 }
+#endif /* CONFIG_MODULES */
 
 int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
   char *type, char *sym)
@@ -2564,6 +2570,7 @@ static int __init populate_kprobe_blacklist(unsigned lon

Re: [PATCH v3 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen

On Mon Mar 25, 2024 at 10:37 PM EET, Jarkko Sakkinen wrote:
> Tacing with kprobes while running a monolithic kernel is currently
> impossible due the kernel module allocator dependency.
>
> Address the issue by implementing textmem API for RISC-V.
>
> Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
> stack
> Link: 
> https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # 
> continuation
> Signed-off-by: Jarkko Sakkinen 

I think that for any use case it is best for overall good to realize it
like this. I.e. only create patch sets related to the topic that change
behavior for arch's that are in your heavy use. For me that mean x86
and RISC-V.

That is why I shrinked this from to focus into more narrow scope.

For microarch's more alien to one, it is just too easy to make sloppy
mistakes, which could cause unwanted harm. E.g. it is for best of
arch/sh that someone involved with that microarchitecture does later
on the shenanigans.

BR, Jarkko

[PATCH v3 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen

Tacing with kprobes while running a monolithic kernel is currently
impossible due the kernel module allocator dependency.

Address the issue by implementing textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v3:
- Architecture independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for
  this patch set now.
v2:
- Better late than never right? :-)
- Focus only to RISC-V for now to make the patch more digestable. This
  is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
 arch/riscv/Kconfig  |  1 +
 arch/riscv/kernel/Makefile  |  3 +++
 arch/riscv/kernel/execmem.c | 22 ++
 kernel/kprobes.c|  2 +-
 4 files changed, 27 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
 
 obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
 obj-$(CONFIG_MODULES)  += module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y  += execmem.o
+endif
 obj-$(CONFIG_MODULE_SECTIONS)  += module-sections.o
 
 obj-$(CONFIG_CPU_PM)   += suspend_entry.o suspend.o
diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..4191251476d0
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}
+
+void free_execmem(void *region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(region);
+}
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index a1a547723c3c..87fd8c14a938 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -119,7 +119,7 @@ void __weak *alloc_insn_page(void)
 * for most of the architectures.
 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 */
-   return alloc_execmem(PAGE_SIZE);
+   return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
 }
 
 static void free_insn_page(void *page)
-- 
2.44.0

Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen

On Mon Mar 25, 2024 at 9:11 PM EET, Jarkko Sakkinen wrote:
> On Mon Mar 25, 2024 at 8:37 PM EET, Jarkko Sakkinen wrote:
> > > You also should consider using IS_ENABLED(CONFIG_MODULE) in the code to
> > > avoid using #ifdefs.
>
> Hmm... I need make a couple of remarks but open for feedback ofc.
>
> First, trace_kprobe_module_exist depends on find_module()
>
> Second, there is a notifier callback that heavily binds to the module
> subsystem.
>
> In both cases using IS_ENABLED would emit a lot of compilation errors.

Also I think adding 'gfp' makes sense exactly at the point as it has
a use case, i.e. two call sites with differing flags. It makes sense
but should be IMHO added exactly at that time.

Leaving it from my patch set does not do any measurable harm but please
correct if I'm missing something.

BR, Jarkko

Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-25 Thread Jarkko Sakkinen

On Mon Mar 25, 2024 at 8:37 PM EET, Jarkko Sakkinen wrote:
> > You also should consider using IS_ENABLED(CONFIG_MODULE) in the code to
> > avoid using #ifdefs.

Hmm... I need make a couple of remarks but open for feedback ofc.

First, trace_kprobe_module_exist depends on find_module()

Second, there is a notifier callback that heavily binds to the module
subsystem.

In both cases using IS_ENABLED would emit a lot of compilation errors.

BR, Jarkko

Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-23 Thread Jarkko Sakkinen

On Sun Mar 24, 2024 at 2:37 AM EET, Randy Dunlap wrote:
>
>
> On 3/23/24 16:29, Jarkko Sakkinen wrote:
> > +config HAVE_KPROBES_ALLOC
> > +   bool
> > +   help
> > + Architectures that select this option are capable of allocating memory
> > + for kprobes withou the kernel module allocator.
>
> without

Thanks, will fix.

BR, Jarkko

Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-23 Thread Jarkko Sakkinen

On Sun Mar 24, 2024 at 1:29 AM EET, Jarkko Sakkinen wrote:
> Tracing with kprobes while running a monolithic kernel is currently
> impossible due the kernel module allocator dependency.
>
> Address the issue by allowing architectures to implement module_alloc()
> and module_memfree() independent of the module subsystem. An arch tree
> can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
>
> Realize the feature on RISC-V by separating allocator to module_alloc.c
> and implementing module_memfree().
>
> Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
> stack
> Link: 
> https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # 
> continuation
> Signed-off-by: Jarkko Sakkinen 

As for testing I tried the kprobes example for boottime tracing
dcoumentation:

https://www.kernel.org/doc/html/v5.7/trace/boottime-trace.html

I.e.

ftrace.event {
kprobes.vfs_read {
probes = "vfs_read $arg1 $arg2"
filter = "common_pid < 100"
enable
}
}

kernel {
console = hvc0
earlycon = sbi
trace_options = sym-addr
trace_event = "initcall:*"
tp_printk
dump_on_oops = 2
trace_buf_size = 1M
}

BR, Jarkko

[PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-23 Thread Jarkko Sakkinen

Tracing with kprobes while running a monolithic kernel is currently
impossible due the kernel module allocator dependency.

Address the issue by allowing architectures to implement module_alloc()
and module_memfree() independent of the module subsystem. An arch tree
can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.

Realize the feature on RISC-V by separating allocator to module_alloc.c
and implementing module_memfree().

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v2:
- Better late than never right? :-)
- Focus only to RISC-V for now to make the patch more digestable. This
  is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
 arch/Kconfig |  8 +++-
 arch/riscv/Kconfig   |  1 +
 arch/riscv/kernel/Makefile   |  5 +
 arch/riscv/kernel/module.c   | 11 ---
 arch/riscv/kernel/module_alloc.c | 28 
 kernel/kprobes.c | 10 ++
 kernel/trace/trace_kprobe.c  | 18 --
 7 files changed, 67 insertions(+), 14 deletions(-)
 create mode 100644 arch/riscv/kernel/module_alloc.c

diff --git a/arch/Kconfig b/arch/Kconfig
index a5af0edd3eb8..c931f1de98a7 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,7 +52,7 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
+   depends on MODULES || HAVE_KPROBES_ALLOC
depends on HAVE_KPROBES
select KALLSYMS
select TASKS_RCU if PREEMPTION
@@ -215,6 +215,12 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_KPROBES_ALLOC
+   bool
+   help
+ Architectures that select this option are capable of allocating memory
+ for kprobes withou the kernel module allocator.
+
 config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
bool
help
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..4f1b925e83d8 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_KPROBES_ALLOC if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..46318194bce1 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,11 @@ obj-$(CONFIG_SMP)+= cpu_ops.o
 
 obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
 obj-$(CONFIG_MODULES)  += module.o
+ifeq ($(CONFIG_MODULES),y)
+obj-y  += module_alloc.o
+else
+obj-$(CONFIG_KPROBES)  += module_alloc.o
+endif
 obj-$(CONFIG_MODULE_SECTIONS)  += module-sections.o
 
 obj-$(CONFIG_CPU_PM)   += suspend_entry.o suspend.o
diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index 5e5a82644451..cc324b450f2e 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -905,17 +905,6 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char 
*strtab,
return 0;
 }
 
-#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
-void *module_alloc(unsigned long size)
-{
-   return __vmalloc_node_range(size, 1, MODULES_VADDR,
-   MODULES_END, GFP_KERNEL,
-   PAGE_KERNEL, VM_FLUSH_RESET_PERMS,
-   NUMA_NO_NODE,
-   __builtin_return_address(0));
-}
-#endif
-
 int module_finalize(const Elf_Ehdr *hdr,
const Elf_Shdr *sechdrs,
struct module *me)
diff --git a/arch/riscv/kernel/module_alloc.c b/arch/riscv/kernel/module_alloc.c
new file mode 100644
index ..3d9aa8dbca8a
--- /dev/null
+++ b/arch/riscv/kernel/module_alloc.c
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ *  Copyright (c) 2017 Zihao Yu
+ *  Copyright (c) 2024 Jarkko Sakkinen
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
+void *module_alloc(unsigned long size)
+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}
+
+void module_memfree(void *module_region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(module_region);
+}
+

Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-27 Thread Jarkko Sakkinen

On Mon Feb 26, 2024 at 11:56 PM EET, Dave Hansen wrote:
> On 2/26/24 13:48, Haitao Huang wrote:
> > In case of overcomitting, i.e., sum of limits greater than the EPC
> > capacity, if one group has a fault, and its usage is not above its own
> > limit (try_charge() passes), yet total usage of the system has exceeded
> > the capacity, whether we do global reclaim or just reclaim pages in the
> > current faulting group.
>
> I don't see _any_ reason to limit reclaim to the current faulting cgroup.
>
> >> Last, what is the simplest (least amount of code) thing that the SGX
> >> cgroup controller could implement here?
> > 
> > I still think the current approach of doing global reclaim is reasonable
> > and simple: try_charge() checks cgroup limit and reclaim within the
> > group if needed, then do EPC page allocation, reclaim globally if
> > allocation fails due to global usage reaches the capacity.
> > 
> > I'm not sure how not doing global reclaiming in this case would bring
> > any benefit.
> I tend to agree.
>
> Kai, I think your examples sound a little bit contrived.  Have actual
> users expressed a strong intent for doing anything with this series
> other than limiting bad actors from eating all the EPC?

I'd consider this from the viewpoint is there anything in the user space
visible portion of the patch set that would limit tuning the performance
later on, if required let's say by a workload that acts sub-optimally.

If not, then most of performance related issues can be only identified
by actual use of the code.

BR, Jarkko

Re: [RFC PATCH] x86/sgx: Remove 'reclaim' boolean parameters

2024-02-19 Thread Jarkko Sakkinen

On Mon Feb 19, 2024 at 10:25 PM UTC, Haitao Huang wrote:
> On Mon, 19 Feb 2024 14:42:29 -0600, Jarkko Sakkinen   
> wrote:
>
> > On Mon Feb 19, 2024 at 3:56 PM UTC, Dave Hansen wrote:
> >> On 2/19/24 07:39, Haitao Huang wrote:
> >> > Remove all boolean parameters for 'reclaim' from the function
> >> > sgx_alloc_epc_page() and its callers by making two versions of each
> >> > function.
> >> >
> >> > Also opportunistically remove non-static declaration of
> >> > __sgx_alloc_epc_page() and a typo
> >> >
> >> > Signed-off-by: Haitao Huang 
> >> > Suggested-by: Jarkko Sakkinen 
> >> > ---
> >> >  arch/x86/kernel/cpu/sgx/encl.c  | 56 +--
> >> >  arch/x86/kernel/cpu/sgx/encl.h  |  6 ++-
> >> >  arch/x86/kernel/cpu/sgx/ioctl.c | 23 ---
> >> >  arch/x86/kernel/cpu/sgx/main.c  | 68  
> >> ++---
> >> >  arch/x86/kernel/cpu/sgx/sgx.h   |  4 +-
> >> >  arch/x86/kernel/cpu/sgx/virt.c  |  2 +-
> >> >  6 files changed, 115 insertions(+), 44 deletions(-)
> >>
> >> Jarkko, did this turn out how you expected?
> >>
> >> I think passing around a function pointer to *only* communicate 1 bit of
> >> information is a _bit_ overkill here.
> >>
> >> Simply replacing the bool with:
> >>
> >> enum sgx_reclaim {
> >>SGX_NO_RECLAIM,
> >>SGX_DO_RECLAIM
> >> };
> >>
> >> would do the same thing.  Right?
> >>
> >> Are you sure you want a function pointer for this?
> >
> > To look this in context I drafted quickly two branches representing
> > imaginary next version of the patch set.
> >
> > I guess this would simpler and totally sufficient approach.
> >
> > With this approach I'd then change also:
> >
> > [PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality
> >
> > And add the enum-parameter already in that patch with just "no reclaim"
> > enum. I.e. then 10/15 will add only "do reclaim" and the new
> > functionality.
> >
> > BR, Jarkko
> >
>
> Thanks. My understanding is:
>
> 1) For this patch, replace the boolean with the enum as Dave suggested. No  
> two versions of the same functions. And this is a prerequisite for the  
> cgroup series, positioned before [PATCH v9 04/15]
>
> 2) For [PATCH v9 04/15], pass a hard coded SGX_NO_RECLAIM to  
> sgx_epc_cg_try_charge() from sgx_alloc_epc_page().

Yup, this will make the whole patch set also a bit leaner as the API
does not change in the middle.

>
> 3) For [PATCH v9 10/15], remove the hard coded value, pass the reclaim  
> enum parameter value from sgx_alloc_epc_page() to  sgx_epc_cg_try_charge()  
> and add the reclaim logic.
>
> I'll send patches soon. But please let me know if I misunderstood.


BR, Jarkko

Re: [RFC PATCH] x86/sgx: Remove 'reclaim' boolean parameters

2024-02-19 Thread Jarkko Sakkinen

On Mon Feb 19, 2024 at 3:56 PM UTC, Dave Hansen wrote:
> On 2/19/24 07:39, Haitao Huang wrote:
> > Remove all boolean parameters for 'reclaim' from the function
> > sgx_alloc_epc_page() and its callers by making two versions of each
> > function.
> > 
> > Also opportunistically remove non-static declaration of
> > __sgx_alloc_epc_page() and a typo
> > 
> > Signed-off-by: Haitao Huang 
> > Suggested-by: Jarkko Sakkinen 
> > ---
> >  arch/x86/kernel/cpu/sgx/encl.c  | 56 +--
> >  arch/x86/kernel/cpu/sgx/encl.h  |  6 ++-
> >  arch/x86/kernel/cpu/sgx/ioctl.c | 23 ---
> >  arch/x86/kernel/cpu/sgx/main.c  | 68 ++---
> >  arch/x86/kernel/cpu/sgx/sgx.h   |  4 +-
> >  arch/x86/kernel/cpu/sgx/virt.c  |  2 +-
> >  6 files changed, 115 insertions(+), 44 deletions(-)
>
> Jarkko, did this turn out how you expected?
>
> I think passing around a function pointer to *only* communicate 1 bit of
> information is a _bit_ overkill here.
>
> Simply replacing the bool with:
>
> enum sgx_reclaim {
>   SGX_NO_RECLAIM,
>   SGX_DO_RECLAIM
> };
>
> would do the same thing.  Right?
>
> Are you sure you want a function pointer for this?

To look this in context I drafted quickly two branches representing
imaginary next version of the patch set.

I guess this would simpler and totally sufficient approach.

With this approach I'd then change also:

[PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality

And add the enum-parameter already in that patch with just "no reclaim"
enum. I.e. then 10/15 will add only "do reclaim" and the new
functionality.

BR, Jarkko

Re: [RFC PATCH] x86/sgx: Remove 'reclaim' boolean parameters

2024-02-19 Thread Jarkko Sakkinen

On Mon Feb 19, 2024 at 3:39 PM UTC, Haitao Huang wrote:
> Remove all boolean parameters for 'reclaim' from the function
> sgx_alloc_epc_page() and its callers by making two versions of each
> function.
>
> Also opportunistically remove non-static declaration of
> __sgx_alloc_epc_page() and a typo
>
> Signed-off-by: Haitao Huang 
> Suggested-by: Jarkko Sakkinen 

I think this is for better.

My view point for kernel patches overally is that:

1. A feature should leave the subsystem in cleaner state as
   far as the existing framework of doing things goes.
2. A bugfix can sometimes do the opposite if corner case
   requires some weird dance to perform.

BR, Jarkko

Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-19 Thread Jarkko Sakkinen

On Mon Feb 19, 2024 at 3:12 PM UTC, Haitao Huang wrote:
> On Tue, 13 Feb 2024 19:52:25 -0600, Jarkko Sakkinen   
> wrote:
>
> > On Tue Feb 13, 2024 at 1:15 AM EET, Haitao Huang wrote:
> >> Hi Jarkko
> >>
> >> On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen 
> >> wrote:
> >>
> >> > On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
> >> >> From: Kristen Carlson Accardi 
> >> >>
> >> >> When the EPC usage of a cgroup is near its limit, the cgroup needs to
> >> >> reclaim pages used in the same cgroup to make room for new  
> >> allocations.
> >> >> This is analogous to the behavior that the global reclaimer is  
> >> triggered
> >> >> when the global usage is close to total available EPC.
> >> >>
> >> >> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
> >> >> whether synchronous reclaim is allowed or not. And trigger the
> >> >> synchronous/asynchronous reclamation flow accordingly.
> >> >>
> >> >> Note at this point, all reclaimable EPC pages are still tracked in  
> >> the
> >> >> global LRU and per-cgroup LRUs are empty. So no per-cgroup  
> >> reclamation
> >> >> is activated yet.
> >> >>
> >> >> Co-developed-by: Sean Christopherson  
> >> 
> >> >> Signed-off-by: Sean Christopherson 
> >> >> Signed-off-by: Kristen Carlson Accardi 
> >> >> Co-developed-by: Haitao Huang 
> >> >> Signed-off-by: Haitao Huang 
> >> >> ---
> >> >> V7:
> >> >> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> >> >> ---
> >> >>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 --
> >> >>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  4 ++--
> >> >>  arch/x86/kernel/cpu/sgx/main.c   |  2 +-
> >> >>  3 files changed, 27 insertions(+), 5 deletions(-)
> >> >>
> >> >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> >> index d399fda2b55e..abf74fdb12b4 100644
> >> >> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> >> @@ -184,13 +184,35 @@ static void
> >> >> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
> >> >>  /**
> >> >>   * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single  
> >> EPC
> >> >> page
> >> >>   * @epc_cg:The EPC cgroup to be charged for the page.
> >> >> + * @reclaim:   Whether or not synchronous reclaim is allowed
> >> >>   * Return:
> >> >>   * * %0 - If successfully charged.
> >> >>   * * -errno - for failures.
> >> >>   */
> >> >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> >> >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
> >> >> reclaim)
> >> >>  {
> >> >> -   return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,  
> >> PAGE_SIZE);
> >> >> +   for (;;) {
> >> >> +   if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> >> >> +   PAGE_SIZE))
> >> >> +   break;
> >> >> +
> >> >> +   if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
> >> >> +   return -ENOMEM;
> >> >> + + if (signal_pending(current))
> >> >> +   return -ERESTARTSYS;
> >> >> +
> >> >> +   if (!reclaim) {
> >> >> +   queue_work(sgx_epc_cg_wq, 
> >> >> _cg->reclaim_work);
> >> >> +   return -EBUSY;
> >> >> +   }
> >> >> +
> >> >> +   if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
> >> >> +   /* All pages were too young to reclaim, try 
> >> >> again a little later  
> >> */
> >> >> +   schedule();
> >> >
> >> > This will be total pain to backtrack after a while when something
> >> > needs to be changed so there definitely should be inlin

Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-13 Thread Jarkko Sakkinen

On Tue Feb 13, 2024 at 1:15 AM EET, Haitao Huang wrote:
> Hi Jarkko
>
> On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen   
> wrote:
>
> > On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
> >> From: Kristen Carlson Accardi 
> >>
> >> When the EPC usage of a cgroup is near its limit, the cgroup needs to
> >> reclaim pages used in the same cgroup to make room for new allocations.
> >> This is analogous to the behavior that the global reclaimer is triggered
> >> when the global usage is close to total available EPC.
> >>
> >> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
> >> whether synchronous reclaim is allowed or not. And trigger the
> >> synchronous/asynchronous reclamation flow accordingly.
> >>
> >> Note at this point, all reclaimable EPC pages are still tracked in the
> >> global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
> >> is activated yet.
> >>
> >> Co-developed-by: Sean Christopherson 
> >> Signed-off-by: Sean Christopherson 
> >> Signed-off-by: Kristen Carlson Accardi 
> >> Co-developed-by: Haitao Huang 
> >> Signed-off-by: Haitao Huang 
> >> ---
> >> V7:
> >> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> >> ---
> >>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 --
> >>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  4 ++--
> >>  arch/x86/kernel/cpu/sgx/main.c   |  2 +-
> >>  3 files changed, 27 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c  
> >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> index d399fda2b55e..abf74fdb12b4 100644
> >> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> @@ -184,13 +184,35 @@ static void  
> >> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
> >>  /**
> >>   * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC  
> >> page
> >>   * @epc_cg:   The EPC cgroup to be charged for the page.
> >> + * @reclaim:  Whether or not synchronous reclaim is allowed
> >>   * Return:
> >>   * * %0 - If successfully charged.
> >>   * * -errno - for failures.
> >>   */
> >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool  
> >> reclaim)
> >>  {
> >> -  return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> >> +  for (;;) {
> >> +  if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> >> +  PAGE_SIZE))
> >> +  break;
> >> +
> >> +  if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
> >> +  return -ENOMEM;
> >> + +if (signal_pending(current))
> >> +  return -ERESTARTSYS;
> >> +
> >> +  if (!reclaim) {
> >> +  queue_work(sgx_epc_cg_wq, _cg->reclaim_work);
> >> +  return -EBUSY;
> >> +  }
> >> +
> >> +  if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
> >> +  /* All pages were too young to reclaim, try again a 
> >> little later */
> >> +  schedule();
> >
> > This will be total pain to backtrack after a while when something
> > needs to be changed so there definitely should be inline comments
> > addressing each branch condition.
> >
> > I'd rethink this as:
> >
> > 1. Create static __sgx_epc_cgroup_try_charge() for addressing single
> >iteration with the new "reclaim" parameter.
> > 2. Add a new sgx_epc_group_try_charge_reclaim() function.
> >
> > There's a bit of redundancy with sgx_epc_cgroup_try_charge() and
> > sgx_epc_cgroup_try_charge_reclaim() because both have almost the
> > same loop calling internal __sgx_epc_cgroup_try_charge() with
> > different parameters. That is totally acceptable.
> >
> > Please also add my suggested-by.
> >
> > BR, Jarkko
> >
> > BR, Jarkko
> >
> For #2:
> The only caller of this function, sgx_alloc_epc_page(), has the same  
> boolean which is passed into this this function.

I know. This would be good opportunity to fix that up. Large patch
sets should try to make the space for its feature best possible and
thus also clean up the code base overally.

> If we separate it into sgx_epc_cgroup_try_charge() and  
> sgx_epc_cgroup_try_charge_reclaim(), then the caller has to have the  
> if/else branches. So separation here seems not help?

Of course it does. It makes the code in that location self-documenting
and easier to remember what it does.

BR, Jarkko

Re: [PATCH v2] x86/sgx: fix kernel-doc comment misuse

2024-02-12 Thread Jarkko Sakkinen

On Sun Feb 11, 2024 at 8:24 AM EET, Randy Dunlap wrote:
> Don't use "/**" for a non-kernel-doc comment. This prevents a warning
> from scripts/kernel-doc:
>
> main.c:740: warning: expecting prototype for A section metric is concatenated 
> in a way that @low bits 12(). Prototype was for sgx_calc_section_metric() 
> instead
>
> Cc: Jarkko Sakkinen 
> Cc: Dave Hansen 
> Cc: linux-...@vger.kernel.org
> Cc: x...@kernel.org
> Reviewed-by: Kai Huang 
> Signed-off-by: Randy Dunlap 
> ---
> v2: add Rev-by: Kai Huang
>
>  arch/x86/kernel/cpu/sgx/main.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff -- a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -731,7 +731,7 @@ out:
>   return 0;
>  }
>  
> -/**
> +/*
>   * A section metric is concatenated in a way that @low bits 12-31 define the
>   * bits 12-31 of the metric and @high bits 0-19 define the bits 32-51 of the
>   * metric.


Reviewed-by: Jarkko Sakkinen 

BR, Jarkko

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 8073 matches

Mail list logo