Hi Pingfan, Hi Alexei, sorry for the late reply.
On Thu, 29 May 2025 12:17:38 +0800 Pingfan Liu <pi...@redhat.com> wrote: > In the security kexec_file_load case, the buffer which holds the kernel > image is invisible to the userspace. > > The common data flow in bpf scheme is from kernel to bpf-prog. In the > case of kexec_file_load, the kexec component needs to buffer the parsed > result by bpf-prog (opposite the usual direction) to the next stage > parsing. bpf_kexec_carrier() makes the opposite data flow possible. A > bpf-prog can publish the parsed payload address to the kernel, and the > latter can copy them for future use. > > Signed-off-by: Pingfan Liu <pi...@redhat.com> > Cc: Alexei Starovoitov <a...@kernel.org> > Cc: Daniel Borkmann <dan...@iogearbox.net> > Cc: John Fastabend <john.fastab...@gmail.com> > Cc: Andrii Nakryiko <and...@kernel.org> > Cc: Martin KaFai Lau <martin....@linux.dev> > Cc: Eduard Zingerman <eddy...@gmail.com> > Cc: Song Liu <s...@kernel.org> > Cc: Yonghong Song <yonghong.s...@linux.dev> > Cc: KP Singh <kpsi...@kernel.org> > Cc: Stanislav Fomichev <s...@fomichev.me> > Cc: Hao Luo <hao...@google.com> > Cc: Jiri Olsa <jo...@kernel.org> > To: b...@vger.kernel.org > --- > include/linux/bpf.h | 23 +++++ > kernel/bpf/Makefile | 2 +- > kernel/bpf/helpers.c | 2 + > kernel/bpf/helpers_carrier.c | 194 +++++++++++++++++++++++++++++++++++ > 4 files changed, 220 insertions(+), 1 deletion(-) > create mode 100644 kernel/bpf/helpers_carrier.c > [...] > diff --git a/kernel/bpf/helpers_carrier.c b/kernel/bpf/helpers_carrier.c > new file mode 100644 > index 0000000000000..c4e45fdf0ebb8 > --- /dev/null > +++ b/kernel/bpf/helpers_carrier.c > @@ -0,0 +1,194 @@ [...] > +__bpf_kfunc int bpf_mem_range_result_put(struct mem_range_result *result) I'm concerned about the use of kfuncs for our use case. I don't believe they provide the stability we need. With kexec we deal with two different kernels. The 1st kernel, aka. the one that executes kexec to load the 2nd kernel, and the 2nd kernel that is being loaded. In general both kernels are built from different versions with different configs and it is expected that kexec works even when both kernels are years apart. The problem is that in our design the bpf-prog is part of the image of and built from the sources of the 2nd kernel, but runs in the 1st kernel. So the definitions of the kfuncs in both kernels have to match. What makes it worse is that for it to work with secure boot the kernel image, including the bpf-prog, needs to be signed. Which means that the bpf-prog is fixed after build and can no longer be updated. All in all I'm afraid we need a uapi-like stability for those kfuncs for our design to work. Do you have any comments on my concern? Or any idea how we could archive the stability despite using kfuncs? Thanks Philipp > +{ > + return mem_range_result_put(result); > +} > + > +/* > + * Cache the content in @buf into kernel > + */ > +__bpf_kfunc int bpf_copy_to_kernel(const char *name, char *buf, int size) > +{ > + struct mem_range_result *range; > + struct mem_cgroup *memcg, *old_memcg; > + struct str_listener *item; > + resource_handler handler; > + bool kmalloc; > + char *kbuf; > + int id, ret = 0; > + > + id = srcu_read_lock(&srcu); > + item = find_listener(name); > + if (!item) { > + srcu_read_unlock(&srcu, id); > + return -EINVAL; > + } > + kmalloc = item->kmalloc; > + handler = item->handler; > + srcu_read_unlock(&srcu, id); > + memcg = get_mem_cgroup_from_current(); > + old_memcg = set_active_memcg(memcg); > + range = kmalloc(sizeof(struct mem_range_result), GFP_KERNEL); > + if (!range) { > + pr_err("fail to allocate mem_range_result\n"); > + ret = -ENOMEM; > + goto err; > + } > + > + kref_init(&range->ref); > + if (item->kmalloc) > + kbuf = kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT); > + else > + kbuf = __vmalloc(size, GFP_KERNEL | __GFP_ACCOUNT); > + if (!kbuf) { > + kfree(range); > + ret = -ENOMEM; > + goto err; > + } > + ret = copy_from_kernel_nofault(kbuf, buf, size); > + if (unlikely(ret < 0)) { > + kfree(range); > + if (item->kmalloc) > + kfree(kbuf); > + else > + vfree(kbuf); > + ret = -EINVAL; > + goto err; > + } > + range->kmalloc = item->kmalloc; > + range->buf = kbuf; > + range->buf_sz = size; > + range->data_sz = size; > + range->memcg = memcg; > + mem_cgroup_tryget(memcg); > + range->status = 0; > + ret = handler(name, range); > + mem_range_result_put(range); > +err: > + set_active_memcg(old_memcg); > + mem_cgroup_put(memcg); > + return ret; > +} > + > +int register_carrier_listener(struct carrier_listener *listener) > +{ > + struct str_listener *item; > + unsigned int hash; > + int ret; > + > + if (!listener->name) > + return -EINVAL; > + item = kmalloc(sizeof(*item), GFP_KERNEL); > + if (!item) > + return -ENOMEM; > + item->str = kstrdup(listener->name, GFP_KERNEL); > + if (!item->str) { > + kfree(item); > + return -ENOMEM; > + } > + item->handler = listener->handler; > + item->kmalloc = listener->kmalloc; > + hash = jhash(item->str, strlen(item->str), 0); > + mutex_lock(&str_listeners_mutex); > + if (!find_listener(item->str)) { > + hash_add(str_listeners, &item->node, hash); > + } else { > + kfree(item->str); > + kfree(item); > + ret = -EBUSY; > + } > + mutex_unlock(&str_listeners_mutex); > + > + return ret; > +} > +EXPORT_SYMBOL(register_carrier_listener); > + > +int unregister_carrier_listener(char *str) > +{ > + struct str_listener *item; > + int ret = 0; > + > + mutex_lock(&str_listeners_mutex); > + item = find_listener(str); > + if (!!item) > + hash_del(&item->node); > + else > + ret = -EINVAL; > + mutex_unlock(&str_listeners_mutex); > + > + return ret; > +} > +EXPORT_SYMBOL(unregister_carrier_listener); > +