On Tue, Jul 01, 2025 at 01:01:27PM -0600, Jonathan Corbet wrote:
[Adding some of the ELISA folks, who are working in a related area and
might have thoughts on this.  You can find the patch series under
discussion at:

 https://lore.kernel.org/all/20250624180742.5795-1-sas...@kernel.org ]

Yup, we all met at OSS and reached the conclusion that we should lean
towards a machine readable spec, which we thought was closer to my
proposal than the kerneldoc work.

However, with your suggestion, I think it makes more sense to go back to
kerneldoc as that can be made machine readable.

In theory, all of that will let us have something like the following in
kerneldoc:

- @api-type: syscall
- @api-version: 1
- @context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
- @param-type: family, KAPI_TYPE_INT
- @param-flags: family, KAPI_PARAM_IN
- @param-range: family, 0, 45
- @param-mask: type, SOCK_TYPE_MASK | SOCK_CLOEXEC | SOCK_NONBLOCK
- @error-code: -EAFNOSUPPORT, "Address family not supported"
- @error-condition: -EAFNOSUPPORT, "family < 0 || family >= NPROTO"
- @capability: CAP_NET_RAW, KAPI_CAP_GRANT_PERMISSION
- @capability-allows: CAP_NET_RAW, "Create SOCK_RAW sockets"
- @since: 2.0
- @return-type: KAPI_TYPE_FD
- @return-check: KAPI_RETURN_ERROR_CHECK
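
To make "machine readable" a bit more concrete, here is a minimal sketch of how tags in the form listed above could be collected by a script. It is only an illustration: the "- @name: value" syntax is taken from the list above, while parse_tags() and TAG_RE are hypothetical names and not part of the existing kernel-doc tooling.

```python
import re
from collections import defaultdict

# Assumed tag syntax, copied from the list above: "- @name: value".
TAG_RE = re.compile(r"^\s*-?\s*@(?P<name>[a-z][a-z-]*):\s*(?P<value>.*)$")

def parse_tags(text):
    """Collect every '@name: value' line into {name: [value, ...]}."""
    tags = defaultdict(list)
    for line in text.splitlines():
        match = TAG_RE.match(line)
        if match:
            tags[match.group("name")].append(match.group("value").strip())
    return dict(tags)

SAMPLE = """\
- @api-type: syscall
- @context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
- @param-type: family, KAPI_TYPE_INT
- @param-range: family, 0, 45
- @error-code: -EAFNOSUPPORT, "Address family not supported"
"""

tags = parse_tags(SAMPLE)
print(tags["param-range"])   # ['family, 0, 45']
```

Values are kept as lists because tags such as param-type repeat once per parameter.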

How does it sound? I'm pretty excited about the possibility to align this
with kerneldoc. Please poke holes in the plan :)

I think we could do it without all the @ signs.  We'd also want to see
how well we could integrate that information with the minimal structure
we already have: getting the return-value information into the Returns:
section, for example, and tying the parameter constraints to the
parameter descriptions we already have.
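
As a rough illustration of that integration, the sketch below renders structured return information into a kerneldoc "Return:" list, using the "* %VALUE - text" list style that the kernel-doc guide documents for return values. The function name render_return_section() and the choice of input fields are assumptions made for this example only.

```python
def render_return_section(success_value, errors):
    """Render a kerneldoc 'Return:' list from structured fields.

    errors is a list of (code, summary) pairs, e.g. ("-ENOMEM", "...").
    """
    lines = [" * Return:", f" * * %{success_value} - success"]
    for code, summary in errors:
        lines.append(f" * * %{code} - {summary}")
    return "\n".join(lines)

section = render_return_section(
    "0",
    [("-ENOMEM", "Address range issue"),
     ("-EPERM", "Insufficient privileges")],
)
print(section)
```

The same data could feed both the rendered documentation and a generated spec, so the return values would only be written down once.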

Right!

So I have a proof of concept which, during the build process, creates
.apispec.h files that are generated from kerneldoc and contain macros
identical to the ones in my RFC.
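
To give a feel for the generation step, here is a simplified sketch, not the proof of concept itself. The EXAMPLE_* macro names are placeholders invented for this illustration (the real macros are the ones from the RFC), and emit_spec_header() is a hypothetical helper.

```python
def emit_spec_header(func, tags):
    """Turn (tag, value) pairs into macro lines for a generated header.

    The EXAMPLE_* macro names are placeholders, not the RFC's macros.
    """
    lines = [
        f"/* Generated from the kerneldoc comment of {func}(). */",
        f"EXAMPLE_SPEC_BEGIN({func})",
    ]
    for name, value in tags:
        # "context-flags" becomes EXAMPLE_CONTEXT_FLAGS(...), and so on.
        macro = "EXAMPLE_" + name.upper().replace("-", "_")
        lines.append(f"\t{macro}({value})")
    lines.append("EXAMPLE_SPEC_END")
    return "\n".join(lines) + "\n"

header = emit_spec_header("sys_mlock", [
    ("context-flags", "KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE"),
    ("return-type", "KAPI_TYPE_INT"),
    ("return-success", "0"),
])
print(header, end="")
```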

Here's an example of the sys_mlock() spec:

/**
 * sys_mlock - Lock pages in memory
 * @start: Starting address of memory range to lock
 * @len: Length of memory range to lock in bytes
 *
 * Locks pages in the specified address range into RAM, preventing them from
 * being paged to swap. Requires CAP_IPC_LOCK capability or RLIMIT_MEMLOCK
 * resource limit.
 *
 * long-desc: Locks pages in the specified address range into RAM, preventing
 *   them from being paged to swap. Requires CAP_IPC_LOCK capability
 *   or RLIMIT_MEMLOCK resource limit.
 * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
 * param-type: start, KAPI_TYPE_UINT
 * param-flags: start, KAPI_PARAM_IN
 * param-constraint-type: start, KAPI_CONSTRAINT_NONE
 * param-constraint: start, Rounded down to page boundary
 * param-type: len, KAPI_TYPE_UINT
 * param-flags: len, KAPI_PARAM_IN
 * param-constraint-type: len, KAPI_CONSTRAINT_RANGE
 * param-range: len, 0, LONG_MAX
 * param-constraint: len, Rounded up to page boundary
 * return-type: KAPI_TYPE_INT
 * return-check-type: KAPI_RETURN_ERROR_CHECK
 * return-success: 0
 * error-code: -ENOMEM, ENOMEM, Address range issue,
 *   Some of the specified range is not mapped, has unmapped gaps,
 *   or the lock would cause the number of mapped regions to exceed the limit.
 * error-code: -EPERM, EPERM, Insufficient privileges,
 *   The caller is not privileged (no CAP_IPC_LOCK) and RLIMIT_MEMLOCK is 0.
 * error-code: -EINVAL, EINVAL, Address overflow,
 *   The result of the addition start+len was less than start (arithmetic overflow).
 * error-code: -EAGAIN, EAGAIN, Some or all memory could not be locked,
 *   Some or all of the specified address range could not be locked.
 * error-code: -EINTR, EINTR, Interrupted by signal,
 *   The operation was interrupted by a fatal signal before completion.
 * error-code: -EFAULT, EFAULT, Bad address,
 *   The specified address range contains invalid addresses that cannot be accessed.
 * since-version: 2.0
 * lock: mmap_lock, KAPI_LOCK_RWLOCK
 * lock-acquired: true
 * lock-released: true
 * lock-desc: Process memory map write lock
 * signal: FATAL
 * signal-direction: KAPI_SIGNAL_RECEIVE
 * signal-action: KAPI_SIGNAL_ACTION_RETURN
 * signal-condition: Fatal signal pending
 * signal-desc: Fatal signals (SIGKILL) can interrupt the operation at two points:
 *   when acquiring mmap_write_lock_killable() and during page population
 *   in __mm_populate(). Returns -EINTR. Non-fatal signals do NOT interrupt
 *   mlock - the operation continues even if SIGINT/SIGTERM are received.
 * signal-error: -EINTR
 * signal-timing: KAPI_SIGNAL_TIME_DURING
 * signal-priority: 0
 * signal-interruptible: yes
 * signal-state-req: KAPI_SIGNAL_STATE_RUNNING
 * examples: mlock(addr, 4096);  // Lock one page
 *   mlock(addr, len);   // Lock range of pages
 * notes: Memory locks do not stack - multiple calls on the same range can be
 *   undone by a single munlock. Locks are not inherited by child processes.
 *   Pages are locked on whole page boundaries. Commonly used by real-time
 *   applications to prevent page faults during time-critical operations.
 *   Also used for security to prevent sensitive data (e.g., cryptographic keys)
 *   from being written to swap. Note: locked pages may still be saved to
 *   swap during system suspend/hibernate.
 *
 *   Tagged addresses are automatically handled via untagged_addr(). The operation
 *   occurs in two phases: first VMAs are marked with VM_LOCKED, then pages are
 *   populated into memory. When checking RLIMIT_MEMLOCK, the kernel optimizes
 *   by recounting locked memory to avoid double-counting overlapping regions.
 * side-effect: KAPI_EFFECT_MODIFY_STATE | KAPI_EFFECT_ALLOC_MEMORY, process memory, Locks pages into physical memory, preventing swapping, reversible=yes
 * side-effect: KAPI_EFFECT_MODIFY_STATE, mm->locked_vm, Increases process locked memory counter, reversible=yes
 * side-effect: KAPI_EFFECT_ALLOC_MEMORY, physical pages, May allocate and populate page table entries, condition=Pages not already present, reversible=yes
 * side-effect: KAPI_EFFECT_MODIFY_STATE | KAPI_EFFECT_ALLOC_MEMORY, page faults, Triggers page faults to bring pages into memory, condition=Pages not already resident
 * side-effect: KAPI_EFFECT_MODIFY_STATE, VMA splitting, May split existing VMAs at lock boundaries, condition=Lock range partially overlaps existing VMA
 * state-trans: memory pages, swappable, locked in RAM, Pages become non-swappable and pinned in physical memory
 * state-trans: VMA flags, unlocked, VM_LOCKED set, Virtual memory area marked as locked
 * capability: CAP_IPC_LOCK, KAPI_CAP_BYPASS_CHECK, CAP_IPC_LOCK capability
 * capability-allows: Lock unlimited amount of memory (no RLIMIT_MEMLOCK enforcement)
 * capability-without: Must respect RLIMIT_MEMLOCK resource limit
 * capability-condition: Checked when RLIMIT_MEMLOCK is 0 or locking would exceed limit
 * capability-priority: 0
 * constraint: RLIMIT_MEMLOCK Resource Limit, The RLIMIT_MEMLOCK soft resource limit specifies the maximum bytes of memory that may be locked into RAM. Unprivileged processes are restricted to this limit. CAP_IPC_LOCK capability allows bypassing this limit entirely. The limit is enforced per-process, not per-user.
 * constraint-expr: RLIMIT_MEMLOCK Resource Limit, locked_memory + request_size <= RLIMIT_MEMLOCK || CAP_IPC_LOCK
 * constraint: Memory Pressure and OOM, Locking large amounts of memory can cause system-wide memory pressure and potentially trigger the OOM killer. The kernel does not prevent locking memory that would destabilize the system.
 * constraint: Special Memory Areas, Some memory types cannot be locked or are silently skipped: VM_IO/VM_PFNMAP areas (device mappings) are skipped; Hugetlb pages are inherently pinned and skipped; DAX mappings are always present in memory and skipped; Secret memory (memfd_secret) mappings are skipped; VM_DROPPABLE memory cannot be locked and is skipped; Gate VMA (kernel entry point) is skipped; VM_LOCKED areas are already locked. These special areas are silently excluded without error.
 *
 * Context: Process context. May sleep. Takes mmap_lock for write.
 *
 * Return: 0 on success, negative error code on failure
 */
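
For anyone who wants to experiment with the format, the following is a rough sketch of reading the "key: value" lines from a comment like the one above, including the indented continuation lines used by error-code. It assumes that keys are lowercase words joined by hyphens and that continuation lines are indented by at least three spaces after the "*"; parse_spec_lines() is a hypothetical helper, not the proof-of-concept code.

```python
import re

# Assumption: spec keys are lowercase words joined by hyphens.
KEY_RE = re.compile(r"^([a-z][a-z-]*):\s*(.*)$")

def parse_spec_lines(comment):
    """Return [key, value] pairs in source order, folding continuations."""
    entries = []
    current = None
    for raw in comment.splitlines():
        body = re.sub(r"^\s*\*", "", raw)   # drop the leading " *"
        text = body.strip()
        if not text:
            current = None                   # blank line ends an entry
        elif body.startswith("   ") and current is not None:
            current[1] += " " + text         # indented continuation line
        else:
            match = KEY_RE.match(text)
            current = [match.group(1), match.group(2)] if match else None
            if current is not None:
                entries.append(current)
    return entries

SAMPLE = """\
/**
 * sys_mlock - Lock pages in memory
 * @len: Length of memory range to lock in bytes
 *
 * return-success: 0
 * error-code: -ENOMEM, ENOMEM, Address range issue,
 *   Some of the specified range is not mapped, has unmapped gaps,
 *   or the lock would cause the number of mapped regions to exceed the limit.
 * error-code: -EPERM, EPERM, Insufficient privileges,
 *   The caller is not privileged (no CAP_IPC_LOCK) and RLIMIT_MEMLOCK is 0.
 *
 * Return: 0 on success, negative error code on failure
 */
"""

entries = parse_spec_lines(SAMPLE)
# Each error-code value has four comma-separated fields; the last may contain commas.
error_codes = [
    [field.strip() for field in value.split(",", 3)]
    for key, value in entries
    if key == "error-code"
]
for code, name, summary, detail in error_codes:
    print(code, "-", summary)
```

Existing kerneldoc lines such as "@len:", "Context:" and "Return:" are ignored here because they do not match the lowercase key pattern.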

The other thing I would really like to see, to the extent we can, is
that a bunch of patches adding all this data to the source will actually
be accepted by the relevant maintainers.  It would be a shame to get all
this infrastructure into place, then have things stall out due to
maintainer pushback.  Maybe you should start by annotating the
scheduler-related system calls; if that works the rest should be a piece
of cake :)

In the RFC I've sent out I've specced out APIs from different subsystems
to solicit some feedback on those, but so far it's been quiet.

I'll resend a "lean" RFC v3 with just the base macro spec infra +
kerneldoc support + "trickier" sched API + "trickier" mm API.

I'm thinking that if it's still quiet in a month or two I'll propose a
talk at LPC around it, or maybe try and get feedback/consensus during
the Maintainers Summit.

But yes, it doesn't make sense to take it in until we have an ack from a
few larger subsystems.

--
Thanks,
Sasha
