Hi John,
I have started doing reviews based on design documents for new features. In
general, I think it is rather hard for humans to reverse-engineer the design
from a patch series. With AI this has become easier, but the result should
still be verified by the author. Could you check whether the attached
AI-generated document is correct?
Thanks,
Bernd
================================================================================
famfs (FUSE-based fabric-attached memory file system) - Design Document
================================================================================
Audience and scope
==================
This document is written for people already familiar with FUSE (lowlevel ops,
opcodes, INIT capability negotiation) but NOT necessarily with Linux DAX,
devdax, or the kernel's iomap framework. Section 2 is a primer on those.
It covers two trees:
Kernel:  /home/bernd/src/linux/linux.git, branch `famfs`,
         commits 4a8ae428c392 .. HEAD (da9edf77cbc4)
libfuse: /home/bernd/src/libfuse/libfuse.git, branch `famfs`,
         commits d75ae2ee .. HEAD (9c65d781)
Kernel files added or changed:
fs/fuse/famfs.c - new, all famfs kernel logic
fs/fuse/famfs_kfmap.h - new, in-memory fmap structures
fs/fuse/fuse_i.h - per-inode/per-conn famfs additions, helpers
fs/fuse/file.c - r/w/mmap dispatch into famfs paths
fs/fuse/inode.c - INIT-flag negotiation, conn teardown wiring
fs/fuse/iomode.c - bypass io-modes for famfs files
fs/fuse/Kconfig, Makefile - new CONFIG_FUSE_FAMFS_DAX
include/uapi/linux/fuse.h - new opcodes, structs, FUSE_DAX_FMAP flag
fs/namei.c - export may_open_dev()
Documentation/filesystems/famfs.rst - user/admin documentation
libfuse files added or changed:
include/fuse_kernel.h - mirror of kernel uapi at protocol 7.46
include/fuse_common.h - new FUSE_CAP_DAX_FMAP capability bit
include/fuse_lowlevel.h - new ops: get_fmap(), get_daxdev()
lib/fuse_lowlevel.c - INIT negotiation + opcode dispatch (do_get_fmap, do_get_daxdev)
--------------------------------------------------------------------------------
1. Background and goals
--------------------------------------------------------------------------------
Famfs exposes shared, fabric-attached memory (CXL devdax) as a regular
filesystem. The fast path (read/write/mmap-fault) must reach memory without a
round trip to the FUSE server: the server only delivers metadata.
Two key observations shape the design:
* Files are NEVER allocated in the kernel. Userspace pre-allocates extents
and gives the kernel an "fmap" (file-to-dax-offset map).
* There is NO writeback. Backing memory is the storage; CPU caches are
loaded directly from the dax memory.
Consequences in the kernel:
* No page cache is used. `noop_dirty_folio` is the only address_space op.
* The kernel never grows or shrinks files. Any size change (including
truncate) puts the file into an "error" state.
* Reads/writes/mmap dispatch through `dax_iomap_*()` and the famfs
`iomap_ops`, exactly the way fs-dax filesystems (xfs/ext4) plumb them.
Comparison to other FUSE modes that you may know:
    classic FUSE   - every read/write/mmap is forwarded to the server.
    virtio-fs DAX  - the server donates a window of host memory; kernel maps
                     file ranges into that window via FUSE_SETUPMAPPING /
                     FUSE_REMOVEMAPPING. The server is still the "owner" of
                     the backing memory.
    famfs (this)   - the server hands the kernel a description of where each
                     file's bytes live on a real character device (devdax).
                     After that, the server is OUT of the data path entirely.
--------------------------------------------------------------------------------
2. Primer: devdax, DAX and iomap (only what's needed below)
--------------------------------------------------------------------------------
You can skip this section if "iomap_begin", "dax_iomap_rw" and "devdax holder"
already mean something to you.
devdax
A character device (`/dev/daxN.M`) that exposes a contiguous range of
physical memory directly to userspace via mmap. There is no page cache
and no block device underneath; reads and writes hit RAM/CXL memory
directly. Famfs uses devdax devices as its "disks".
DAX (Direct Access)
A kernel pathway that lets a filesystem map file pages straight onto
the underlying memory pages (PFNs) without going through the page cache.
A file/inode tagged with `S_DAX` opts in. Reads turn into memcpy from
the memory; mmap faults install the memory's PFN directly into the page
table (PTE/PMD/PUD).
iomap
A filesystem-agnostic mechanism that says "to do this read/write/fault
on this file at this offset and length, here is exactly which device,
which device-relative offset, and how many bytes are valid here."
Filesystems implement `struct iomap_ops`, of which the central callback
is:
.iomap_begin(inode, file_offset, length, flags,
struct iomap *out, struct iomap *srcmap)
The filesystem fills `out` with:
out->dax_dev - which DAX device backs this range
out->addr - byte offset within that DAX device
out->offset - file offset (echoed back)
out->length - how many contiguous bytes are valid here
out->type - IOMAP_MAPPED (famfs only ever returns this)
The DAX core then loops, calling `iomap_begin` repeatedly to walk the
requested range and, for each chunk, doing either:
- memcpy to/from `dax_dev + addr` (read/write)
- or installing the PFN at `dax_dev + addr` into a page table (faults)
Entry points the famfs code uses:
dax_iomap_rw(iocb, iter, ops) - read/write
dax_iomap_fault(vmf, order, ..., ops) - mmap PTE/PMD/PUD fault
dax holder
DAX devices have a single "holder" - a struct (here `struct fuse_conn *`)
that owns the device. Acquired via `fs_dax_get(devp, holder, holder_ops)`,
released via `fs_put_dax(devp, holder)`. The holder gets called back via
`holder_ops->notify_failure()` when the device reports memory poison.
That is the entire iomap-related vocabulary used in this document.
--------------------------------------------------------------------------------
3. Major kernel data structures
--------------------------------------------------------------------------------
(a) fuse_conn additions (fs/fuse/fuse_i.h):
struct fuse_conn {
...
unsigned int famfs_iomap : 1; /* negotiated at INIT */
struct rw_semaphore famfs_devlist_sem; /* protects dax_devlist */
struct famfs_dax_devlist *dax_devlist; /* table of daxdevs */
};
(b) fuse_inode additions:
struct fuse_inode {
...
void *famfs_meta; /* struct famfs_file_meta *, NULL if not famfs */
};
A non-NULL `famfs_meta` is the marker for "this is a famfs file";
`fuse_file_famfs(fi)` is just `READ_ONCE(fi->famfs_meta) != NULL`.
(c) Per-file metadata - struct famfs_file_meta (famfs_kfmap.h):
+------------------------------------------------------+
| struct famfs_file_meta |
| bool error |
| enum famfs_file_type file_type |
| size_t file_size |
| enum famfs_extent_type fm_extent_type |
| u64 dev_bitmap |
| union { |
| SIMPLE: |
| size_t fm_nextents |
| struct famfs_meta_simple_ext *se |
| INTERLEAVED: |
| size_t fm_niext |
| struct famfs_meta_interleaved_ext *ie |
| } |
+------------------------------------------------------+
Simple extent: (dev_index, ext_offset, ext_len)
Interleaved extent: (nstrips, chunk_size, nbytes, strips[])
where each strip is a simple extent.
(d) Per-conn dax device table:
    +-------------------------+      +------------------------------+
    | famfs_dax_devlist       |      | famfs_daxdev[MAX_DAXDEVS=24] |
    |   nslots = MAX_DAXDEVS  |      |                              |
    |   ndevs                 |      | [0] valid? devp, devno, ...  |
    |   devlist *-------------|----->| [1] valid? devp, devno, ...  |
    +-------------------------+      | ...                          |
                                     +------------------------------+
famfs_daxdev fields:
valid - slot has been populated (after wmb)
error - dax notify_failure() arrived (poison)
dax_err - fs_dax_get() failed; cannot be used
devno, devp - dev_t and dax_device pointer
name - chrdev pathname for diagnostics
--------------------------------------------------------------------------------
4. Capability negotiation (FUSE INIT)
--------------------------------------------------------------------------------
The wire-level capability is `FUSE_DAX_FMAP` (bit 43 in the 64-bit flags
field, protocol 7.46). Both ends must advertise it in INIT for the kernel
to enable famfs.
    Kernel                                  FUSE server (libfuse)
    ------                                  ---------------------
    fuse_new_init():
      flags |= FUSE_DAX_FMAP
        -- if capable(CAP_SYS_RAWIO)

           ----------- FUSE_INIT (in) --------->

                                            libfuse _do_init():
                                              if (inargflags & FUSE_DAX_FMAP)
                                                conn.capable_ext |=
                                                    FUSE_CAP_DAX_FMAP
                                            server's init_done CB sets:
                                              conn.want_ext |=
                                                  FUSE_CAP_DAX_FMAP
                                            libfuse converts that back to
                                            FUSE_DAX_FMAP in outargflags

           <----------- FUSE_INIT (out) ----------

    process_init_reply():
      if reply.flags & FUSE_DAX_FMAP &&
         in.flags also had FUSE_DAX_FMAP:
        famfs_init_devlist_sem(fc)
        fc->famfs_iomap = 1
Both directions must agree. The kernel re-checks the flag in `in.flags` on the
reply path because process_init_reply() does not run in the server's task
context, so capable() cannot be re-evaluated then; the bit on the way OUT
asserts "the user that mounted us had CAP_SYS_RAWIO".
Kernel file: fs/fuse/inode.c (fuse_new_init, process_init_reply)
libfuse file: lib/fuse_lowlevel.c (_do_init)
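The two-sided agreement can be modeled in plain C. This is a minimal userspace
sketch, not kernel or libfuse code: the function names and the `has_rawio`
parameter are hypothetical, the bit positions are the ones quoted in this
document (bit 43 on the wire, bit 32 for the libfuse capability), and the
server model assumes that a `want_ext` bit only takes effect when the same bit
is present in `capable_ext`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wire-level flag; bit position as quoted in this section. */
#define FUSE_DAX_FMAP      (1ULL << 43)
/* libfuse capability bit; value as quoted in section 5.1. */
#define FUSE_CAP_DAX_FMAP  (1ULL << 32)

static_assert(FUSE_DAX_FMAP != FUSE_CAP_DAX_FMAP,
              "wire flag and libfuse capability bit are different bits");

/* Kernel, request path: advertise only for a privileged mounter.
 * has_rawio stands in for capable(CAP_SYS_RAWIO). */
static uint64_t kernel_init_flags(uint64_t base_flags, bool has_rawio)
{
    return has_rawio ? (base_flags | FUSE_DAX_FMAP) : base_flags;
}

/* Server: wire flag -> capable_ext, server choice -> want_ext -> wire flag. */
static uint64_t server_reply_flags(uint64_t in_flags, bool server_wants_famfs)
{
    uint64_t capable_ext = 0, want_ext = 0, out_flags = 0;

    if (in_flags & FUSE_DAX_FMAP)
        capable_ext |= FUSE_CAP_DAX_FMAP;
    if (server_wants_famfs && (capable_ext & FUSE_CAP_DAX_FMAP))
        want_ext |= FUSE_CAP_DAX_FMAP;
    if (want_ext & FUSE_CAP_DAX_FMAP)
        out_flags |= FUSE_DAX_FMAP;
    return out_flags;
}

/* Kernel, reply path: enable famfs only if BOTH directions carried the bit. */
static bool kernel_enable_famfs(uint64_t in_flags, uint64_t reply_flags)
{
    return (in_flags & FUSE_DAX_FMAP) && (reply_flags & FUSE_DAX_FMAP);
}
```

Checking `in_flags` on the reply path means that a server which sets the bit
without having been offered it does not enable famfs.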
--------------------------------------------------------------------------------
5. libfuse server-side surface
--------------------------------------------------------------------------------
This is the userspace API a famfs server is expected to implement on top of
libfuse's lowlevel API. From a FUSE-developer point of view this is the
familiar pattern: two new opcodes, two new callbacks in
`struct fuse_lowlevel_ops`, and a new capability bit.
5.1 Capability bit (include/fuse_common.h)
#define FUSE_CAP_DAX_FMAP (1UL << 32)
This sits in the *extended* capability fields `want_ext` / `capable_ext`,
not the legacy 32-bit `want` / `capable`, because bit 32 is past the end
of the original word.
5.2 New lowlevel callbacks (include/fuse_lowlevel.h)
struct fuse_lowlevel_ops {
...
/* Reply: serialized fuse_famfs_fmap_header followed by extents */
void (*get_fmap) (fuse_req_t req, fuse_ino_t ino, size_t size);
/* Reply: serialized fuse_daxdev_out (mainly: char name[256]) */
void (*get_daxdev) (fuse_req_t req, int daxdev_index);
};
Conventional libfuse semantics apply:
- The callback may reply asynchronously.
- Valid completions: fuse_reply_buf() with the serialized response,
or fuse_reply_err(req, errno) on failure.
- If the server does not provide either op, libfuse replies with
-EOPNOTSUPP automatically.
5.3 Opcode dispatch (lib/fuse_lowlevel.c)
Two entries are added to libfuse's opcode dispatch table:
[FUSE_GET_FMAP] = { do_get_fmap, "GET_FMAP" },
[FUSE_GET_DAXDEV] = { do_get_daxdev, "GET_DAXDEV" },
do_get_fmap:
reads `inarg` as `struct fuse_getxattr_in *`, extracts `arg->size`
(the kernel's hint for the maximum reply size it can accept),
forwards (req, ino, size) to op.get_fmap. The size is currently
fixed at PAGE_SIZE on the kernel side (FMAP_BUFSIZE); a larger
variable-size reply protocol is a future TODO.
do_get_daxdev:
ignores `inarg`. The kernel encodes the device index in `nodeid`
(FUSE_GET_DAXDEV uses nodeid as a small integer, not a real inode),
and libfuse forwards it as `daxdev_index` to op.get_daxdev.
5.4 Wire formats the server must produce
Defined in include/fuse_kernel.h (libfuse's mirror of the kernel uapi):
struct fuse_famfs_fmap_header {
uint8_t file_type; /* enum fuse_famfs_file_type */
uint8_t reserved;
uint16_t fmap_version; /* FAMFS_FMAP_VERSION = 1 */
uint32_t ext_type; /* SIMPLE or INTERLEAVE */
uint32_t nextents;
uint32_t reserved0;
uint64_t file_size;
uint64_t reserved1;
};
struct fuse_famfs_simple_ext {
uint32_t se_devindex; /* index into the per-mount daxdev table */
uint32_t reserved;
uint64_t se_offset; /* PMD-aligned offset in that daxdev */
uint64_t se_len; /* PMD-aligned length */
};
struct fuse_famfs_iext { /* one interleaved extent */
uint32_t ie_nstrips;
uint32_t ie_chunk_size; /* PMD-aligned */
uint64_t ie_nbytes; /* total bytes covered by this extent */
uint64_t reserved;
};
struct fuse_daxdev_out {
uint16_t index;
uint16_t reserved;
uint32_t reserved2;
uint64_t reserved3;
uint64_t reserved4;
char name[256]; /* "/dev/daxN.M" */
};
GET_FMAP reply layout in the buffer (fmap_header followed by extents):
SIMPLE: [ fmap_header ][ simple_ext * nextents ]
INTERLEAVE: [ fmap_header ][ iext, simple_ext*nstrips,
iext, simple_ext*nstrips, ... ]
where there are `nextents` (iext + its strips) groups.
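As an illustration of the SIMPLE layout, a server could pack its reply roughly
as follows. This is a sketch under stated assumptions: the two structs are
copied from the definitions above, `pack_simple_fmap` is a hypothetical helper
name, and the numeric `file_type` / `ext_type` values are passed in by the
caller because their enum values are not listed in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FAMFS_FMAP_VERSION 1

struct fuse_famfs_fmap_header {
    uint8_t  file_type;
    uint8_t  reserved;
    uint16_t fmap_version;
    uint32_t ext_type;
    uint32_t nextents;
    uint32_t reserved0;
    uint64_t file_size;
    uint64_t reserved1;
};

struct fuse_famfs_simple_ext {
    uint32_t se_devindex;
    uint32_t reserved;
    uint64_t se_offset;
    uint64_t se_len;
};

/* Writes [ fmap_header ][ simple_ext * nextents ] into buf.
 * Returns the number of bytes written, or 0 if the reply would not fit
 * or nextents is 0. */
static size_t pack_simple_fmap(void *buf, size_t bufsize,
                               uint8_t file_type, uint32_t ext_type,
                               uint64_t file_size,
                               const struct fuse_famfs_simple_ext *exts,
                               uint32_t nextents)
{
    struct fuse_famfs_fmap_header hdr;
    size_t ext_bytes = (size_t)nextents * sizeof(*exts);
    size_t need = sizeof(hdr) + ext_bytes;

    assert(buf != NULL && exts != NULL);
    if (nextents == 0 || need > bufsize)
        return 0;

    memset(&hdr, 0, sizeof(hdr));          /* zero all reserved fields */
    hdr.file_type = file_type;
    hdr.fmap_version = FAMFS_FMAP_VERSION;
    hdr.ext_type = ext_type;
    hdr.nextents = nextents;
    hdr.file_size = file_size;

    memcpy(buf, &hdr, sizeof(hdr));
    memcpy((char *)buf + sizeof(hdr), exts, ext_bytes);
    return need;
}
```

The returned length is what the server would hand to fuse_reply_buf(); the
`bufsize` argument corresponds to the `size` hint passed to op.get_fmap.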
Validation rules the server MUST honor (else the kernel rejects the fmap):
* fmap_version == 1
* 1 <= nextents <= FUSE_FAMFS_MAX_EXTENTS (32)
* For each strip extent: ext_offset and ext_len PMD-aligned (2 MiB)
* For interleaved: chunk_size PMD-aligned, nstrips in [1, 32]
* sum of extent lengths >= file_size
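A server can pre-check a SIMPLE fmap against these rules before replying. The
following is a sketch only: `simple_fmap_ok` and `struct simple_ext` are
hypothetical names, the 2 MiB PMD size is the value quoted above, and the
kernel's actual checks may be stricter or ordered differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAMFS_FMAP_VERSION      1
#define FUSE_FAMFS_MAX_EXTENTS  32
#define SKETCH_PMD_SIZE         (1ULL << 21)   /* 2 MiB, as stated above */

/* Hypothetical in-memory form of one simple extent. */
struct simple_ext {
    uint32_t devindex;
    uint64_t offset;
    uint64_t len;
};

static bool pmd_aligned(uint64_t v)
{
    return (v & (SKETCH_PMD_SIZE - 1)) == 0;
}

/* Returns true if the fmap passes the rules listed in this section. */
static bool simple_fmap_ok(uint16_t version, uint64_t file_size,
                           const struct simple_ext *ext, uint32_t nextents)
{
    uint64_t total = 0;

    assert(ext != NULL);
    if (version != FAMFS_FMAP_VERSION)
        return false;
    if (nextents < 1 || nextents > FUSE_FAMFS_MAX_EXTENTS)
        return false;

    for (uint32_t i = 0; i < nextents; i++) {
        if (!pmd_aligned(ext[i].offset) || !pmd_aligned(ext[i].len))
            return false;
        total += ext[i].len;   /* overflow is not handled in this sketch */
    }
    return total >= file_size;
}
```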
GET_DAXDEV reply: a single fuse_daxdev_out where `name` is the path of a
character device that the kernel can `kern_path()` to a devdax inode.
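Filling the GET_DAXDEV reply is mostly a bounded string copy. A minimal
sketch, assuming the struct layout shown above; `fill_daxdev_out` is a
hypothetical helper, not part of libfuse.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct fuse_daxdev_out {
    uint16_t index;
    uint16_t reserved;
    uint32_t reserved2;
    uint64_t reserved3;
    uint64_t reserved4;
    char     name[256];
};

/* Returns 0 on success, -1 if path (plus its NUL) does not fit in name[]. */
static int fill_daxdev_out(struct fuse_daxdev_out *out, uint16_t index,
                           const char *path)
{
    size_t len;

    assert(out != NULL && path != NULL);
    len = strlen(path);
    if (len >= sizeof(out->name))
        return -1;

    memset(out, 0, sizeof(*out));        /* zero reserved fields and name */
    out->index = index;
    memcpy(out->name, path, len + 1);    /* include the terminating NUL */
    return 0;
}
```

On success the server would pass the whole struct to fuse_reply_buf(); on
failure it would reply with fuse_reply_err().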
5.5 What the server is responsible for
In the famfs design, the libfuse-based server still owns:
* Looking up files in the famfs metadata log (or whatever backend
userspace uses to track allocations).
* Producing fmaps that exactly describe the file's allocation.
* Producing the "/dev/daxN.M" path for each daxdev index it has
used in any fmap.
* All conventional FUSE ops: lookup, getattr, mkdir, unlink, etc.
The server is NOT in the path of any read/write/mmap once the fmap has
been delivered. There is no equivalent of FUSE_READ / FUSE_WRITE traffic
for famfs files.
--------------------------------------------------------------------------------
6. Open flow - GET_FMAP and (lazy) GET_DAXDEV
--------------------------------------------------------------------------------
When a regular file is opened on a famfs-enabled connection, the kernel pulls
the file's fmap from the server, parses it, resolves any unknown daxdev
indices via GET_DAXDEV, and installs the result on the inode.
fuse_open(inode, file) [fs/fuse/file.c]
|
+-- fuse_do_open() (regular FUSE open)
|
+-- if (fc->famfs_iomap && S_ISREG)
| fuse_get_fmap(fm, inode) [famfs.c]
| |
| +-- alloc fmap_buf (PAGE_SIZE)
| |
| +-- args.opcode = FUSE_GET_FMAP
| | args.nodeid = ino
| | args.out_argvar = true (variable-size reply)
| +-- fuse_simple_request(fm, &args) ----> server returns
| | fuse_famfs_fmap_header
| | + extents
| +-- famfs_file_init_dax(fm, inode, fmap_buf, fmap_size)
| |
| +-- famfs_fuse_meta_alloc()
| | parses header + extents into struct famfs_file_meta;
| | accumulates meta->dev_bitmap of referenced devindices;
| | validates PMD alignment + total size >= file_size;
| | cmpxchg-installs *metap (race-safe)
| |
| +-- famfs_update_daxdev_table(fm, meta)
| | if (!fc->dax_devlist) cmpxchg-allocate it
| | under famfs_devlist_sem (read):
| | collect indices that are NOT yet ->valid
| | drop lock, then for each index:
| | famfs_fuse_get_daxdev(fm, idx) <see below>
| |
| +-- inode_lock(inode)
| | famfs_meta_set(fi, meta) (cmpxchg, NULL=>meta)
| | if installed: i_size_write, S_DAX, a_ops=famfs_dax_aops
| | inode_unlock(inode)
|
+-- fuse_finish_open(inode, file)
+-- skip page cache invalidation if fuse_file_famfs(fi)
GET_DAXDEV per-index flow:
famfs_fuse_get_daxdev(fm, index):
args.opcode = FUSE_GET_DAXDEV
args.nodeid = index
fuse_simple_request() -----> server returns fuse_daxdev_out{.name = "/dev/daxX.Y"}
under famfs_devlist_sem (write):
if dd->valid: return /* lost race; OK */
famfs_verify_daxdev(name, &dd->devno):
kern_path() + d_backing_inode() + S_ISCHR
may_open_dev() /* exported in fs/namei.c */
dd->devno = inode->i_rdev
dd->name = kstrdup(name)
dd->devp = dax_dev_get(devno)
fs_dax_get(devp, fc, &famfs_fuse_dax_holder_ops)
on failure: dd->dax_err = 1 /* still mark valid */
wmb()
dd->valid = 1
--------------------------------------------------------------------------------
7. iomap interaction - the central design point
--------------------------------------------------------------------------------
famfs implements only `.iomap_begin`. There is no `.iomap_end` because there
is no allocation, dirty tracking, or completion bookkeeping.
const struct iomap_ops famfs_iomap_ops = {
.iomap_begin = famfs_fuse_iomap_begin,
};
The dax core (fs/dax.c) calls into famfs_iomap_ops from two entry points:
dax_iomap_rw(iocb, iter, ops) -- read_iter / write_iter
dax_iomap_fault(vmf, order, ...) -- mmap PTE/PMD/PUD faults
For the iomap concepts used here, see the primer in section 2.
7.1 read/write path
fuse_file_read_iter(iocb, to) [fs/fuse/file.c]
if (fuse_file_famfs(fi))
return famfs_fuse_read_iter(iocb, to); [famfs.c]
famfs_fuse_read_iter:
famfs_fuse_rw_prep(iocb, to):
famfs_file_bad(inode)? -> -EIO/-ENXIO
truncate iter to (i_size - ki_pos)
dax_iomap_rw(iocb, to, &famfs_iomap_ops) ===>
+-- repeatedly:
| iomap_begin(...)
| memcpy_from_pmem/to
| advance position
+-- returns bytes copied
famfs_fuse_write_iter is symmetric (no FOPEN_DIRECT_IO / passthrough fork;
splice paths return -EIO since famfs has no page cache).
7.2 mmap path
fuse_file_mmap(file, vma) [fs/fuse/file.c]
if (fuse_file_famfs(fi))
return famfs_fuse_mmap(file, vma);
famfs_fuse_mmap:
famfs_file_bad(inode)
vma->vm_ops = &famfs_file_vm_ops
vm_flags_set(vma, VM_HUGEPAGE) /* prefer 2MiB faults */
famfs_file_vm_ops:
.fault = famfs_filemap_fault (PTE)
.huge_fault = famfs_filemap_huge_fault (PMD/PUD)
.map_pages = filemap_map_pages
.page_mkwrite = famfs_filemap_mkwrite
.pfn_mkwrite = famfs_filemap_mkwrite
7.3 fault handler dispatch
__famfs_fuse_filemap_fault(vmf, pe_size, write_fault):
if (!IS_DAX(inode)) return SIGBUS
if (write_fault) sb_start_pagefault, file_update_time
ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops)
|
+-- internally calls famfs_fuse_iomap_begin to learn (dax_dev,
offset, length), then maps the resolved PFN into the VMA.
if (ret & VM_FAULT_NEEDDSYNC) ret = dax_finish_sync_fault(...)
7.4 iomap_begin - the resolver
famfs_fuse_iomap_begin(inode, offset, length, flags, iomap, srcmap)
meta = fi->famfs_meta
WARN_ON(i_size != meta->file_size)
return famfs_fileofs_to_daxofs(inode, iomap, offset, length, flags)
famfs_fileofs_to_daxofs (SIMPLE case):
validate dax_devlist + famfs_file_bad
walk meta->se[0..fm_nextents-1]:
if local_offset < se[i].ext_len:
dd = devlist[se[i].dev_index]
famfs_dax_err(dd) -> if errored, set meta->error and return
iomap->addr = se[i].ext_offset + local_offset
iomap->offset = file_offset
iomap->length = min(len, ext_len - local_offset)
iomap->dax_dev= dd->devp
iomap->type = IOMAP_MAPPED
return 0
local_offset -= se[i].ext_len
fall-through: zero-length iomap, return -EIO
famfs_fileofs_to_daxofs delegates to famfs_interleave_fileofs_to_daxofs for
INTERLEAVED_EXTENT (see section 8).
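The SIMPLE-case walk, and the way the DAX core calls it repeatedly, can be
reproduced in userspace. This is a sketch with hypothetical names
(`struct mapping`, `simple_fileofs_to_daxofs`, `count_mappings`); device
lookup and the error checks are left out, and -1 stands in for the kernel's
-EIO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct simple_ext {
    uint32_t dev_index;
    uint64_t ext_offset;
    uint64_t ext_len;
};

/* The subset of struct iomap that the resolver fills in. */
struct mapping {
    uint32_t dev_index;   /* stands in for iomap->dax_dev */
    uint64_t addr;        /* byte offset within that device */
    uint64_t offset;      /* file offset, echoed back */
    uint64_t length;      /* contiguous bytes valid from addr */
};

/* Resolve one file offset. Returns 0, or -1 if offset is past all extents. */
static int simple_fileofs_to_daxofs(const struct simple_ext *se, size_t nextents,
                                    uint64_t file_offset, uint64_t len,
                                    struct mapping *m)
{
    uint64_t local_offset = file_offset;

    assert(se != NULL && m != NULL);
    for (size_t i = 0; i < nextents; i++) {
        if (local_offset < se[i].ext_len) {
            uint64_t remain = se[i].ext_len - local_offset;

            m->dev_index = se[i].dev_index;
            m->addr = se[i].ext_offset + local_offset;
            m->offset = file_offset;
            m->length = len < remain ? len : remain;
            return 0;
        }
        local_offset -= se[i].ext_len;
    }
    m->length = 0;   /* zero-length mapping on fall-through */
    return -1;
}

/* Walk [pos, pos+len) the way the DAX core loops over iomap_begin.
 * Returns the number of resolver calls, or -1 on failure. */
static int count_mappings(const struct simple_ext *se, size_t nextents,
                          uint64_t pos, uint64_t len)
{
    int calls = 0;

    while (len > 0) {
        struct mapping m;

        if (simple_fileofs_to_daxofs(se, nextents, pos, len, &m) != 0)
            return -1;
        pos += m.length;
        len -= m.length;
        calls++;
    }
    return calls;
}
```

A request that straddles an extent boundary is split at the boundary, so it
takes two resolver calls, one per extent.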
7.5 The full iomap call graph
user process            fs/fuse/famfs.c                      fs/dax.c
------------            ----------------                     --------
read(2)/write(2)
  |
  v
fuse_file_read_iter / write_iter (file.c)
  |
  +--> famfs_fuse_{read,write}_iter
         |
         +--> famfs_fuse_rw_prep (sanity, truncate to i_size)
         |
         +--> dax_iomap_rw ---------------------------> iter loop
                                                          |
                                                          v
                                                        iomap_iter()
                                                          |
                        famfs_fuse_iomap_begin <----------+--> .iomap_begin
                          |
                        famfs_fileofs_to_daxofs
                          |    [+ interleave variant]
                          |
                        fc->dax_devlist[idx]
                          |
                        dd->devp / ext_offset
                          |
                          +-------------------------------+--> dax_iomap_iter()
                                                               memcpy via
                                                               dax_direct_access on
                                                               iomap->dax_dev /
                                                               iomap->addr

page fault on mmap region
  |
  v
.fault / .huge_fault (famfs_file_vm_ops)
  |
  +--> __famfs_fuse_filemap_fault
         |
         +--> dax_iomap_fault(vmf, order, ..., &famfs_iomap_ops)
                |
                +--> .iomap_begin
                |      |
                |    famfs_fuse_iomap_begin
                |      (resolves dax_dev + offset)
                |
                +--> dax_insert_pfn / vmf_insert_pfn_pmd
--------------------------------------------------------------------------------
8. Interleaved (striped) extents
--------------------------------------------------------------------------------
An interleaved extent stripes a contiguous logical region across N strips on
N (typically distinct) dax devices, in fixed-size chunks.
ie_nstrips = N
ie_chunk_size = C (must be PMD-aligned)
ie_nbytes = total logical bytes covered
Logical layout (N=4):
    file offset: [0      C][C     2C][2C    3C][3C    4C][4C   ...
                 | strip 0 | strip 1 | strip 2 | strip 3 | strip 0  ...
                 | stripe 0| stripe 0| stripe 0| stripe 0| stripe 1 ...
Resolution arithmetic in famfs_interleave_fileofs_to_daxofs():
chunk_num = local_offset / chunk_size
chunk_offset = local_offset % chunk_size
chunk_remainder = chunk_size - chunk_offset
stripe_num = chunk_num / nstrips
strip_num = chunk_num % nstrips
strip_offset = chunk_offset + stripe_num * chunk_size
iomap->addr = ie_strips[strip_num].ext_offset + strip_offset
iomap->dax_dev = devlist[ie_strips[strip_num].dev_index].devp
iomap->length = min(len, chunk_remainder)
iomap->type = IOMAP_MAPPED
The length is capped at chunk_remainder so the next iomap iteration steps to
the next chunk (which usually lives on a different device).
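The arithmetic above translates directly into C. This sketch uses
hypothetical type and function names and leaves out the devlist lookup and
error checks; only the offset computation is modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One strip of an interleaved extent (a simple extent without its length). */
struct strip {
    uint32_t dev_index;
    uint64_t ext_offset;
};

struct dax_mapping {
    uint32_t dev_index;   /* stands in for iomap->dax_dev */
    uint64_t addr;        /* byte offset within that device */
    uint64_t length;      /* capped at the end of the current chunk */
};

static void interleave_fileofs_to_daxofs(const struct strip *strips,
                                         uint64_t nstrips, uint64_t chunk_size,
                                         uint64_t local_offset, uint64_t len,
                                         struct dax_mapping *m)
{
    assert(strips != NULL && m != NULL && nstrips > 0 && chunk_size > 0);

    uint64_t chunk_num       = local_offset / chunk_size;
    uint64_t chunk_offset    = local_offset % chunk_size;
    uint64_t chunk_remainder = chunk_size - chunk_offset;
    uint64_t stripe_num      = chunk_num / nstrips;
    uint64_t strip_num       = chunk_num % nstrips;
    uint64_t strip_offset    = chunk_offset + stripe_num * chunk_size;

    m->dev_index = strips[strip_num].dev_index;
    m->addr      = strips[strip_num].ext_offset + strip_offset;
    m->length    = len < chunk_remainder ? len : chunk_remainder;
}
```

Worked example with N=4 and C=2 MiB (0x200000): local offset 0xA01000 is
5*C + 0x1000, so chunk_num is 5, which is strip 1 of stripe 1. The mapping
therefore starts 0x201000 bytes into strip 1, and a 4 MiB request is capped
at the 0x1ff000 bytes left in that chunk.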
--------------------------------------------------------------------------------
9. Memory-error / failure handling
--------------------------------------------------------------------------------
A famfs file becomes unusable when any of the conditions in the table below is
true. They are checked on every read/write/fault by famfs_file_bad() and
famfs_dax_err().
Source of error             Effect                Surface
---------------             ------                -------
fs_dax_get() fails          dd->dax_err = 1       famfs_dax_err  -> -EIO
notify_failure() upcall     dd->error = true      famfs_dax_err  -> -EHWPOISON
i_size != meta->file_size   meta->error = true    famfs_file_bad -> -ENXIO
IS_DAX(inode) cleared       (size change, etc.)   famfs_file_bad -> -ENXIO
notify_failure() flow:
devdax layer detects poison / reconfig
|
v
dax_holder_ops->notify_failure(dax_devp, offset, len, mf_flags)
= famfs_dax_notify_failure
fc = dax_holder(dax_devp)
famfs_set_daxdev_err(fc, dax_devp):
under famfs_devlist_sem (write):
find slot whose dd->devp == dax_devp
dd->error = true
pr_err
On the next iomap_begin, famfs_dax_err sees dd->error and returns -EHWPOISON;
meta->error is also set on that file so subsequent accesses short-circuit
via famfs_file_bad without touching dax.
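The per-device part of this table can be sketched as a small decision
function. The names are hypothetical, and the order in which the two device
flags are tested is an assumption; the document only states which flag maps
to which errno and that the file is marked errored as a side effect.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EHWPOISON
#define EHWPOISON 133   /* generic Linux value; fallback for other platforms */
#endif

struct daxdev_state {
    bool dax_err;   /* fs_dax_get() failed */
    bool error;     /* notify_failure() arrived */
};

struct file_state {
    bool error;     /* stands in for meta->error */
};

/* Returns 0 if the device is usable, otherwise a negative errno.
 * Any device error also marks the file as errored. */
static int dax_err_sketch(const struct daxdev_state *dd, struct file_state *meta)
{
    int err = 0;

    assert(dd != NULL && meta != NULL);
    if (dd->dax_err)
        err = -EIO;
    else if (dd->error)
        err = -EHWPOISON;

    if (err)
        meta->error = true;   /* later accesses fail in famfs_file_bad */
    return err;
}
```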
--------------------------------------------------------------------------------
10. Lifetime / teardown
--------------------------------------------------------------------------------
Per-inode:
fuse_alloc_inode (inode.c)
famfs_meta_set(fi, NULL) (init)
fuse_free_inode (inode.c)
if (S_ISREG && fuse_file_famfs(fi))
famfs_meta_free(fi)
-> __famfs_meta_free: frees se/ie arrays + struct
fuse_evict_inode
if (FUSE_IS_VIRTIO_DAX || fuse_file_famfs)
dax_break_layout_final(inode) (stop ongoing dax mappings)
Per-connection:
fuse_conn_put -> famfs_teardown(fc):
for each valid slot:
if dd->devp:
if (!dd->dax_err) fs_put_dax(dd->devp, fc) /* drop holder */
put_dax(dd->devp)
kfree(dd->name)
kfree(devlist->devlist)
kfree(devlist)
--------------------------------------------------------------------------------
11. Concurrency model
--------------------------------------------------------------------------------
fc->famfs_devlist_sem (rw_semaphore)
readers : iomap_begin paths reading devlist[idx]
famfs_update_daxdev_table while collecting "missing" indices
writers : famfs_fuse_get_daxdev (populating a slot)
famfs_set_daxdev_err (notify_failure)
cmpxchg pairs (NULL -> ptr installation, race-tolerant):
fc->dax_devlist (first-time allocation)
fi->famfs_meta (first GET_FMAP wins, others freed)
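The NULL -> ptr installation pattern looks like this when written with C11
atomics. It is a userspace analogue, not the kernel code: the kernel uses
cmpxchg(), and `install_once` and `struct meta` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct meta {
    int placeholder;   /* stands in for the parsed fmap */
};

/* Try to install `mine` into an empty slot.
 * Returns true if it was installed; otherwise another caller won the race,
 * `mine` is freed, and the slot keeps the winner's pointer. */
static bool install_once(_Atomic(struct meta *) *slot, struct meta *mine)
{
    struct meta *expected = NULL;

    assert(slot != NULL && mine != NULL);
    if (atomic_compare_exchange_strong(slot, &expected, mine))
        return true;

    free(mine);
    return false;
}
```

Either way the caller continues with whatever pointer is in the slot, so two
concurrent opens of the same file end up sharing one metadata object.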
wmb() before dd->valid=1 ensures readers that observe `valid` see fully
initialized name/devp/devno fields.
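The same publish-after-initialize idea, expressed with C11 release/acquire
ordering. This is a userspace sketch with hypothetical names; the kernel code
uses wmb(), and the reader-side barrier it pairs with is not described in
this document.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct daxdev_slot {
    const char   *name;
    unsigned long devno;
    atomic_bool   valid;   /* written last, read first */
};

/* Writer: fill in every field, then publish with a release store. */
static void slot_publish(struct daxdev_slot *dd, const char *name,
                         unsigned long devno)
{
    assert(dd != NULL && name != NULL);
    dd->name = name;
    dd->devno = devno;
    atomic_store_explicit(&dd->valid, true, memory_order_release);
}

/* Reader: an acquire load of `valid` that returns true guarantees the
 * fields written before the release store are visible. */
static const char *slot_name_if_valid(struct daxdev_slot *dd)
{
    assert(dd != NULL);
    if (!atomic_load_explicit(&dd->valid, memory_order_acquire))
        return NULL;
    return dd->name;
}
```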
--------------------------------------------------------------------------------
12. End-to-end timeline (read on a freshly opened famfs file)
--------------------------------------------------------------------------------
The "fuse server" column is whatever process is using libfuse (with
op.get_fmap / op.get_daxdev populated as in section 5).
app fuse/famfs (kernel) fuse server
--- ------------------- -----------
open("/mnt/famfs/x") |
----- VFS open -----> fuse_open |
fuse_do_open ----- OPEN -----> handles
<---- ok --------|
fuse_get_fmap |
- GET_FMAP ----->|
<- fmap reply ---|
famfs_fuse_meta_alloc |
famfs_update_daxdev_table |
[new device idx] |
- GET_DAXDEV --->|
<- daxdev reply -|
dax_dev_get + fs_dax_get |
dd->valid = 1 |
famfs_meta_set(fi, meta) |
inode->i_flags |= S_DAX |
i_data.a_ops = famfs_dax_aops |
<----- fd ------------ |
read(fd, buf, len) |
----- VFS read ----> fuse_file_read_iter |
famfs_fuse_read_iter |
famfs_fuse_rw_prep |
dax_iomap_rw |
.iomap_begin -> |
famfs_fuse_iomap_begin |
famfs_fileofs_to_daxofs |
-> iomap{dax_dev, addr} |
memcpy from dax memory |
<---- bytes ---------- |
| <-- no upcall on fast path
munmap / close
----- VFS release -> fuse_release ----- RELEASE ---> |
<---- ok --------|
unmount
----- umount ------> fuse_conn_put |
famfs_teardown |
fs_put_dax / put_dax all dd's |
--------------------------------------------------------------------------------
13. What deliberately is NOT in the kernel
--------------------------------------------------------------------------------
* Allocation and metadata mutation: handled in userspace; the kernel only
consumes fmaps as opaque-but-versioned blobs.
* Page cache and writeback: famfs_dax_aops is exclusively noop_dirty_folio.
* Truncate / append: any size change marks the file errored; recovery is a
userspace responsibility (typically: replay the famfs metadata log).
* fallocate / hole handling: files are never sparse and never have holes,
so iomap_begin only ever returns IOMAP_MAPPED (or zero-length on EOF).
* io-modes (FUSE_OPEN_*): bypassed for famfs files in iomode.c since
everything is direct-to-dax.
--------------------------------------------------------------------------------
14. Commit-by-commit map back to this design
--------------------------------------------------------------------------------
14.1 Kernel (linux.git, branch famfs)
ac071fbd94a6 Basic fuse kernel ABI
    -> Section 4 (negotiation), CONFIG_FUSE_FAMFS_DAX, fc->famfs_iomap bit
9a06500c1e0f Plumb GET_FMAP message/response
    -> Section 6 (fuse_get_fmap, fuse_open hook)
6f4e03a4e8e9 Create files with famfs fmaps
    -> Section 3 (famfs_file_meta),
       Section 6 (famfs_file_init_dax, famfs_fuse_meta_alloc)
dfc9e12bcb99 GET_DAXDEV msg + daxdev_table
    -> Section 3 (famfs_dax_devlist), Section 6 (GET_DAXDEV), famfs_teardown
d79f803dbfd1 Plumb dax iomap + r/w/mmap
    -> Section 7 (iomap_ops, fault, rw paths) and
       Section 8 (interleave resolver)
8731eb03c762 holder_ops for notify_failure()
    -> Section 9 (memory errors)
6ea21f89b361 DAX address_space_operations
    -> Section 1 / famfs_dax_aops
fae4d807da34 fmap metadata documentation
    -> kernel header comment in famfs_kfmap.h (Section 8)
da9edf77cbc4 Documentation/filesystems/famfs
    -> user-facing docs
14.2 libfuse (libfuse.git, branch famfs)
d75ae2ee fuse_kernel.h: bring up to baseline 6.19
Mechanical sync of include/fuse_kernel.h with the kernel uapi
up to 7.45 (everything BEFORE famfs). No new functionality;
this is the baseline the famfs commits build on.
e87be376 fuse_kernel.h: add famfs DAX fmap protocol definitions
Adds protocol 7.46:
* FUSE_DAX_FMAP capability bit
* FUSE_GET_FMAP / FUSE_GET_DAXDEV opcodes
* struct fuse_famfs_fmap_header / simple_ext / iext
* struct fuse_get_daxdev_in / fuse_daxdev_out
* enum fuse_famfs_file_type / famfs_ext_type
Pure header; mirrors include/uapi/linux/fuse.h on the kernel.
-> Section 5.4 (wire formats).
0b16c7d8 fuse: add famfs DAX fmap support
Wires the protocol into the libfuse lowlevel API:
* include/fuse_common.h: FUSE_CAP_DAX_FMAP (1UL << 32)
* include/fuse_lowlevel.h: op.get_fmap, op.get_daxdev
* lib/fuse_lowlevel.c:
- INIT: capable_ext / want_ext <-> FUSE_DAX_FMAP
- dispatch table entries for GET_FMAP / GET_DAXDEV
- do_get_fmap / do_get_daxdev forward to op callbacks
-> Section 4 (capability), Section 5.1-5.3 (libfuse API),
Section 5.5 (server responsibilities).
d1e6135c build(deps): bump github/codeql-action ... (CI; not relevant)
fa03307c doc: replace "futur irrealis"-like tense ... (man pages; not relevant)
9c65d781 Merge branch 'master' into famfs-6.19 (merge commit)
================================================================================
End of document.
================================================================================