Extend the CXL and DAX driver-api documentation to cover Dynamic Capacity Devices.

cxl-driver.rst gains a "Dynamic Capacity Extents" section describing the
conditions under which the CXL core accepts an offered extent
(per-extent: region resolution, full ED-range containment, no-overlap,
duplicate tolerance; per-tag-group: host-wide tag-uuid uniqueness,
sequence-number integrity, partition equality, alignment) and the
conditions under which a release request is honoured (DPA-range
containment in some member, tag match, DAX-layer EBUSY deferral,
whole-tag-group release). The host-wide uniqueness gate is enforced by
the cxl_tag_register registry in drivers/cxl/core/extent.c. For sequence
numbers the doc spells out both regimes (device-stamped 1..n on sharable
allocations, and host-assigned arrival-order 1..n, via cxl_add_pending's
logical_seq, on non-sharable allocations) and notes that the DAX layer
sees one unified 1..n dense invariant.

dax-driver.rst gains a "Dynamic Capacity (DC) Regions" section that lays
out the four-object layering device extent → dc_extent → dax_resource →
DAX device, with cardinalities: one tagged allocation maps to one
cxl_dc_tag_group containing N dc_extents and N dax_resources, claimed
into one DAX device with N range entries in seq_num order; an untagged
Add delivery becomes its own single-member group. Each dc_extent carries
its own hpa_range; there is no aggregated bounding-box range across
siblings. Tag-based DAX device creation, DC-only sizing rules (no grow,
size=0 to destroy), and the uuid attribute semantics are documented
alongside.

Signed-off-by: Anisa Su <[email protected]>
---
 .../driver-api/cxl/linux/cxl-driver.rst       | 149 ++++++++++++++++
 .../driver-api/cxl/linux/dax-driver.rst       | 167 ++++++++++++++++++
 2 files changed, 316 insertions(+)

diff --git a/Documentation/driver-api/cxl/linux/cxl-driver.rst b/Documentation/driver-api/cxl/linux/cxl-driver.rst
index dd6dd17dc536..cb08fc536da8 100644
--- a/Documentation/driver-api/cxl/linux/cxl-driver.rst
+++ b/Documentation/driver-api/cxl/linux/cxl-driver.rst
@@ -619,6 +619,155 @@ from HPA to DPA. This is why they must be aware of the entire interleave set.
 Linux does not support unbalanced interleave configurations. As a result, all
 endpoints in an interleave set must have the same ways and granularity.
 
+Dynamic Capacity Extents
+========================
+
+A `Dynamic Capacity Device (DCD)` advertises capacity in `DC partitions`
+and surfaces individual chunks of that capacity to the host as `extents`.
+The device may add an extent at any time (a `pending add`) and may
+request that a previously accepted extent be released (a `pending
+release`). Each transition is mediated by a mailbox handshake whose
+state machine the CXL driver enforces in
+:code:`drivers/cxl/core/{mbox.c,extent.c}`.
+
+Extents that share a non-null tag form one logical allocation. Each
+surviving member becomes its own :code:`struct dc_extent` (per-extent
+sysfs device, per-extent HPA range); their containing tag group is an
+internal-only :code:`struct cxl_dc_tag_group` keyed by UUID with no
+sysfs identity. Each :code:`dc_extent` becomes one
+:code:`dax_resource` on the DAX side, and a tagged DAX device is built
+by claiming every :code:`dax_resource` that carries the tag.
+
+For DAX-side semantics (how accepted extents materialize into
+:code:`dax_resource` objects and DAX devices), see
+:doc:`dax-driver`.
+
+Accepting Extents
+-----------------
+Extents are made available to the host from the device through DC ADD events.
+Event records contain extents, which may be tagged or untagged, shared or
+not shared. Multiple event records can be chained together by the `More` flag.
+
+The unit of allocation is a `tag`. All extents
+sharing a non-null tag form one allocation; the More flag is a delivery
+boundary only, meaning that when the More chain ends, the host can assume
+that all extents have been collected for each tag.
+A tag may be the null UUID (an `untagged` allocation, valid in
+non-sharable regions) or a non-null UUID identifying a sharable or
+non-sharable allocation.
+
+When a `More`-terminated chain of pending adds closes, the driver
+processes the pending list one tag group at a time. A group is
+committed only if it passes every gate below; failing any gate drops
+the entire group with a firmware-bug warning, and the dropped extents
+do not appear in the :code:`ADD_DC_RESPONSE`. There is no
+partial-extent acceptance: either an offered extent is accepted whole
+or it is dropped whole.
+
+Per-extent gates (applied in :code:`cxl_add_extent`,
+:code:`drivers/cxl/core/extent.c`):
+
+* The extent's DPA range must resolve to a CXL region via
+  :code:`cxl_dpa_to_region()`. An extent with no owning region is
+  dropped; the device sees the omission from :code:`ADD_DC_RESPONSE`.
+* The extent's DPA range must be `fully contained` in the endpoint
+  decoder's DPA range. An extent that straddles the decoder boundary
+  is rejected with :code:`-ENXIO`; the driver never clips an extent to
+  fit.
+* The extent must not overlap an extent already present in the same
+  region. Overlap classification is done in
+  :code:`cxlr_dax_classify_extent()` using :code:`range_overlaps()`.
+  Exact duplicates of a previously-accepted range are tolerated:
+  accepting the same range twice is a no-op, which simplifies
+  probe-time scans of the device's existing accepted list.
+
+Per-group gates (applied in :code:`cxl_add_pending`,
+:code:`drivers/cxl/core/mbox.c`):
+
+* `Host-wide tag uniqueness`: a non-null tag must not already
+  correspond to a live :code:`cxl_dc_tag_group` anywhere on this host.
+  The orchestrator (FM) owns tag-UUID allocation per spec; the
+  registry in :code:`drivers/cxl/core/extent.c`
+  (:code:`cxl_tag_register` / :code:`cxl_tag_already_committed`)
+  catches firmware bugs and orchestrator misbehavior across every
+  region and memdev. Skipped for the null UUID, which has no
+  cross-chain identity.
+* `Sequence-number integrity`: every member must carry the wire
+  field :code:`shared_extn_seq == 0` (non-sharable allocation), or
+  the group's sorted sequence numbers must be exactly
+  :code:`1, 2, …, n` (sharable allocation). Mixed, gapped,
+  duplicate, or non-zero-but-not-starting-at-1 sets are rejected.
+* `Partition equality`: every tagged extent in the group must
+  resolve to the same DC partition. A single allocation cannot span
+  partitions because CDAT describes sharable / writable / coherency
+  attributes per-partition. Skipped for the null UUID.
+* `Alignment`: every extent's :code:`start_dpa` and :code:`length`
+  must be :code:`CXL_DCD_EXTENT_ALIGN`-aligned. Partial acceptance
+  of an aligned subset would leave an unusable DAX device, so the
+  group is dropped instead.
+
+Surviving extents are sorted by the wire field
+:code:`shared_extn_seq` (a stable sort, so arrival order is preserved
+for the all-zero non-sharable case), and each becomes a
+:code:`dc_extent` inserted into a fresh :code:`cxl_dc_tag_group`
+keyed by the group's UUID. Each :code:`dc_extent` carries its own
+:code:`hpa_range`; the tag group itself has no aggregate range.
+
+As each surviving extent is attached, the host assigns it a 1..n
+:code:`seq_num`: for sharable allocations this equals the
+device-stamped :code:`shared_extn_seq` directly; for non-sharable
+allocations the device sends :code:`shared_extn_seq == 0` and the
+host fills in the arrival-order position (see :code:`logical_seq` in
+:code:`cxl_add_pending`). The DAX layer enforces the same
+:code:`1..n` dense invariant in both cases.
+
+The tag group is brought online via :code:`online_tag_group()`,
+which registers every member :code:`dc_extent` as an
+:code:`extentX.Y` child of :code:`cxlr_dax->dev`. The DAX layer is
+notified with :code:`DCD_ADD_CAPACITY`, and the accepted extents are
+spliced into the response list for a single :code:`ADD_DC_RESPONSE`
+mailbox command per More chain.
+
+Releasing Extents
+-----------------
+
+A release may be initiated by the device (a pending release
+notification) or by the host (when destroying a DAX device or tearing
+down a region). Both paths converge on :code:`cxl_rm_extent`
+(:code:`drivers/cxl/core/extent.c`).
+
+Per-extent gates:
+
+* The DPA range must resolve to a CXL region. If it does not (for
+  example, an extent left over from a host crash that has not yet
+  been re-claimed, or a duplicate release racing region teardown),
+  the release is acknowledged via :code:`memdev_release_extent()` so
+  the device knows the host is not using the capacity, and the
+  operation returns :code:`-ENXIO`.
+* The DPA range must be `fully contained` in some member
+  :code:`dc_extent`'s :code:`dpa_range` on the region's
+  :code:`cxlr_dax`, and the tag (UUID) on that member's
+  :code:`cxl_dc_tag_group` must match the release request. Releases
+  are keyed by :code:`(DPA range, tag)` rather than by pointer
+  because the device, not the host, supplies the identity. A
+  request that matches no :code:`dc_extent` is rejected with
+  :code:`-EINVAL`.
+
+If those gates pass, the DAX layer is notified with
+:code:`DCD_RELEASE_CAPACITY` and consulted for permission to proceed.
+If the DAX layer returns :code:`-EBUSY` (the capacity is still mapped
+or otherwise in use), the release is deferred and
+:code:`cxl_rm_extent` returns success without unregistering anything.
+When the DAX layer ultimately grants release,
+:code:`rm_tag_group()` invalidates the backing memregion once for the
+whole group, then unregisters every member :code:`dc_extent` device,
+which cascades through the DAX layer to drop the corresponding
+:code:`dax_resource`\ s.
+
+The release path is always whole-tag-group: tagged allocations
+release atomically, and the kernel does not split a group in response
+to a sub-range release request.
+
 Example Configurations
 ======================
 .. toctree::
diff --git a/Documentation/driver-api/cxl/linux/dax-driver.rst b/Documentation/driver-api/cxl/linux/dax-driver.rst
index 10d953a2167b..07f08396f639 100644
--- a/Documentation/driver-api/cxl/linux/dax-driver.rst
+++ b/Documentation/driver-api/cxl/linux/dax-driver.rst
@@ -27,6 +27,173 @@ CXL capacity in the task's page tables.
 Users wishing to manually handle allocation of CXL memory should use this
 interface.
 
+Dynamic Capacity (DC) Regions
+=============================
+A region backed by a CXL `Dynamic Capacity Device (DCD)` is a `DC region`:
+its HPA window is fixed at probe time, but the DPA capacity that fills the
+window arrives and departs at runtime as the device offers and reclaims
+`extents`. DC regions are distinguished from static regions by the
+:code:`IORESOURCE_DAX_DCD` flag on the :code:`dax_region`.
+
+For the CXL-side rules governing when an offered extent is accepted or a
+release request is honoured, see :doc:`cxl-driver`. This section covers
+the DAX-side mapping between accepted extents and DAX devices.
+
+The Extent Layering Model
+-------------------------
+Four objects sit between the wire-level CXL extent and the
+user-visible DAX device. Understanding the cardinality between them
+is the key to the DC-region model.
+
+::
+
+   device extents     dc_extent     dax_resource         DAX device
+   (CXL device)       (CXL core)    (DAX bus)            (/dev/daxN.Y)
+   -------------      ------------- -------------        ------------
+   e1 ─┐              ┌─► dc_e1 ──► res_1 (seq=1) ──┐
+   e2 ─┼─── tag A ──► ┼─► dc_e2 ──► res_2 (seq=2) ──┼──► daxN.0
+   e3 ─┘              └─► dc_e3 ──► res_3 (seq=3) ──┘    (claimed by tag A,
+                                                          size = Σ |e_i|)
+
+   e4 ─── tag B ────►     dc_e4 ──► res_4 (seq=1) ────►  daxN.1
+
+   e5 ─── null tag ─►     dc_e5 ──► res_5 (seq=0) ────►  daxN.2
+   e6 ─── null tag ─►     dc_e6 ──► res_6 (seq=0) ────►  daxN.3
+
+The CXL core groups extents sharing a non-null tag into a single
+:code:`cxl_dc_tag_group` (internal-only, no sysfs identity), but each
+member extent stays a distinct :code:`dc_extent` with its own HPA
+range. The DAX bridge creates one :code:`dax_resource` per
+:code:`dc_extent`, and userspace claims a DAX device by writing the
+tag's UUID to the seed device's :code:`uuid` attribute, which carves
+every matching :code:`dax_resource` (in :code:`seq_num` order) into
+the device's :code:`ranges[]` array.
+
+`Device extent`
+    The unit the CXL device delivers over the mailbox: a
+    :code:`(DPA, length, tag, shared_extn_seq)` tuple inside an
+    Add-Capacity event. The tag is either a non-null UUID (a
+    `tagged allocation`) or the null UUID (`untagged`).
+
+:code:`dc_extent`
+    The CXL core's per-extent object, one per surviving device extent.
+    Each :code:`dc_extent` is registered as its own :code:`extentX.Y`
+    sysfs device under :code:`cxlr_dax->dev` and carries its own
+    :code:`hpa_range`; there is no aggregated / bounding-box HPA
+    range across siblings. Members of one tag group point at a
+    shared :code:`cxl_dc_tag_group` (which holds the UUID and a
+    manual refcount on the surviving siblings) but otherwise exist as
+    independent kernel objects.
+
+    For a `non-null tag`, the host-wide tag-uniqueness gate
+    (:doc:`cxl-driver`) guarantees there is at most one
+    :code:`cxl_dc_tag_group` per UUID on the host, so the set of
+    :code:`dc_extent`\ s sharing that UUID is a single allocation.
+
+    For the `null tag` there is no cross-event identity: the spec is
+    silent on aggregating untagged extents across Add-Capacity events.
+    Each untagged device extent becomes its own :code:`dc_extent` in
+    its own single-member tag group; two untagged extents delivered
+    separately are two distinct allocations.
+
+:code:`dax_resource`
+    The DAX bus's per-extent view, one-to-one with :code:`dc_extent`.
+    When the CXL DAX driver receives a :code:`DCD_ADD_CAPACITY`
+    notification it iterates the tag group and calls
+    :code:`dax_region_add_resource()` once per member, creating one
+    :code:`dax_resource` per :code:`dc_extent`. Each
+    :code:`dax_resource` carries that member's HPA range, the tag
+    UUID (copied from :code:`dc_extent->group->uuid`), and a 1..n
+    :code:`seq_num` so :code:`uuid_claim_tagged` can carve the matched
+    set into the device's :code:`ranges[]` array in the right order
+    (see :code:`drivers/dax/bus.c`).
+
+`DAX device` (:code:`/dev/daxN.Y`)
+    Created by userspace claiming a set of :code:`dax_resource`\ s via
+    the :code:`uuid` sysfs attribute. Each DAX device corresponds to
+    exactly one allocation:
+
+    * A `tagged` DAX device is built from every :code:`dax_resource`
+      carrying the tag (one per :code:`dc_extent` in the allocation),
+      carved into the device's :code:`ranges[]` in :code:`seq_num`
+      order. Its size equals the sum of every member's size.
+    * An `untagged` DAX device is built from one untagged
+      :code:`dax_resource` and its size equals that one extent.
+
+So the end-to-end rule is: **one tagged allocation = one
+cxl_dc_tag_group = N dc_extents = N dax_resources = one DAX device
+with N range entries**. An untagged device extent becomes its own
+:code:`dc_extent` / :code:`dax_resource` / single-range DAX device,
+claimed one at a time.
+
+Release follows the same layering in reverse. When the CXL core
+calls :code:`rm_tag_group()` (after the device asks for release and
+the DAX layer consents), the DAX bridge collects every matching
+:code:`dax_resource` and removes them as a set via
+:code:`dax_region_rm_resources()`. The removal is all-or-nothing
+under :code:`dax_region_rwsem`: if any member is in use, the whole
+group stays in place. When removal commits, the HPA capacity returns
+to the region's free pool and any DAX device that had claimed it is
+left with no backing capacity. Userspace tears the DAX device down via
+:code:`daxctl destroy-device` (size=0, then write the device name to
+the region's :code:`delete` attribute).
+
+UUID-Based DAX Device Creation
+------------------------------
+A DAX device on a DC region is created by writing a UUID to the
+seed device's :code:`uuid` attribute
+(:code:`/sys/bus/dax/devices/daxN.Y/uuid`). The seed starts at
+size 0; writing :code:`uuid` is a `claim` operation that resolves
+the layering above and populates the device:
+
+* A `non-null UUID` claims `every` :code:`dax_resource` whose tag
+  matches. :code:`uuid_claim_tagged` (in
+  :code:`drivers/dax/bus.c`) collects them, sorts by
+  :code:`seq_num`, enforces the dense :code:`1..n` invariant, and
+  carves each via :code:`__dev_dax_resize` in :code:`seq_num` order
+  so the device's :code:`ranges[]` array is dense and ordered.
+  The resulting DAX device represents exactly the tagged
+  allocation: its size equals the sum of every member extent's
+  size.
+
+  The dense :code:`1..n` invariant is the unified rule the CXL
+  side maintains for both sharable and non-sharable allocations
+  (see :doc:`cxl-driver`); the match set has exactly one entry per
+  :code:`dc_extent` in the tag group.
+
+* The value :code:`"0"` is shorthand for the null UUID and claims
+  exactly `one` untagged :code:`dax_resource`. Untagged
+  :code:`dax_resource`\ s correspond to independent untagged
+  allocations; collapsing several into one device would aggregate
+  unrelated capacity, so each :code:`uuid` write consumes a single
+  untagged resource.
+
+* A write that matches no :code:`dax_resource` returns
+  :code:`-ENOENT` and the device remains at size 0.
+
+* Writes to the :code:`uuid` attribute on non-DC regions return
+  :code:`-EOPNOTSUPP`; the attribute itself is read-only (0444) on
+  non-DC devices.
+
+The device's size is determined entirely by the backing allocation:
+users do not choose a size on DC regions. Accordingly, the
+:code:`size` attribute on a DC DAX device rejects grow requests
+with :code:`-EOPNOTSUPP`. Writing :code:`0` is still permitted and is
+how :code:`daxctl destroy-device` returns each claimed extent to the
+region's available pool before the device's name is written to the
+region's :code:`delete` attribute.
+
+Reads of :code:`uuid` report the tag identifying the capacity
+backing the device:
+
+* For a DC DAX device claimed with a non-null UUID, :code:`uuid`
+  reads back the claimed UUID.
+* For a DC DAX device claimed via :code:`"0"`, or for any
+  non-DCD DAX device, :code:`uuid` reads :code:`0`.
+
+See :code:`Documentation/ABI/testing/sysfs-bus-dax` for the
+authoritative attribute contracts.
+
 kmem conversion
 ===============
 The :code:`dax_kmem` driver converts a `DAX Device` into a series of `hotplug
-- 
2.43.0
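
Illustrative note (not part of the patch): the sequence-number integrity gate and the host-side seq_num assignment described under "Accepting Extents" can be modelled in a few lines of Python. This is only a sketch of the documented rule, not the kernel implementation. The `Extent` class and the `assign_seq_nums` function are hypothetical names introduced for this example; only the field names `start_dpa`, `length` and `shared_extn_seq` come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    """Hypothetical stand-in for one offered device extent."""
    start_dpa: int
    length: int
    shared_extn_seq: int  # wire field; 0 on non-sharable allocations


def assign_seq_nums(group):
    """Model of the documented gate for one tag group.

    Returns a list of (seq_num, extent) pairs numbered 1..n, or None
    if the group would be dropped by the sequence-number gate.
    """
    wire = [e.shared_extn_seq for e in group]

    if all(s == 0 for s in wire):
        # Non-sharable: the host numbers members in arrival order.
        return [(i, e) for i, e in enumerate(group, start=1)]

    # Sharable: the sorted device-stamped numbers must be exactly 1..n.
    # sorted() is stable, matching the stable sort the text describes.
    ordered = sorted(group, key=lambda e: e.shared_extn_seq)
    expected = list(range(1, len(group) + 1))
    if [e.shared_extn_seq for e in ordered] != expected:
        # Mixed zero/non-zero, gapped, duplicate, or not starting at 1.
        return None
    return [(e.shared_extn_seq, e) for e in ordered]


if __name__ == "__main__":
    sharable = [Extent(0x0000, 0x1000, 2), Extent(0x1000, 0x1000, 1)]
    for seq, ext in assign_seq_nums(sharable):
        print(seq, hex(ext.start_dpa))
```

In both accepted cases the result is numbered densely from 1, which is the single invariant the DAX side is described as checking when it claims a tagged allocation.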

