On Sun, Apr 13, 2025 at 05:52:20PM -0500, Ira Weiny wrote:i
A few notes while going through and removing sparse dax semantics and plumbing
for fs-dax mode:
> A dynamic capacity device (DCD) sends events to signal the host for
> changes in the availability of Dynamic Capacity (DC) memory. These
> events contain extents describing a DPA range and meta data for memory
> to be added or removed. Events may be sent from the device at any time.
>
> Three types of events can be signaled, Add, Release, and Force Release.
>
> On add, the host may accept or reject the memory being offered. If no
> region exists, or the extent is invalid, the extent should be rejected.
> Add extent events may be grouped by a 'more' bit which indicates those
> extents should be processed as a group.
>
> On remove, the host can delay the response until the host is safely not
> using the memory. If no region exists the release can be sent
> immediately. The host may also release extents (or partial extents) at
> any time.
Partial release is no longer valid for tagged release iirc from the calls
> Thus the 'more' bit grouping of release events is of less
> value and can be ignored in favor of sending multiple release capacity
> responses for groups of release events.
>
[snip]
> +
> +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
> + struct xarray *extent_array, int cnt)
> +{
> + struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> + struct cxl_mbox_dc_response *p;
> + struct cxl_extent *extent;
> + unsigned long index;
> + u32 pl_index;
> +
> + size_t pl_size = struct_size(p, extent_list, cnt);
> + u32 max_extents = cnt;
> +
> + /* May have to use more bit on response. */
> + if (pl_size > cxl_mbox->payload_size) {
> + max_extents = (cxl_mbox->payload_size - sizeof(*p)) /
> + sizeof(struct updated_extent_list);
> + pl_size = struct_size(p, extent_list, max_extents);
> + }
> +
> + struct cxl_mbox_dc_response *response __free(kfree) =
> + kzalloc(pl_size, GFP_KERNEL);
> + if (!response)
> + return -ENOMEM;
> +
> + if (cnt == 0)
> + return send_one_response(cxl_mbox, response, opcode, 0, 0);
> +
> + pl_index = 0;
I was wondering why xarray is used here instead of a list? I didn't see anywhere
that we need to look up a specific index to benefit from the log complexity and
afaict, simply used to iterate over all elements.
> + xa_for_each(extent_array, index, extent) {
> + response->extent_list[pl_index].dpa_start = extent->start_dpa;
> + response->extent_list[pl_index].length = extent->length;
> + pl_index++;
> +
> + if (pl_index == max_extents) {
> + u8 flags = 0;
> + int rc;
> +
> + if (pl_index < cnt)
> + flags |= CXL_DCD_EVENT_MORE;
> + rc = send_one_response(cxl_mbox, response, opcode,
> + pl_index, flags);
> + if (rc)
> + return rc;
> + cnt -= pl_index;
> + pl_index = 0;
> + }
> + }
> +
> + if (!pl_index) /* nothing more to do */
> + return 0;
> + return send_one_response(cxl_mbox, response, opcode, pl_index, 0);
> +}
> +
> +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
> +{
> + struct device *dev = mds->cxlds.dev;
> + struct xarray extent_list;
> +
> + struct cxl_extent extent = {
> + .start_dpa = cpu_to_le64(range->start),
> + .length = cpu_to_le64(range_len(range)),
> + };
> +
> + dev_dbg(dev, "Release response dpa %pra\n", &range);
> +
> + xa_init(&extent_list);
> + if (xa_insert(&extent_list, 0, &extent, GFP_KERNEL)) {
> + dev_dbg(dev, "Failed to release %pra\n", &range);
> + goto destroy;
> + }
> +
> + if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1))
> + dev_dbg(dev, "Failed to release %pra\n", &range);
> +
> +destroy:
> + xa_destroy(&extent_list);
> +}
> +
> +static int validate_add_extent(struct cxl_memdev_state *mds,
> + struct cxl_extent *extent)
> +{
> + int rc;
> +
> + rc = cxl_validate_extent(mds, extent);
> + if (rc)
> + return rc;
> +
> + return cxl_add_extent(mds, extent);
> +}
> +
> +static int cxl_add_pending(struct cxl_memdev_state *mds)
> +{
> + struct device *dev = mds->cxlds.dev;
> + struct cxl_extent *extent;
> + unsigned long cnt = 0;
> + unsigned long index;
> + int rc;
> +
Also according to the spec:
"In response to an Add Capacity Event Record, or multiple Add Capacity Event
records grouped via the More flag (see Table 8-229), the host is expected to
respond with exactly one Add Dynamic Capacity Response acknowledgment,
corresponding to the order of the Add Capacity Events received. If the order
does not match, the device shall return Invalid Input. The Add Dynamic Capacity
Response acknowledgment must be sent in the same order as the Add Capacity
Event Records."
Using xarray does not preserve the order of the extents, which requires a fifo
queue.
> + xa_for_each(&mds->pending_extents, index, extent) {
> + if (validate_add_extent(mds, extent)) {
> + /*
> + * Any extents which are to be rejected are omitted from
> + * the response. An empty response means all are
> + * rejected.
> + */
> + dev_dbg(dev, "unconsumed DC extent DPA:%#llx
> LEN:%#llx\n",
> + le64_to_cpu(extent->start_dpa),
> + le64_to_cpu(extent->length));
> + xa_erase(&mds->pending_extents, index);
> + kfree(extent);
> + continue;
> + }
> + cnt++;
> + }
> + rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> + &mds->pending_extents, cnt);
> + xa_for_each(&mds->pending_extents, index, extent) {
> + xa_erase(&mds->pending_extents, index);
> + kfree(extent);
> + }
> + return rc;
> +}
> +
[snip]
Thanks,
Anisa