On Fri, May 29, 2026 at 02:30:20PM -0700, Dave Jiang wrote:
>
>
> On 5/23/26 2:43 AM, Anisa Su wrote:
> > From: Ira Weiny <[email protected]>
> >
> > Dynamic capacity device extents may be left in an accepted state on a
> > device due to an unexpected host crash. In this case it is expected
> > that the creation of a new region on top of a DC partition can read
> > those extents and surface them for continued use.
> >
> > Once all endpoint decoders are part of a region and the region is being
> > realized, a read of the 'devices extent list' can reveal these
> > previously accepted extents.
> >
> > CXL r3.1 specifies the mailbox call Get Dynamic Capacity Extent List for
> > this purpose. The call returns all the extents for all dynamic capacity
> > partitions. If the fabric manager is adding extents to any DCD
> > partition, the extent list for the recovered region may change. In this
> > case the query must retry. Upon retry the query could encounter extents
> > which were accepted on a previous list query. Adding such extents is
> > ignored without error because they are entirely within a previous
> > accepted extent. Instead warn on this case to allow for differentiating
> > bad devices from this normal condition.
> >
> > Latch any errors to be bubbled up to ensure notification to the user
> > even if individual errors are rate limited or otherwise ignored.
> >
> > The scan for existing extents races with the dax_cxl driver. This is
> > synchronized through the region device lock. Extents which are found
> > after the driver has loaded will surface through the normal notification
> > path while extents seen prior to the driver are read during driver load.
> >
> > Based on an original patch by Navneet Singh.
> >
> > Reviewed-by: Jonathan Cameron <[email protected]>
> > Reviewed-by: Fan Ni <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
> > ---
> > drivers/cxl/core/core.h | 1 +
> > drivers/cxl/core/mbox.c | 116 ++++++++++++++++++++++++++++++++++
> > drivers/cxl/core/region_dax.c | 27 ++++++++
> > drivers/cxl/cxlmem.h | 21 ++++++
> > 4 files changed, 165 insertions(+)
> >
> > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> > index c28e357c5817..f5b05de5ed83 100644
> > --- a/drivers/cxl/core/core.h
> > +++ b/drivers/cxl/core/core.h
> > @@ -28,6 +28,7 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> > return container_of(cxlds, struct cxl_memdev_state, cxlds);
> > }
> >
> > +int cxl_process_extent_list(struct cxl_endpoint_decoder *cxled);
> > int cxl_region_invalidate_memregion(struct cxl_region *cxlr);
> >
> > #ifdef CONFIG_CXL_REGION
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 8071c1ed1b36..486110e1c03d 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -2083,6 +2083,122 @@ int cxl_dev_dc_identify(struct cxl_mailbox *mbox,
> > }
> > EXPORT_SYMBOL_NS_GPL(cxl_dev_dc_identify, "CXL");
> >
> > +/* Return -EAGAIN if the extent list changes while reading */
> > +static int __cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
> > +{
> > + u32 current_index, total_read, total_expected, initial_gen_num;
>
> initial_gen_num should be initialized to something invalid?
>
Since first is set to true, the if (first) always reads initial_gen_num
from the first extent, so it should be fine... right?
> > + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> > + struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> > + struct device *dev = mds->cxlds.dev;
> > + struct cxl_mbox_cmd mbox_cmd;
> > + u32 max_extent_count;
> > + int latched_rc = 0;
> > + bool first = true;
> > +
> > + struct cxl_mbox_get_extent_out *extents __free(kvfree) =
> > + kvmalloc(cxl_mbox->payload_size, GFP_KERNEL);
> > + if (!extents)
> > + return -ENOMEM;
> > +
> > + total_read = 0;
> > + current_index = 0;
> > + total_expected = 0;
> > + max_extent_count = (cxl_mbox->payload_size - sizeof(*extents)) /
> > + sizeof(struct cxl_extent);
> > + do {
> > + u32 nr_returned, current_total, current_gen_num;
> > + struct cxl_mbox_get_extent_in get_extent;
> > + int rc;
> > +
> > + get_extent = (struct cxl_mbox_get_extent_in) {
> > + .extent_cnt = cpu_to_le32(max(max_extent_count,
> > + total_expected -
> > current_index)),
>
> Should be min() here right?
>
That looks like it was the original intention, but since total_expected is
initialized to 0, using min would set extent_cnt to 0.
So I just did
.extent_cnt = cpu_to_le32(max_extent_count),
to always ask for the max and then the device is free to return < than
that. Should be ok, right?
> > + .start_extent_index = cpu_to_le32(current_index),
> > + };
> > +
> > + mbox_cmd = (struct cxl_mbox_cmd) {
> > + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> > + .payload_in = &get_extent,
> > + .size_in = sizeof(get_extent),
> > + .size_out = cxl_mbox->payload_size,
> > + .payload_out = extents,
> > + .min_out = 1,
> > + };
> > +
> > + rc = cxl_internal_send_cmd(cxl_mbox, &mbox_cmd);
> > + if (rc < 0)
> > + return rc;
> > +
> > + /* Save initial data */
> > + if (first) {
> > + total_expected =
> > le32_to_cpu(extents->total_extent_count);
> > + initial_gen_num = le32_to_cpu(extents->generation_num);
> > + first = false;
> > + }
> > +
> > + nr_returned = le32_to_cpu(extents->returned_extent_count);
> > + total_read += nr_returned;
> > + current_total = le32_to_cpu(extents->total_extent_count);
> > + current_gen_num = le32_to_cpu(extents->generation_num);
> > +
> > + dev_dbg(dev, "Got extent list %d-%d of %d generation Num:%d\n",
> > + current_index, total_read - 1, current_total,
> > current_gen_num);
> > +
> > + if (current_gen_num != initial_gen_num || total_expected !=
> > current_total) {
> > + dev_warn(dev, "Extent list change detected; gen %u !=
> > %u : cnt %u != %u\n",
> > + current_gen_num, initial_gen_num,
> > + total_expected, current_total);
> > + return -EAGAIN;
> > + }
> > +
> > + for (int i = 0; i < nr_returned ; i++) {
> > + struct cxl_extent *extent = &extents->extent[i];
> > +
> > + dev_dbg(dev, "Processing extent %d/%d\n",
> > + current_index + i, total_expected);
> > +
>
> Should probably hold the mds->add_ctx->lock before calling
> add_to_pending_list()? handle_add_event() holds the lock before calling.
> Maybe also add a lock assert in add_to_pending_list().
>
Oh yeah... it definitely should. Also lock assert added to
add_to_pending_list()
> > + rc = add_to_pending_list(&mds->add_ctx.pending_extents,
> > + extent);
> > + if (rc) {
> > + latched_rc = rc;
>
> Is the intention here to report the last found error and not the first error?
>
Changed to report first error:
if (rc && !latched_rc)
latched_rc = rc;
> > + }
>
> {} not needed if single line
>
removed
> > + }
> > +
> > + current_index += nr_returned;
> > + } while (total_expected > total_read);
> > +
> > + if (!latched_rc && !list_empty(&mds->add_ctx.pending_extents)) {
> > + latched_rc = cxl_add_pending(mds);
> > + }
> > + clear_pending_extents(mds);
> > +
> > + return latched_rc;
> > +}
> > +
> > +#define CXL_READ_EXTENT_LIST_RETRY 10
> > +
> > +/**
> > + * cxl_process_extent_list() - Read existing extents
> > + * @cxled: Endpoint decoder which is part of a region
> > + *
> > + * Issue the Get Dynamic Capacity Extent List command to the device
> > + * and add existing extents if found.
> > + *
> > + * A retry of 10 is somewhat arbitrary, however, extent changes should be
> > + * relatively rare while bringing up a region. So 10 should be plenty.
> > + */
> > +int cxl_process_extent_list(struct cxl_endpoint_decoder *cxled)
> > +{
> > + int retry = CXL_READ_EXTENT_LIST_RETRY;
> > + int rc;
> > +
> > + do {
> > + rc = __cxl_process_extent_list(cxled);
> > + } while (rc == -EAGAIN && retry--);
>
> I think it's retrying 11 times here.
>
Changed to while (rc == -EAGAIN && --retry);
> > +
> > + return rc;
> > +}
> > +
> > static void add_part(struct cxl_dpa_info *info, u64 start, u64 size, enum
> > cxl_partition_mode mode)
> > {
> > int i = info->nr_partitions;
> > diff --git a/drivers/cxl/core/region_dax.c b/drivers/cxl/core/region_dax.c
> > index 519e203c486a..e7a812e8b2e7 100644
> > --- a/drivers/cxl/core/region_dax.c
> > +++ b/drivers/cxl/core/region_dax.c
> > @@ -82,6 +82,26 @@ static void cxlr_dax_unregister(void *_cxlr_dax)
> > device_unregister(&cxlr_dax->dev);
> > }
> >
> > +static int cxlr_add_existing_extents(struct cxl_region *cxlr)
> > +{
> > + struct cxl_region_params *p = &cxlr->params;
> > + int i, latched_rc = 0;
> > +
> > + for (i = 0; i < p->nr_targets; i++) {
> > + struct device *dev = &p->targets[i]->cxld.dev;
> > + int rc;
> > +
> > + rc = cxl_process_extent_list(p->targets[i]);
> > + if (rc) {
> > + dev_err(dev, "Existing extent processing failed %d\n",
> > + rc);
> > + latched_rc = rc;
> > + }
> > + }
> > +
> > + return latched_rc;
> > +}
> > +
> > int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> > {
> > struct device *dev;
> > @@ -110,6 +130,13 @@ int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> > dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> > dev_name(dev));
> >
> > + if (cxlr->mode == CXL_PARTMODE_DYNAMIC_RAM_A) {
> > + rc = cxlr_add_existing_extents(cxlr);
>
> cxlr_add_existing_extents() -> cxl_process_extent_list() ->
> __cxl_process_extent_list() -> cxl_add_pending() ->
> CXL_MBOX_OP_ADD_DC_RESPONSE sent.
>
> CXL r4.0 8.2.10.9.9.3:
> Device shall report Invalid Physical Address if:
> One or more extents in the updated extent list specify a DPA range that has
> already been added with a previous call to the Add Dynamic Capacity Response.
>
> Aren't existing extents already been added previously and responded by
> ADD_DC_RESPONSE? For add existing extent path it seems like no response is
> needed to send to the device and can be skipped. Otherwise the software will
> receive error from the device when sending ADD_DC_RESPONSE.
>
very true
cxl_add_pending() gains the param: bool existing
__cxl_process_extent_list() calls cxl_add_pending(mds, true);
And before cxl_add_pending sends add DC response, check existing
/*
* Recovered (already-accepted) extents must not be re-reported in an
* Add-DC-Response: the device rejects a DPA range already added by a
* previous response (CXL r4.0 8.2.10.9.9.3, Invalid Physical Address).
*/
if (existing)
return 0;
return cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
pending, total_accepted);
> Would be good to get this tested on hw.
>
>
> > + if (rc)
> > + dev_err(&cxlr->dev,
> > + "Existing extent processing failed %d\n", rc);
>
> No return on error?
>
oops, fixed
> DJ
>
Thanks,
Anisa
> > + }
> > +
> > return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
> > no_free_ptr(cxlr_dax));
> > }
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index d992cc9b7811..1ad3dc7e413c 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -564,6 +564,27 @@ struct cxl_mbox_dc_response {
> > } __packed extent_list[] __counted_by(extent_list_size);
> > } __packed;
> >
> > +/*
> > + * Get Dynamic Capacity Extent List; Input Payload
> > + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
> > + */
> > +struct cxl_mbox_get_extent_in {
> > + __le32 extent_cnt;
> > + __le32 start_extent_index;
> > +} __packed;
> > +
> > +/*
> > + * Get Dynamic Capacity Extent List; Output Payload
> > + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
> > + */
> > +struct cxl_mbox_get_extent_out {
> > + __le32 returned_extent_count;
> > + __le32 total_extent_count;
> > + __le32 generation_num;
> > + u8 rsvd[4];
> > + struct cxl_extent extent[];
> > +} __packed;
> > +
> > struct cxl_mbox_get_supported_logs {
> > __le16 entries;
> > u8 rsvd[6];
>