On Tue, Apr 19, 2022 at 2:50 PM Dan Williams <[email protected]> wrote:
>
> On Tue, Apr 19, 2022 at 9:43 AM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Mon, Apr 18, 2022 at 09:42:00AM -0700, Dan Williams wrote:
> > > [ add the usual HMM suspects Christoph, Jason, and John ]
> > >
> > > On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <[email protected]>
> > > wrote:
> > > >
> > > > Define an API which allows CXL drivers to manage CXL address space.
> > > > CXL is unique in that the address space and various properties are only
> > > > known after CXL drivers come up, and therefore cannot be part of core
> > > > memory enumeration.
> > >
> > > I think this buries the lead on the problem introduced by
> > > MEMORY_DEVICE_PRIVATE in the first place. Let's revisit that history
> > > before diving into what CXL needs.
> > >
> > >
> > > Commit 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using
> > > ZONE_DEVICE") introduced the concept of MEMORY_DEVICE_PRIVATE. At its
> > > core MEMORY_DEVICE_PRIVATE uses the ZONE_DEVICE capability to annotate
> > > an "unused" physical address range with 'struct page' for the purpose
> > > of coordinating migration of buffers onto and off of a GPU /
> > > accelerator. The determination of "unused" was based on a heuristic,
> > > not a guarantee, that any address range not expressly conveyed in the
> > > platform firmware map of the system can be repurposed for software
> > > use. The CXL Fixed Memory Windows Structure (CFMWS) definition
> > > explicitly breaks the assumptions of that heuristic.
> >
> > So CXL defines an address map that is not part of the FW list?
>
> It defines a super-set of 'potential' address space and a subset that
> is active in the FW list. It's similar to memory hotplug where an
> address range may come online after the fact, but unlike ACPI memory
> hotplug, FW is not involved in the hotplug path, and FW cannot predict
> what address ranges will come online.
> For example ACPI hotplug knows
> in advance to publish the ranges that can experience an online /
> insert event, CXL has many more degrees of freedom.
>
> >
> > > > It would be desirable to simply insert this address space into
> > > > iomem_resource with a new flag to denote this is CXL memory. This would
> > > > permit request_free_mem_region() to be reused for CXL memory provided it
> > > > learned some new tricks. For that, it is tempting to simply use
> > > > insert_resource(). The API was designed specifically for cases where new
> > > > devices may offer new address space. This cannot work in the general
> > > > case. Boot firmware can pass some, none, or all of the CFMWS range as
> > > > various types of memory to the kernel, and this may be left alone,
> > > > merged, or even expanded.
> >
> > And then we understand that on CXL the FW might pass stuff that
> > intersects with the actual CXL ranges?
> >
> > > > As a result iomem_resource may intersect CFMWS
> > > > regions in ways insert_resource cannot handle [2]. Similar reasoning
> > > > applies to allocate_resource().
> > > >
> > > > With the insert_resource option out, the only reasonable approach left
> > > > is to let the CXL driver manage the address space independently of
> > > > iomem_resource and attempt to prevent users of device private memory
> >
> > And finally due to all these FW problems we are going to make a 2nd
> > allocator for physical address space and just disable the normal one?
>
> No, or I am misunderstanding this comment. The CXL address space
> allocator is managing space that can be populated and become an
> iomem_resource. So it's not supplanting iomem_resource it is
> coordinating dynamic extensions to the FW map.
>
> > Then since DEVICE_PRIVATE is a notable user of this allocator we now
> > understand it becomes broken?
> >
> > Sounds horrible.
> > IMHO you should fix the normal allocator somehow to
> > understand that the ranges from FW have been reprogrammed by Linux
>
> There is no reprogramming of the ranges from FW. CXL memory that is
> mapped as System RAM at boot will have the CXL decode configuration
> locked in all the participating devices. The remaining CXL decode
> space is then available for dynamic reconfiguration of CXL resources
> from the devices that the FW explicitly ignores, which is all
> hot-added devices and all persistent-memory capacity.
>
> > and
> > not try to build a whole different allocator in CXL code.
>
> I am not seeing much overlap for DEVICE_PRIVATE and CXL to share an
> allocator. CXL explicitly wants ranges that have been set aside for
> CXL and are related to 1 or more CXL host bridges. DEVICE_PRIVATE
> wants to consume an unused physical address range to proxy
> device-local-memory with no requirements on what range is chosen as
> long as it does not collide with anything else.

...or are you suggesting to represent CXL free memory capacity in
iomem_resource and augment the FW list early with CXL ranges? That
seems doable, but it would only represent the free CXL ranges in
iomem_resource, as the populated CXL ranges cannot have their resources
reparented after the fact, and there is plenty of code that expects
"System RAM" to be a top-level resource.
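
To make "only the free CXL ranges" concrete, here is a rough userspace
sketch, not code from this series and not kernel API. It assumes a CFMWS
window and the already-populated ("System RAM") ranges are available as
sorted, non-overlapping inclusive ranges, and it computes the gaps that
could be published as separate entries without reparenting anything.
`struct range` and `cxl_window_free_ranges()` are hypothetical names made
up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive address range, in the style of struct resource start/end. */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical helper: find the parts of the window 'win' that are not
 * covered by any entry in busy[]. busy[] is assumed to be sorted by
 * start address and non-overlapping (e.g. populated System RAM ranges).
 *
 * Stores at most max_free gaps in free_out[] and returns the number of
 * gaps found, which may be larger than max_free.
 */
size_t cxl_window_free_ranges(struct range win,
			      const struct range *busy, size_t nr_busy,
			      struct range *free_out, size_t max_free)
{
	uint64_t cursor = win.start;	/* lowest address not yet classified */
	size_t n = 0;
	int covered_to_end = 0;

	assert(win.start <= win.end);

	for (size_t i = 0; i < nr_busy; i++) {
		if (busy[i].end < win.start)
			continue;	/* entirely below the window */
		if (busy[i].start > win.end)
			break;		/* entirely above the window */

		/* Anything between cursor and this busy range is free. */
		if (busy[i].start > cursor) {
			if (n < max_free) {
				free_out[n].start = cursor;
				free_out[n].end = busy[i].start - 1;
			}
			n++;
		}

		/* Stop before 'end + 1' could run past the window or wrap. */
		if (busy[i].end >= win.end) {
			covered_to_end = 1;
			break;
		}
		cursor = busy[i].end + 1;
	}

	/* Tail of the window after the last busy range. */
	if (!covered_to_end) {
		if (n < max_free) {
			free_out[n].start = cursor;
			free_out[n].end = win.end;
		}
		n++;
	}

	return n;
}
```

With a window of 0x100000000-0x4ffffffff and populated ranges at
0x100000000-0x1ffffffff and 0x300000000-0x3ffffffff, this yields two free
ranges: 0x200000000-0x2ffffffff and 0x400000000-0x4ffffffff. The populated
ranges stay where they are as top-level entries; only the gaps would be
new.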
