On Jul 22 10:00, Andrzej Jakowski wrote:
> On 7/22/20 12:43 AM, Klaus Jensen wrote:
> > @keith, please see below - can you comment on the Linux kernel 2 MB
> > boundary requirement for the CMB? Or should we hail Stephen (or Logan
> > maybe) since this seems to be related to p2pdma?
> >
> > On Jul 21 14:54, Andrzej Jakowski wrote:
> >> On 7/15/20 1:06 AM, Klaus Jensen wrote:
> >>> Hi Andrzej,
> >>>
> >>> I've not been ignoring this, but sorry for not following up earlier.
> >>>
> >>> I'm hesitant to merge anything that very obviously breaks an OS that we
> >>> know is used a lot with this device. Also because the issue has not
> >>> been analyzed well enough to actually know whether this is a QEMU or
> >>> kernel issue.
> >>
> >> Hi Klaus,
> >>
> >> Thx for your response! I understand your hesitance on merging stuff that
> >> obviously breaks the guest OS.
> >>
> >>> Now, as far as I can test, having the MSI-X vector table and PBA in
> >>> BAR 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
> >>> (regardless of IOMMU on/off).
> >>>
> >>> Later, when the issue is better understood, we can add options to set
> >>> offsets, BIRs etc.
> >>>
> >>> The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr
> >>> to be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
> >>> the git://git.infradead.org/qemu-nvme.git nvme-next branch.
> >>>
> >>> Can you reproduce the issues with that patch? I can't on a stock Arch
> >>> Linux 5.7.5-arch1-1 kernel.
> >>
> >> While I'm happy that the approach with MSI-X and PBA in BAR0 works fine,
> >> I feel that the investigation into why it works while mine doesn't is
> >> missing. It looks to me that both patches basically follow the same
> >> approach: create a memory subregion and overlay it on top of another
> >> memory region. Why does one work and the other not, then?
> >>
> >> With that in mind, I have recently focused on understanding the problem.
> >> I observed that when the guest assigns an address to BAR4, the addr
> >> field in the nvme-bar4 memory region gets populated, but it does not get
> >> automatically populated in the ctrl_mem (CMB) memory subregion, so later
> >> when nvme_addr_is_cmb() is called the address check works incorrectly
> >> and as a consequence the VMM does a DMA read instead of a memcpy.
> >> I created a patch that sets the correct address on the ctrl_mem
> >> subregion, and the guest OS boots up correctly.
> >>
> >> When I looked into the PCI and memory region code I noticed that indeed
> >> the address is only assigned to the top-level memory region, not to the
> >> contained subregions. I think that because in your approach the CMB
> >> grabs the whole BAR exclusively, it works fine.
> >>
> >> Here is my question (perhaps PCI folks can help answer :)): if we
> >> consider memory region overlapping for PCI devices a valid use case,
> >> should the PCI code on configuration cycles walk through all contained
> >> subregions and update their addr fields accordingly?
> >>
> >> Thx!
> >
> > Hi Andrzej,
> >
> > Thanks for looking into this. I think your analysis helped me nail this.
> > The problem is that we added the use of a subregion and have some
> > assumptions that no longer hold.
> >
> > nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
> > But when the memory region is a subregion, addr holds an offset into the
> > parent container instead. Thus, changing all occurrences of
> > n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
> > (this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
> > in your original patch[1]. The reason my version worked is that there
> > was no subregion involved for the CMB, so the existing address
> > validation calculations were still correct.

> I'm a little bit concerned with this approach:
> (n->bar0.addr + n->ctrl_mem.addr), and I am hoping to have some debate.
> Let me describe my understanding of the problem.
Oh. In the context of your patch I meant bar4 of course, but anyway.

> It looks to me that the addr field sometimes contains an *absolute* address
> (when no hierarchy is used) and at other times a *relative* address (when a
> hierarchy is created). From my perspective the use of this field is
> inconsistent and thus error-prone.
> Because of that I think that doing n->bar0.addr + n->ctrl_mem.addr doesn't
> solve the root problem and is still prone to the same issue if in the
> future we build an even more complicated hierarchy.
> I think that we could solve it by introducing a helper function like
>
>   hwaddr memory_region_get_abs_addr(MemoryRegion *mr)
>
> to retrieve the absolute address, and indicating in the documentation that
> the addr field can be relative or absolute and that it is recommended to
> use the above function to retrieve the absolute address.
> What do you think?

I'm all for a helper - I was not gonna cheer for the quick'n'dirty fix I did
just to convince myself that this was the issue ;)

I think the helper might already be there in memory.c. It's just not
exported:

    static hwaddr memory_region_to_absolute_addr(MemoryRegion *mr,
                                                 hwaddr offset)

> > This leaves us with the Linux kernel complaining about not being able to
> > register the CMB if it is not on a 2MB boundary - this is probably just
> > a constraint in the kernel that we can't do anything about (but I'm no
> > kernel hacker...), which can be fixed by either being "nice" towards the
> > Linux kernel by forcing a 2 MB alignment in the device or by exposing the
> > SZU field such that the user can choose 16MiB size units (or higher)
> > instead of 1MiB. I'm leaning towards ensuring the 2 MB alignment in the
> > device so that we do not have to introduce new cmb_size parameters,
> > while also making it easier for the user to configure. But I'm not
> > really sure.

> This is a kernel limitation that we have to live with. The minimum
> granularity of devm_memremap_pages() is 2MB and it must be 2MB aligned.
> It used to be worse, at 128MB size+alignment (the section size), but the
> sub-section memory-hotplug patches adjusted that to a 2MB section.

Thanks for that explanation!

> Ensuring 2MiB size and alignment in the device emulation makes sense to me.
> Perhaps we could document that limitation - making the user more aware of
> it.

Sounds good to me.