On 7/22/20 10:21 AM, Klaus Jensen wrote:
> On Jul 22 10:00, Andrzej Jakowski wrote:
>> On 7/22/20 12:43 AM, Klaus Jensen wrote:
>>> @keith, please see below - can you comment on the Linux kernel 2 MB
>>> boundary requirement for the CMB? Or should we hail Stephen (or Logan
>>> maybe) since this seems to be related to p2pdma?
>>>
>>> On Jul 21 14:54, Andrzej Jakowski wrote:
>>>> On 7/15/20 1:06 AM, Klaus Jensen wrote:
>>>>> Hi Andrzej,
>>>>>
>>>>> I've not been ignoring this, but sorry for not following up earlier.
>>>>>
>>>>> I'm hesitant to merge anything that very obviously breaks an OS that
>>>>> we know is used a lot with this device. Also because the issue has
>>>>> not been analyzed well enough to actually know whether this is a QEMU
>>>>> or a kernel issue.
>>>>
>>>> Hi Klaus,
>>>>
>>>> Thx for your response! I understand your hesitance to merge stuff that
>>>> obviously breaks the guest OS.
>>>>
>>>>>
>>>>> Now, as far as I can test, having the MSI-X vector table and PBA in
>>>>> BAR 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
>>>>> (regardless of IOMMU on/off).
>>>>>
>>>>> Later, when the issue is better understood, we can add options to set
>>>>> offsets, BIRs etc.
>>>>>
>>>>> The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr
>>>>> to be enabled" (but still requires "[PATCH v4 1/2] ...") and applies
>>>>> to the git://git.infradead.org/qemu-nvme.git nvme-next branch.
>>>>>
>>>>> Can you reproduce the issues with that patch? I can't on a stock Arch
>>>>> Linux 5.7.5-arch1-1 kernel.
>>>>
>>>> While I'm happy that the approach with MSI-X and PBA in BAR0 works
>>>> fine, I feel that the investigation of why it works while mine doesn't
>>>> is missing. It looks to me that both patches follow basically the same
>>>> approach: create a memory subregion and overlay it on top of another
>>>> memory region. Why does one work and the other doesn't, then?
>>>>
>>>> With that in mind, I have recently focused on understanding the
>>>> problem.
>>>> I observed that when the guest assigns an address to BAR4, the addr
>>>> field in the nvme-bar4 memory region gets populated, but it doesn't
>>>> get automatically populated in the ctrl_mem (CMB) memory subregion,
>>>> so later when nvme_addr_is_cmb() is called the address check works
>>>> incorrectly and as a consequence the VMM does a DMA read instead of
>>>> a memcpy.
>>>> I created a patch that sets the correct address on the ctrl_mem
>>>> subregion and the guest OS boots up correctly.
>>>>
>>>> When I looked into the PCI and memory region code I noticed that the
>>>> address is indeed only assigned to the top-level memory region but
>>>> not to the contained subregions. I think that because in your
>>>> approach the CMB grabs the whole BAR exclusively, it works fine.
>>>>
>>>> Here is my question (perhaps PCI folks can help answer :)): if we
>>>> consider memory region overlapping for PCI devices a valid use case,
>>>> should the PCI code on configuration cycles walk through all
>>>> contained subregions and update the addr field accordingly?
>>>>
>>>> Thx!
>>>>
>>>
>>> Hi Andrzej,
>>>
>>> Thanks for looking into this. I think your analysis helped me nail
>>> this. The problem is that we added the use of a subregion and have
>>> some assumptions that no longer hold.
>>>
>>> nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute
>>> address. But when the memory region is a subregion, addr holds an
>>> offset into the parent container instead. Thus, changing all
>>> occurrences of n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr)
>>> fixes the issue (this is required in nvme_addr_is_cmb and
>>> nvme_map_prp). I patched that in your original patch[1]. The reason
>>> my version worked is that there was no subregion involved for the
>>> CMB, so the existing address validation calculations were still
>>> correct.
>>
>> I'm a little bit concerned with this approach:
>> (n->bar0.addr + n->ctrl_mem.addr) and hoping to have some debate. Let
>> me describe my understanding of the problem.
>
> Oh.
In the context of your patch I meant bar4 of course, but anyway.
>
>> It looks to me that the addr field sometimes contains an *absolute*
>> address (when no hierarchy is used) and other times it contains a
>> *relative* address (when a hierarchy is created). From my perspective
>> the use of this field is inconsistent and thus error-prone.
>> Because of that I think that doing n->bar0.addr + n->ctrl_mem.addr
>> doesn't solve the root problem and is still prone to the same problem
>> if in the future we build an even more complicated hierarchy.
>> I think that we could solve it by introducing a helper function like
>>
>> hwaddr memory_region_get_abs_addr(MemoryRegion *mr)
>>
>> to retrieve the absolute address, and indicate in the documentation
>> that the addr field can be relative or absolute and that it is
>> recommended to use the above function to retrieve the absolute
>> address.
>> What do you think?
>>
>
> I'm all for a helper - I was not gonna cheer for the quick'n'dirty fix
> I did just to convince myself that this was the issue ;)
>
> I think the helper might already be there in memory.c. It's just not
> exported.
>
> static hwaddr memory_region_to_absolute_addr(MemoryRegion *mr,
>                                              hwaddr offset)
Awesome! I didn't notice. Let me check and repost the series soon :)
>
>>>
>>> This leaves us with the Linux kernel complaining about not being able
>>> to register the CMB if it is not on a 2MB boundary - this is probably
>>> just a constraint in the kernel that we can't do anything about (but
>>> I'm no kernel hacker...). It can be fixed either by being "nice"
>>> towards the Linux kernel and forcing a 2 MB alignment in the device,
>>> or by exposing the SZU field such that the user can choose 16MiB size
>>> units (or higher) instead of 1MiB. I'm leaning towards ensuring the
>>> 2 MB alignment in the device, such that we do not have to introduce
>>> new cmb_size parameters, while also making it easier for the user to
>>> configure. But I'm not really sure.
>>
>> This is a kernel limitation that we have to live with. The minimum
>> granularity of devm_memremap_pages() is 2MB and it must be 2MB
>> aligned. It used to be worse at 128MB size+alignment (section size),
>> but the sub-section memory-hotplug patches adjusted that to 2MB.
>
> Thanks for that explanation!
>
>> Ensuring 2MiB size and alignment in the device emulation makes sense
>> to me. Perhaps we could document that limitation - making the user
>> more aware of it.
>>
>
> Sounds good to me.
>