On Thu, Feb 15, 2024 at 4:29 PM Jonathan Cameron <jonathan.came...@huawei.com> wrote:

> On Thu, 8 Feb 2024 14:50:59 +0000
> Jonathan Cameron <jonathan.came...@huawei.com> wrote:
>
> > On Wed, 7 Feb 2024 17:34:15 +0000
> > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> >
> > > On Fri, 2 Feb 2024 16:56:18 +0000
> > > Peter Maydell <peter.mayd...@linaro.org> wrote:
> > >
> > > > On Fri, 2 Feb 2024 at 16:50, Gregory Price <gregory.pr...@memverge.com> wrote:
> > > > >
> > > > > On Fri, Feb 02, 2024 at 04:33:20PM +0000, Peter Maydell wrote:
>
> > > > > > Here we are trying to take an interrupt. This isn't related to
> > > > > > the other can_do_io stuff, it's happening because do_ld_mmio_beN
> > > > > > assumes it's called with the BQL not held, but in fact there are
> > > > > > some situations where we call into the memory subsystem and we
> > > > > > do already have the BQL.
> > > >
> > > > > It's bugs all the way down as usual!
> > > > > https://xkcd.com/1416/
> > > > >
> > > > > I'll dig in a little next week to see if there's an easy fix. We
> > > > > can see the return address is already 0 going into mmu_translate,
> > > > > so it does look unrelated to the patch I threw out - but probably
> > > > > still has to do with things being on IO.
> > > >
> > > > Yes, the low level memory accessors only need to take the BQL if
> > > > the thing being accessed is an MMIO device. Probably what is wanted
> > > > is for those functions to do "take the lock if we don't already
> > > > have it", something like hw/core/cpu-common.c:cpu_reset_interrupt()
> > > > does.
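
For reference, the "take the lock only if we don't already hold it" idiom
Peter points at looks roughly like the sketch below. This is a hypothetical
helper (the function name is made up), modelled on
hw/core/cpu-common.c:cpu_reset_interrupt() and using the pre-bql_lock()
spelling of the BQL helpers:

#include "qemu/osdep.h"
#include "qemu/main-loop.h"

/* Hypothetical sketch, not an actual patch: run an MMIO access callback
 * with the BQL held, acquiring it only if the caller doesn't already
 * hold it, in the style of cpu_reset_interrupt(). */
static void run_mmio_access_locked(void (*access)(void *opaque), void *opaque)
{
    bool need_lock = !qemu_mutex_iothread_locked();

    if (need_lock) {
        qemu_mutex_lock_iothread();
    }
    access(opaque);              /* the actual MMIO read/write */
    if (need_lock) {
        qemu_mutex_unlock_iothread();
    }
}

The per-call branch is cheap next to an MMIO dispatch, which is presumably
why the same pattern is acceptable in cpu_reset_interrupt().
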
> >
> > Got back to x86 testing and indeed not taking the lock in that one path
> > does get things running (with all Gregory's earlier hacks + DMA limits as
> > described below).  Guess it's time to roll some cleaned up patches and
> > see how much everyone screams :)
> >
>
> 3 series sent out:
> (all also on gitlab.com/jic23/qemu cxl-2024-02-15, though I updated patch
> descriptions a little after pushing that out)
>
> Main set of fixes (x86 'works' under my light testing after this one)
>
> https://lore.kernel.org/qemu-devel/20240215150133.2088-1-jonathan.came...@huawei.com/
>
> ARM FEAT_HADFS (access and dirty bit updating in PTW) workaround for
> missing atomic CAS
>
> https://lore.kernel.org/qemu-devel/20240215151804.2426-1-jonathan.came...@huawei.com/T/#t
>
> DMA / virtio fix:
>
> https://lore.kernel.org/qemu-devel/20240215142817.1904-1-jonathan.came...@huawei.com/
>
> Last thing I need to do is propose a suitable flag to make Mattias'
> bounce buffering size parameter apply to the "memory" address space.


For background, an earlier version of my series actually had a global
bounce buffer size parameter that applied to all address spaces. After
discussion on the list, we settled on an address-space-specific parameter
so it can be configured per device. I haven't looked into where the memory
accesses in your context originate from - can they be attributed to a
specific entity that could house the parameter?
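
Purely as a hypothetical illustration of where such a flag could live (not
a proposal for the actual interface): one option would be an experimental
machine or global property whose setter writes the max_bounce_buffer_size
of address_space_memory once it exists. The property name and plumbing
below are invented; only the max_bounce_buffer_size field comes from the
series under discussion.

#include "qemu/osdep.h"
#include "qom/object.h"
#include "qapi/error.h"
#include "qapi/visitor.h"
#include "exec/address-spaces.h"

/* Hypothetical setter for an imagined "x-memory-bounce-buffer-size"
 * property.  Assumes it runs after memory_map_init() has created
 * address_space_memory, and that the AddressSpace max_bounce_buffer_size
 * field from the bounce-buffering series is present. */
static void set_x_mem_bounce_limit(Object *obj, Visitor *v, const char *name,
                                   void *opaque, Error **errp)
{
    uint64_t value;

    if (!visit_type_size(v, name, &value, errp)) {
        return;
    }
    address_space_memory.max_bounce_buffer_size = value;
}

Whether the machine object (or something CXL-specific) is the right entity
to own such a property is exactly the attribution question above.
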


> Currently I'm carrying this (I've no idea how much is needed, but it's
> somewhere between 4k and 1G):
>
> diff --git a/system/physmem.c b/system/physmem.c
> index 43b37942cf..49b961c7a5 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2557,6 +2557,7 @@ static void memory_map_init(void)
>      memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>      address_space_init(&address_space_memory, system_memory, "memory");
>
> +    address_space_memory.max_bounce_buffer_size = 1024 * 1024 * 1024;
>      system_io = g_malloc(sizeof(*system_io));
>      memory_region_init_io(system_io, NULL, &unassigned_io_ops, NULL, "io",
>                            65536);
>
> Please take a look. These are all in areas of QEMU I'm not particularly
> confident about, so I'm relying on nice people giving feedback even more
> than normal!
>
> Thanks to all those who helped with debugging and suggestions.
>
> Thanks,
>
> Jonathan
>
> > Jonathan
> >
> >
> > > >
> > > > -- PMM
> > >
> > > Still a work in progress, but I thought I'd give an update on some of
> > > the fun...
> > >
> > > I have a set of somewhat dubious workarounds that sort of do the job
> > > (where the aim is to be able to safely run any workload on top of any
> > > valid emulated CXL device setup).
> > >
> > > To recap, the issue is that for CXL memory interleaving we need
> > > fine-grained routing to each device (16k max granularity).  That was
> > > fine whilst pretty much all the testing was DAX based, so software
> > > wasn't running out of it.  Now that the kernel is rather more
> > > aggressive in defaulting any volatile CXL memory it finds to being
> > > normal memory (in some configs anyway), people have started hitting
> > > problems. Given one of the most important functions of the emulation
> > > is to check data ends up in the right backing stores, I'm not keen to
> > > drop that feature unless we absolutely have to.
> > >
> > > 1) For the simple case of no interleave I have working code that just
> > >    shoves the MemoryRegion in directly and all works fine.  That was
> > >    always on the todo list for virtualization cases anyway, where we
> > >    pretend the underlying devices aren't interleaved and frig the
> > >    reported perf numbers to present aggregate performance etc.  I'll
> > >    tidy this up and post it.
> > >    We may want a config parameter to 'reject' address decoder
> > >    programming that would result in interleave - it's not remotely
> > >    spec compliant, but meh, it will make it easier to understand.
> > >    For the virt case we'll probably present locked-down decoders (as
> > >    if firmware has set them up), but for emulation that limits
> > >    usefulness too much.
> > >
> > > 2) Unfortunately, for the interleaved case we can't just add a lot of
> > >    memory regions, because even at the highest granularity (16k) and
> > >    minimum size (512MiB) it takes forever and eventually runs into an
> > >    assert in phys_section_add with the comment:
> > >    "The physical section number is ORed with a page-aligned
> > >     pointer to produce the iotlb entries.  Thus it should
> > >     never overflow into the page-aligned value."
> > >    That sounds hard to 'fix', though I've not looked into it (rough
> > >    numbers below).
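
To put rough numbers on that: the assert effectively caps the number of
physical sections at TARGET_PAGE_SIZE entries (4096 with 4 KiB target
pages - treat the exact cap as an assumption), while one minimum-size
interleaved window at maximum granularity already wants far more:

/* Back-of-the-envelope only; standalone arithmetic, not QEMU code. */
#include <stdio.h>

int main(void)
{
    unsigned long long window_bytes = 512ULL << 20;  /* 512 MiB window    */
    unsigned long long granule      = 16ULL << 10;   /* 16 KiB interleave */
    unsigned long long assumed_cap  = 4096;          /* ~TARGET_PAGE_SIZE */

    /* One MemoryRegion/section per 16 KiB chunk of the window. */
    unsigned long long sections = window_bytes / granule;
    printf("sections needed: %llu, assumed cap: %llu\n",
           sections, assumed_cap);
    return 0;
}

That is 32768 sections against a cap of roughly 4096, before counting
anything else in the machine, so hitting the assert is expected rather
than a corner case.
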
> > >
> > > So back to plan (A): papering over the cracks with TCG.
> > >
> > > I've focused on arm64, which seems a bit easier than x86 (and is
> > > arguably part of my day job).
> > >
> > > Challenges
> > > 1) The atomic updates of accessed and dirty bits in arm_casq_ptw()
> > >    fail because we don't have a proper address to do them on.
> > >    However, there is precedent for non-atomic updates in there
> > >    already (used when the host system doesn't support a big enough
> > >    CAS), and I think we can do something similar under the BQL for
> > >    this case (a rough sketch follows this list).  Not 100% sure I'm
> > >    writing to the correct address, but a simple frig superficially
> > >    appears to work.
> > > 2) Emulated devices try to do DMA to buffers in the emulated CXL
> > >    interleaved memory (virtio_blk for example).  We can't do that,
> > >    because there is no actual translation available - just read and
> > >    write functions.
> > >
> > >    So it should be easy to avoid, as we know how to handle DMA
> > >    limitations: just set the max DMA address width to 40 bits (so
> > >    below the CXL Fixed Memory Windows) and rely on Linux to bounce
> > >    buffer with swiotlb.  For a while I couldn't work out why changing
> > >    the IORT to provide this didn't work and I saw errors for
> > >    virtio-pci-blk.  So digging ensued.
> > >    Virtio devices by default (sort of) bypass the DMA API in Linux -
> > >    see vring_use_dma_api().  That is reasonable from the translation
> > >    point of view, but not for the DMA limits (and the resulting need
> > >    to use bounce buffers).  Maybe we could put a sanity check in
> > >    Linux on no IOMMU + a DMA restriction to below 64 bits, but I'm
> > >    not 100% sure we wouldn't break other platforms.
> > >    Alternatively, just use an emulated real device and all seems fine
> > >    - I've tested with nvme.
> > >
> > > 3) I need to fix the kernel handling of NUMA nodes originating from
> > >    CXL CDAT tables on ARM64.  For now I have a hack in place so I can
> > >    make sure I hit the memory I intend to when testing.  I suspect we
> > >    need some significant work to sort this out properly.
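
Regarding challenge 1), the shape of such a "non-atomic update under the
BQL" fallback for descriptors that are only reachable via MMIO-style
read/write functions might be roughly as below. Sketch only: the function
name is invented, and it simply uses the generic address_space_*q
accessors rather than whatever arm_casq_ptw() ends up doing in the real
series.

#include "qemu/osdep.h"
#include "exec/memory.h"

/* Non-atomic compare-and-swap of a 64-bit descriptor living behind
 * read/write ops.  Caller is assumed to hold the BQL (taking it
 * conditionally, as in the earlier sketch, if it might not); this is
 * still NOT atomic against concurrent walkers, mirroring the existing
 * non-atomic fallback used when the host lacks a wide enough CAS. */
static uint64_t ptw_casq_mmio_fallback(AddressSpace *as, hwaddr pa,
                                       uint64_t old_val, uint64_t new_val,
                                       MemTxAttrs attrs)
{
    MemTxResult res = MEMTX_OK;
    uint64_t cur = address_space_ldq_le(as, pa, attrs, &res);

    if (res == MEMTX_OK && cur == old_val) {
        address_space_stq_le(as, pa, new_val, attrs, &res);
    }
    return cur;  /* cur != old_val (or res != MEMTX_OK) means the CAS failed */
}
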
> > >
> > > Suggestions for other approaches would definitely be welcome!
> > >
> > > Jonathan
> > >
> >
> >
>
>
