On 12/2/22 15:59, Mike Pattrick wrote:
On Fri, Dec 2, 2022 at 5:37 AM Maxime Coquelin
<maxime.coque...@redhat.com> wrote:



On 12/2/22 11:09, David Marchand wrote:
On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
Shouldn't this be 0x7f instead?
0x3f doesn't enable bit #6, which is responsible for dumping
shared huge pages.  Or am I missing something?

That's a good point; the hugepages may or may not be private. I'll send
in a new one.
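For reference, core(5) assigns bit 6 to shared hugetlb pages, so 0x7f is
0x3f plus that bit. A minimal standalone sketch (not the OVS patch itself)
of setting the filter from inside the process instead of from a shell:

/* Sketch: enable dumping of shared hugetlb pages (bit 6, per core(5))
 * for the current process by writing the filter mask to procfs.
 * 0x3f covers bits 0-5 only; 0x7f also includes shared huge pages. */
#include <stdio.h>

static int
set_coredump_filter(unsigned int mask)
{
        FILE *f = fopen("/proc/self/coredump_filter", "w");

        if (!f)
                return -1;
        fprintf(f, "0x%x", mask);
        return fclose(f);
}

int
main(void)
{
        return set_coredump_filter(0x7f) ? 1 : 0;
}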

OK.  One thing to think about, though, is that we'll also grab
VM memory if we have vhost-user ports, so the core dump size
can become insanely huge.

The downside of not having them is the inability to inspect
virtqueues and related state in the dump.

Did you consider madvise()?

         MADV_DONTDUMP (since Linux 3.4)
                Exclude from a core dump those pages in the range specified
                by addr and length.  This is useful in applications that
                have large areas of memory that are known not to be useful
                in a core dump.  The effect of MADV_DONTDUMP takes precedence
                over the bit mask that is set via the
                /proc/[pid]/coredump_filter file (see core(5)).

         MADV_DODUMP (since Linux 3.4)
                Undo the effect of an earlier MADV_DONTDUMP.
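To illustrate the interaction, a small self-contained sketch (not tied to
the vhost code paths): exclude a whole anonymous mapping from dumps, then
re-include just its first page.

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
        size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);
        size_t len = 64 * 1024 * 1024;
        void *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (region == MAP_FAILED)
                return EXIT_FAILURE;

        /* Exclude the whole region from any core dump; this overrides
         * whatever /proc/[pid]/coredump_filter says. */
        if (madvise(region, len, MADV_DONTDUMP))
                return EXIT_FAILURE;

        /* Re-enable dumping for the first page only, e.g. where the
         * interesting metadata lives. */
        if (madvise(region, pgsz, MADV_DODUMP))
                return EXIT_FAILURE;

        munmap(region, len);
        return EXIT_SUCCESS;
}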

I don't think OVS actually knows the location of the particular VM memory
pages that we do not need.  And dumping virtqueues and related structures
is, probably, the point of this patch (?).

The vhost-user library might have a better idea of which particular parts
of the memory the guest may use for virtqueues and buffers, but I'm not
100% sure.

Yes, distinguishing hugepages of interest is a problem.

Since v20.05, the DPDK memory allocator takes care of excluding (unused)
hugepages from the dump.
So with this OVS patch, if we capture both private and shared hugepages,
the "interesting" DPDK hugepages will get dumped, which is useful for
post-mortem debugging.
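As a rough illustration of that pattern (a simplification under my own
naming, not DPDK's actual allocator code): segments stay excluded from
dumps while they sit in the free pool and are re-included only while they
are handed out.

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Illustration only: toggle the dumpability of a hugepage segment as
 * it moves between the free pool and an allocation, so only memory
 * that is actually in use ends up in a core dump. */
void
segment_on_free(void *seg, size_t len)
{
        (void)madvise(seg, len, MADV_DONTDUMP); /* unused: keep out of dumps */
}

void
segment_on_alloc(void *seg, size_t len)
{
        (void)madvise(seg, len, MADV_DODUMP);   /* in use: include in dumps */
}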

Adding Maxime, who will have a better idea of what is possible for the
guest mapping part.



I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
time; then there are two cases:
    a. vIOMMU = OFF. In this case we could do MADV_DODUMP on the
virtqueues' memory. Doing so, we would have the rings' memory, but not
their buffers (except if they are located on the same hugepages).
    b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE
new entries and MADV_DONTDUMP on invalidated entries. Doing so, we would
get both the vrings and the buffers the backend is allowed to access.

I can prepare a PoC quickly if someone is willing to experiment.
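One practical detail for the draft below: madvise() needs a page-aligned
start address (and, in practice, alignment to the hugepage size for
hugetlb-backed guest memory), while ring and IOTLB entry addresses
generally are not aligned. A hypothetical wrapper (the name and signature
are illustrative, not an existing DPDK API) that aligns the range before
applying the advice:

#define _GNU_SOURCE
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical helper, not existing DPDK code: apply MADV_DODUMP or
 * MADV_DONTDUMP to a range after aligning it to the page size of the
 * backing mapping (pgsz must be a power of two, e.g. the hugepage
 * size for hugepage-backed guest memory). */
int
mem_range_set_dump(void *addr, size_t len, size_t pgsz, bool dump)
{
        uintptr_t start = (uintptr_t)addr & ~(pgsz - 1);
        uintptr_t end = ((uintptr_t)addr + len + pgsz - 1) & ~(pgsz - 1);

        return madvise((void *)start, end - start,
                       dump ? MADV_DODUMP : MADV_DONTDUMP);
}

The raw madvise() calls in the draft below could then go through such a
wrapper, with the page size taken from the memory region the guest
address belongs to.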

A big motivation for this patch is that DPDK becomes a big black hole in
coredumps, preventing even netdev structures from being enumerated.

I threw together a small PoC for this yesterday, only marking vhost
mmaps as DONTDUMP. The result was that a simple OVS configuration with
one vhost port attached to an 8 GB VM dropped the zstd-compressed coredump,
including shared huge pages, from 486.4 MB to 19.8 MB. The resulting
coredump was also significantly more debuggable: all internal OVS
data structures became available, as well as many DPDK ones.
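For anyone reproducing such a PoC, the per-mapping effect can be
sanity-checked without generating a core at all: mappings touched by
MADV_DONTDUMP carry the "dd" flag on the VmFlags line of
/proc/[pid]/smaps. A small sketch that counts them for the current process:

#include <stdio.h>
#include <string.h>

int
main(void)
{
        FILE *f = fopen("/proc/self/smaps", "r");
        char line[512];
        int dontdump = 0;

        if (!f)
                return 1;

        /* Each mapping has one "VmFlags:" line; flags are two-letter
         * codes separated (and terminated) by spaces, "dd" meaning
         * "do not include in core dump". */
        while (fgets(line, sizeof line, f)) {
                if (!strncmp(line, "VmFlags:", 8) && strstr(line, " dd "))
                        dontdump++;
        }
        fclose(f);

        printf("mappings excluded from core dumps: %d\n", dontdump);
        return 0;
}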

I'll look at cleaning things up and submitting to the DPDK mailing list.

Thanks for doing the PoC; that will definitely help debugging!

I have quickly drafted a patch doing the MADV_DODUMP on the areas of
interest for Vhost, as described in my initial reply. Could you give it a
try? (Disclaimer: it is not even build-tested.)

Maxime

=========================================================================

diff --git a/lib/vhost/iotlb.c b/lib/vhost/iotlb.c
index 6a729e8804..555b4ebba0 100644
--- a/lib/vhost/iotlb.c
+++ b/lib/vhost/iotlb.c
@@ -149,6 +149,7 @@ vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
        rte_rwlock_write_lock(&vq->iotlb_lock);

        RTE_TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+               madvise(node->uaddr, node->size, MADV_DODUMP);
                TAILQ_REMOVE(&vq->iotlb_list, node, next);
                vhost_user_iotlb_pool_put(vq, node);
        }
@@ -170,6 +171,7 @@ vhost_user_iotlb_cache_random_evict(struct vhost_virtqueue *vq)

        RTE_TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
                if (!entry_idx) {
+                       madvise(node->uaddr, node->size, MADV_DODUMP);
                        TAILQ_REMOVE(&vq->iotlb_list, node, next);
                        vhost_user_iotlb_pool_put(vq, node);
                        vq->iotlb_cache_nr--;
@@ -222,12 +224,14 @@ vhost_user_iotlb_cache_insert(struct virtio_net *dev, struct vhost_virtqueue *vq
                        vhost_user_iotlb_pool_put(vq, new_node);
                        goto unlock;
                } else if (node->iova > new_node->iova) {
+                       madvise(new_node->uaddr, new_node->size, MADV_DODUMP);
                        TAILQ_INSERT_BEFORE(node, new_node, next);
                        vq->iotlb_cache_nr++;
                        goto unlock;
                }
        }

+       madvise(new_node->uaddr, new_node->size, MADV_DODUMP);
        TAILQ_INSERT_TAIL(&vq->iotlb_list, new_node, next);
        vq->iotlb_cache_nr++;

@@ -255,6 +259,7 @@ vhost_user_iotlb_cache_remove(struct vhost_virtqueue *vq,
                        break;

                if (iova < node->iova + node->size) {
+                       madvise(node->uaddr, node->size, MADV_DODUMP);
                        TAILQ_REMOVE(&vq->iotlb_list, node, next);
                        vhost_user_iotlb_pool_put(vq, node);
                        vq->iotlb_cache_nr--;
diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
index 371d6304d6..208377ddc6 100644
--- a/lib/vhost/vhost_user.c
+++ b/lib/vhost/vhost_user.c
@@ -767,6 +767,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
                        return;
                }

+               madvise(vq->desc_packed, len, MADV_DODUMP);
+
                numa_realloc(&dev, &vq);
                *pdev = dev;
                *pvq = vq;
@@ -782,6 +784,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
                        return;
                }

+               madvise(vq->driver_event, len, MADV_DODUMP);
+
                len = sizeof(struct vring_packed_desc_event);
                vq->device_event = (struct vring_packed_desc_event *)
                                        (uintptr_t)ring_addr_to_vva(dev,
@@ -793,6 +797,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
                        return;
                }

+               madvise(vq->device_event, len, MADV_DODUMP);
+
                vq->access_ok = true;
                return;
        }
@@ -809,6 +815,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
                return;
        }

+       madvise(vq->desc, len, MADV_DODUMP);
+
        numa_realloc(&dev, &vq);
        *pdev = dev;
        *pvq = vq;
@@ -824,6 +832,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
                return;
        }

+       madvise(vq->avail, len, MADV_DODUMP);
+
        len = sizeof(struct vring_used) +
                sizeof(struct vring_used_elem) * vq->size;
        if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX))
@@ -836,6 +846,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
                return;
        }

+       madvise(vq->used, len, MADV_DODUMP);
+
        if (vq->last_used_idx != vq->used->idx) {
                VHOST_LOG_CONFIG(dev->ifname, WARNING,
"last_used_idx (%u) and vq->used->idx (%u) mismatches;\n",
